This change simply documents the existing "scripted mode" option in
both command help and man page.
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#7798
This change allows the arcstat.py script to handle unsupported options
gracefully and print both error and usage messages when one such option
is provided.
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#7799
When running ztest, never suspend the pool due to failed or delayed
MMP writes.
There are many sources of long delays within ztest, such as device
opens, closes, etc. which in combination, may delay MMP writes too
long and cause MMP to suspend the pool.
Some of these delays also affect real pools, and should be fixed.
That is being worked separately.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes#7776
- Add two new module parameters to icp (icp_aes_impl, icp_gcm_impl)
that control the crypto implementation. At the moment there is a
choice between generic and aesni (on platforms that support it).
- This enables support for AES-NI and PCLMULQDQ-NI on AMD Family
15h (bulldozer) and newer CPUs (zen).
- Modify aes_key_t to track what implementation it was generated
with as key schedules generated with various implementations
are not necessarily interchangable.
Reviewed by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Nathaniel R. Lewis <linux.robotdude@gmail.com>
Closes#7102Closes#7103
= Motivation
While dealing with another performance issue (see 126118f) we noticed
that we spend a lot of time in various places in the kernel when
constructing long nvlists. The problem is that when an nvlist is created
with the NV_UNIQUE_NAME set (which is the case most of the time), we do
a linear search through the whole list to ensure uniqueness for every
entry we add.
An example of the above scenario can be seen in the following
flamegraph, where more than have the time of the zfsdev_ioctl() is spent
on constructing nvlists. Flamegraph:
https://sdimitro.github.io/img/flame/sdimitro_snap_unmount3.svg
Adding a table to speed up lookups will help situations where we just
construct an nvlist (like the scenario above), in addition to regular
lookups and removals.
= What this patch does
In this diff we've implemented a hash-table on top of the nvlist code
that converts most nvlist operations from O(# number of entries) to
O(1)* (the start is for amortized time as the hash-table grows and
shrinks depending on the # of entries - plain lookup is strictly O(1)).
= Performance Analysis
To analyze the performance improvement I just used the setup from the
snapshot deletion issue mentioned above in the Motivation section.
Basically I created 10K filesystems with one snapshot each and then I
just used the API of libZFS_Core to pass down an nvlist of all the
snapshots to have them deleted. The reason I used my own driver program
was to have clean performance results of what actually happens in the
kernel. The flamegraphs and wall clock times mentioned below were
gathered from the start to the end of the driver program's run. Between
trials the testpool used was completely destroyed, the system was
rebooted and the testpool was completely recreated. The reason for this
dance was to get consistent results.
== Results (before patch):
=== Sampling Flamegraphs
[Trial 1] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A.svg
[Trial 2] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A2.svg
[Trial 3] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A3.svg
=== Wall clock times (in seconds)
```
[Trial 4]
real 5.3
user 0.4
sys 2.3
[Trial 5]
real 8.2
user 0.4
sys 2.4
[Trial 6]
real 6.0
user 0.5
sys 2.3
```
== Results (after patch):
=== Sampling Flamegraphs
[Trial 1] https://sdimitro.github.io/img/flame/DLPX-53417/trial-Ae.svg
[Trial 2] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A2e.svg
[Trial 3] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A3e.svg
=== Wall clock times (in seconds)
```
[Trial 4]
real 4.9
user 0.0
sys 0.9
[Trial 5]
real 3.8
user 0.0
sys 0.9
[Trial 6]
real 3.6
user 0.0
sys 0.9
```
== Analysis
The results between the trials are consistent so in this sections I will
only talk about the flamegraph results from trial-1 and the wall-clock
results from trial-4.
From trial-1 we can see that zfs_dev_ioctl() goes from 2,331 to 996
samples counts. Specifically, the samples from fnvlist_add_nvlist() and
spa_history_log_nvl() are almost gone (~500 & ~800 to 5 & 5 samples),
leaving zfs_ioc_destroy_snaps() to dominate most samples from
zfs_dev_ioctl().
From trial-4 we see that the user time dropped to 0 secods. I believe
the consistent 0.4 seconds before my patch was applied was due to my
driver program constructing the long nvlist of snapshots so it can pass
it to the kernel. As for the system time, the effect there is more clear
(2.3 down to 0.9 seconds).
Porting Notes:
* DATA_TYPE_DONTCARE case added to switch in fm_nvprintr() and
zpool_do_events_nvprint().
Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/9580
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b5eca7b1Closes#7748
While the autoexpand property may seem like a small feature it
depends on a significant amount of system infrastructure. Enough
of that infrastructure is now in place that with a few modifications
for Linux it can be supported.
Auto-expand works as follows; when a block device is modified
(re-sized, closed after being open r/w, etc) a change uevent is
generated for udev. The ZED, which is monitoring udev events,
passes the change event along to zfs_deliver_dle() if the disk
or partition contains a zfs_member as identified by blkid.
From here the device is matched against all imported pool vdevs
using the vdev_guid which was read from the label by blkid. If
a match is found the ZED reopens the pool vdev. This re-opening
is important because it allows the vdev to be briefly closed so
the disk partition table can be re-read. Otherwise, it wouldn't
be possible to report the maximum possible expansion size.
Finally, if the property autoexpand=on a vdev expansion will be
attempted. After performing some sanity checks on the disk to
verify that it is safe to expand, the primary partition (-part1)
will be expanded and the partition table updated. The partition
is then re-opened (again) to detect the updated size which allows
the new capacity to be used.
In order to make all of the above possible the following changes
were required:
* Updated the zpool_expand_001_pos and zpool_expand_003_pos tests.
These tests now create a pool which is layered on a loopback,
scsi_debug, and file vdev. This allows for testing of non-
partitioned block device (loopback), a partition block device
(scsi_debug), and a file which does not receive udev change
events. This provided for better test coverage, and by removing
the layering on ZFS volumes there issues surrounding layering
one pool on another are avoided.
* zpool_find_vdev_by_physpath() updated to accept a vdev guid.
This allows for matching by guid rather than path which is a
more reliable way for the ZED to reference a vdev.
* Fixed zfs_zevent_wait() signal handling which could result
in the ZED spinning when a signal was not handled.
* Removed vdev_disk_rrpart() functionality which can be abandoned
in favor of kernel provided blkdev_reread_part() function.
* Added a rwlock which is held as a writer while a disk is being
reopened. This is important to prevent errors from occurring
for any configuration related IOs which bypass the SCL_ZIO lock.
The zpool_reopen_007_pos.ksh test case was added to verify IO
error are never observed when reopening. This is not expected
to impact IO performance.
Additional fixes which aren't critical but were discovered and
resolved in the course of developing this functionality.
* Added PHYS_PATH="/dev/zvol/dataset" to the vdev configuration for
ZFS volumes. This is as good as a unique physical path, while the
volumes are not used in the test cases anymore for other reasons
this improvement was included.
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#120Closes#2437Closes#5771Closes#7366Closes#7582Closes#7629
When running zdb without additional arguments against a pool containing
a checkpoint the entire checkpoint spacemap should not be dumped. Make
this behavior conditional upon passing the -mmmm option as described in
the zdb(8) man page.
-mmmm Display every spacemap record.
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7702
Add a default 4 KiB ashift for Amazon EC2 NVMe devices on instances with
NVMe ephemeral devices, such as the types c5d, f1, i3 and m5d.
As per the official documentation [1] a 4096 byte blocksize should be
used to match the underlying hardware.
The string was identified via:
$ sudo sginfo -M /dev/nvme0n1
INQUIRY response (cmd: 0x12)
----------------------------
Device Type 0
Vendor: NVMe
Product: Amazon EC2 NVMe
Revision level:
$ lsblk -io KNAME,TYPE,SIZE,MODEL
KNAME TYPE SIZE MODEL
nvme0n1 disk 442.4G Amazon EC2 NVMe Instance Storage
[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/
storage-optimized-instances.html
Retrived 2018-07-03
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Troels Nørgaard <tnn@tradeshift.com>
Closes#7676
Motivation
==========
The current space map encoding has the following disadvantages:
[1] Assuming 512 sector size each entry can represent at most 16MB for a segment.
This makes the encoding very inefficient for large regions of space.
[2] As vdev-wide space maps have started to be used by new features (i.e.
device removal, zpool checkpoint) we've started imposing limits in the
vdevs that can be used with them based on the maximum addressable offset
(currently 64PB for a top-level vdev).
New encoding
============
The layout can be found at space_map.h and it remains backwards compatible with
the old one. The introduced two-word entry format, besides extending the limits
imposed by the single-entry layout, also includes a vdev field and some extra
padding after its prefix.
The extra padding after the prefix should is reserved for future usage (e.g.
new prefixes for future encodings or new fields for flags). The new vdev field
not only makes the space maps more self-descriptive, but also opens the doors
for pool-wide space maps (expected to be used in the log spacemap project).
One final important note is that the number of bits used for vdevs is reduced
to 24 bits for blkptrs. That was decided as we don't know of any setups that
use more than 16M vdevs for the time being and we wanted to fit the vdev field
in the space map. In addition that gives us some extra bits in dva_t.
Other references:
=================
The new encoding is also discussed towards the end of the Log Space Map
presentation from 2017's OpenZFS summit.
Link: https://www.youtube.com/watch?v=jj2IxRkl5bQ
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/90a56e6d
OpenZFS-issue: https://www.illumos.org/issues/9238Closes#7665
Details about the motivation of this feature and its usage can
be found in this blogpost:
https://sdimitro.github.io/post/zpool-checkpoint/
A lightning talk of this feature can be found here:
https://www.youtube.com/watch?v=fPQA8K40jAM
Implementation details can be found in big block comment of
spa_checkpoint.c
Side-changes that are relevant to this commit but not explained
elsewhere:
* renames members of "struct metaslab trees to be shorter without
losing meaning
* space_map_{alloc,truncate}() accept a block size as a
parameter. The reason is that in the current state all space
maps that we allocate through the DMU use a global tunable
(space_map_blksz) which defauls to 4KB. This is ok for metaslab
space maps in terms of bandwirdth since they are scattered all
over the disk. But for other space maps this default is probably
not what we want. Examples are device removal's vdev_obsolete_sm
or vdev_chedkpoint_sm from this review. Both of these have a
1:1 relationship with each vdev and could benefit from a bigger
block size.
Porting notes:
* The part of dsl_scan_sync() which handles async destroys has
been moved into the new dsl_process_async_destroys() function.
* Remove "VERIFY(!(flags & FWRITE))" in "kernel.c" so zhack can write
to block device backed pools.
* ZTS:
* Fix get_txg() in zpool_sync_001_pos due to "checkpoint_txg".
* Don't use large dd block sizes on /dev/urandom under Linux in
checkpoint_capacity.
* Adopt Delphix-OS's setting of 4 (spa_asize_inflation =
SPA_DVAS_PER_BP + 1) for the checkpoint_capacity test to speed
its attempts to fill the pool
* Create the base and nested pools with sync=disabled to speed up
the "setup" phase.
* Clear labels in test pool between checkpoint tests to avoid
duplicate pool issues.
* The import_rewind_device_replaced test has been marked as "known
to fail" for the reasons listed in its DISCLAIMER.
* New module parameters:
zfs_spa_discard_memory_limit,
zfs_remove_max_bytes_pause (not documented - debugging only)
vdev_max_ms_count (formerly metaslabs_per_vdev)
vdev_min_ms_count
Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9166
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7159fdb8Closes#7570
This patch adds tunables for modifying the maximum memory limit and
maximum instruction limit that can be specified when running a channel
program.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov
Reviewed-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
External-issue: LX-1085
Closes#7618
Commit 2ffd89fc allowed two new errors to be reported by zil_reset()
in order to provide a descriptive error message regarding why a log
device could not be removed. However, the new return values were
not handled in the ztest_vdev_add_remove() test case resulting in
ztest failures during automated testing.
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7630
zpool and zed place scripts in subdirectories of libexecdir. Some
distributions locate architecture independent scripts in other locations
(e.g. Debian). To avoid these paths getting out of sync, centralize the
definitions.
Build zfs-test's default.cfg by Makefile. Use the new directory
logic building tests/zfs-tests/include/default.cfg.in.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com>
Closes#7597
Currently, during a recursive zfs destroy the first error that is
encountered will stop the destruction of the datasets. Errors may
happen for a variety of reasons including competing deletions
and busy datasets.
This patch switches recursive destroy to always do a best-effort
recursive dataset destroy.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes#7574
1. Add a proc entry to display the pool's state:
$ cat /proc/spl/kstat/zfs/tank/state
ONLINE
This is done without using the spa config locks, so it will
never hang.
2. Fix 'zpool status' and 'zpool list -o health' output to print
"SUSPENDED" instead of "ONLINE" for suspended pools.
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#7331Closes#7563
The only remaining consumer of the rwlock compatibility wrappers
is ztest. Remove the wrappers and convert the few remaining
calls to the underlying pthread functions.
rwlock_init() -> pthread_rwlock_init()
rwlock_destroy() -> pthread_rwlock_destroy()
rw_rdlock() -> pthread_rwlock_rdlock()
rw_wrlock() -> pthread_rwlock_wrlock()
rw_unlock() -> pthread_rwlock_unlock()
Note pthread_rwlock_init() defaults to PTHREAD_PROCESS_PRIVATE
which is equivilant to the USYNC_THREAD behavior. There is no
functional change.
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7591
We want to be able to pass various settings during import/open of a
pool, which are not only related to rewind. Instead of adding a new
policy and duplicate a bunch of code, we should just rename
rewind_policy to a more generic term like load_policy.
For instance, we'd like to set spa->spa_import_flags from the nvlist,
rather from a flags parameter passed to spa_import as in some cases we
want those flags not only for the import case, but also for the open
case. One such flag could be ZFS_IMPORT_MISSING_LOG (as used in zdb)
which would allow zfs to open a pool when logs are missing.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://illumos.org/issues/9235
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d2b1e44Closes#7532
Minimal changes required to integrate the SPL sources in to the
ZFS repository build infrastructure and packaging.
Build system and packaging:
* Renamed SPL_* autoconf m4 macros to ZFS_*.
* Removed redundant SPL_* autoconf m4 macros.
* Updated the RPM spec files to remove SPL package dependency.
* The zfs package obsoletes the spl package, and the zfs-kmod
package obsoletes the spl-kmod package.
* The zfs-kmod-devel* packages were updated to add compatibility
symlinks under /usr/src/spl-x.y.z until all dependent packages
can be updated. They will be removed in a future release.
* Updated copy-builtin script for in-kernel builds.
* Updated DKMS package to include the spl.ko.
* Updated stale AUTHORS file to include all contributors.
* Updated stale COPYRIGHT and included the SPL as an exception.
* Renamed README.markdown to README.md
* Renamed OPENSOLARIS.LICENSE to LICENSE.
* Renamed DISCLAIMER to NOTICE.
Required code changes:
* Removed redundant HAVE_SPL macro.
* Removed _BOOT from nvpairs since it doesn't apply for Linux.
* Initial header cleanup (removal of empty headers, refactoring).
* Remove SPL repository clone/build from zimport.sh.
* Use of DEFINE_RATELIMIT_STATE and DEFINE_SPINLOCK removed due
to build issues when forcing C99 compilation.
* Replaced legacy ACCESS_ONCE with READ_ONCE.
* Include needed headers for `current` and `EXPORT_SYMBOL`.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
TEST_ZIMPORT_SKIP="yes"
Closes#7556
16MB alloc in zdb_embedded_block() can cause cores in certain
situations (clang, gcc55).
Authored by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Porting Notes:
* Replaces an equivalent fix previously made for Linux.
OpenZFS-issue: https://illumos.org/issues/9523
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/2c1964aCloses#7561
lib/libzfs/libzfs_mount.c:zfs_add_options provides the canonical
mount options used by a `zfs mount` command. Because we cannot call
`zfs mount` directly from a systemd.mount unit, we mirror that logic
in zfs-mount-generator.
The zed script is updated to cache these properties as well.
Include a mini-tutorial in the manual page, properly substitute
configuration paths in zfs-mount-generator.8.in, and standardize the
Makefile.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com>
Closes#7453
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:
7638 Refactor spa_load_impl into several functions
8961 SPA load/import should tell us why it failed
7277 zdb should be able to print zfs_dbgmsg's
To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.
The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.
The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.
With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
Porting notes (ZTS):
* Fix 'make dist' target in zpool_import
* The maximum path length allowed by tar is 99 characters. Several
of the new test cases exceeded this limit resulting in them not
being included in the tarball. Shorten the names slightly.
* Set/get tunables using accessor functions.
* Get last synced txg via the "zfs_txg_history" mechanism.
* Clear zinject handlers in cleanup for import_cache_device_replaced
and import_rewind_device_replaced in order that the zpool can be
exported if there is an error.
* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
ext4 file system to be created on ZFS_DISK2. Also, there's
no need to partition ZFS_DISK2 at all. The partitioning had
already been disabled for multipath devices. Among other things,
the partitioning steals some space from the ext4 file system,
makes it difficult to accurately calculate the paramters to
parted and can make some of the tests fail.
* Increase FS_SIZE and FILE_SIZE in the zpool_import test
configuration now that FILESIZE is larger.
* Write more data in order that device evacuation take lonnger in
a couple tests.
* Use mkdir -p to avoid errors when the directory already exists.
* Remove use of sudo in import_rewind_config_changed.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123Closes#7459
Currently `zdb` consistently fails to examine non-idle pools as it
fails during the `spa_load()` process. The main problem seems to be
that `spa_load_verify()` fails as can be seen below:
$ sudo zdb -d -G dcenter
zdb: can't open 'dcenter': I/O error
ZFS_DBGMSG(zdb):
spa_open_common: opening dcenter
spa_load(dcenter): LOADING
disk vdev '/dev/dsk/c4t11d0s0': best uberblock found for spa dcenter. txg 40824950
spa_load(dcenter): using uberblock with txg=40824950
spa_load(dcenter): UNLOADING
spa_load(dcenter): RELOADING
spa_load(dcenter): LOADING
disk vdev '/dev/dsk/c3t10d0s0': best uberblock found for spa dcenter. txg 40824952
spa_load(dcenter): using uberblock with txg=40824952
spa_load(dcenter): FAILED: spa_load_verify failed [error=5]
spa_load(dcenter): UNLOADING
This change makes `spa_load_verify()` a dryrun when ran from
`zdb`. This is done by creating a global flag in zfs and then setting
it in `zdb`.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/8962
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/180ad792Closes#7459
9421 zdb should detect and print out the number of "leaked" objects
9422 zfs diff and zdb should explicitly mark objects that are on
the deleted queue
It is possible for zfs to "leak" objects in such a way that they are not
freed, but are also not accessible via the POSIX interface. As the only
way to know that this is happened is to see one of them directly in a
zdb run, or by noting unaccounted space usage, zdb should be enhanced to
count these objects and return failure if some are detected.
We have access to the delete queue through the zfs_get_deleteq function;
we should call it in dump_znode to determine if the object is on the
delete queue. This is not the most efficient possible method, but it is
the simplest to implement, and should suffice for the common case where
there few objects on the delete queue.
Also zfs diff and zdb currently traverse every single dnode in a dataset
and tries to figure out the path of the object by following it's parent.
When an object is placed on the delete queue, for all practical purposes
it's already discarded, it's parent might not exist anymore, and another
object might now have the object number that belonged to the parent.
While all of the above makes sense, when trying to figure out the path
of an object that is on the delete queue, we can run into issues where
either it is impossible to determine the path because the parent is
gone, or another dnode has taken it's place and thus we are returned a
wrong path.
We should therefore avoid trying to determine the path of an object on
the delete queue and mark the object itself as being on the delete queue
to avoid confusion. To achieve this, we currently have two ideas:
1. When putting an object on the delete queue, change it's parent object
number to a known constant that means NULL.
2. When displaying objects, first check if it is present on the delete
queue.
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://illumos.org/issues/9421
OpenZFS-issue: https://illumos.org/issues/9422
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45ae0dd9caCloses#7500
This patch adds the ability for zinject to trigger decryption
and authentication faults in the ZIO and ARC layers. This
functionality is exposed via the new "decrypt" error type, which
may be provided for "data" object types.
This patch also refactors some of the core encryption / decryption
functions so that they have consistent prototypes, handle errors
consistently, and do not have unused arguments.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7474
We should use zfs_dbgmsg instead of spa_dbgmsg. Or at least,
metaslab_condense() should call zfs_dbgmsg because it's important and
rare enough to always log. It's possible that the message in
zio_dva_allocate() would be too high-frequency for zfs_dbgmsg.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Patch Notes:
* Removed ZFS_DEBUG_SPA from zfs-module-parameters.5
OpenZFS-issue: https://www.illumos.org/issues/9236
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/cfaba7f668Closes#7467
Only filesystems and volumes are valid 'zfs remap' parameters: when
passed a snapshot name zfs_remap_indirects() does not handle the
EINVAL returned from libzfs_core, which results in failing an assertion
and consequently crashing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#7454
This patch fixes 2 issues in how spill blocks are processed during
raw sends. The first problem is that compressed spill blocks were
using the logical length rather than the physical length to
determine how much data to dump into the send stream. The second
issue is a typo that caused the spill record's object number to be
used where the objset's ID number was required. Both issues have
been corrected, and the payload_size is now printed in zstreamdump
for future debugging.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7378Closes#7432
Authored by: Matt Ahrens <Matt.Ahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/9280
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/243952cCloses#7445
Remove duplicate segment copies to minimize the possible search
space for reconstruction. Once reduced an accurate assessment can
be made regarding the difficulty in reconstructing the block.
Also, ztest will now run zdb with
zfs_reconstruct_indirect_combinations_max set to 1000000 in an attempt
to avoid checksum errors.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#6900
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.
There are two underlying problems:
1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.
The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).
2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.
Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.
The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).
The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.
This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).
Porting notes:
* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
to dsl_scan_needs_resilver(), which was added to ZoL as part of the
sequential scrub work.
* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
parameter. The extra parameter is unique to ZoL.
* When posting indirect checksum errors the ABD can be passed directly,
zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591Closes#6900
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1ebCloses#6900
zfs-mount-generator implements the "systemd generator" protocol,
producing systemd.mount units from the cached outputs of zfs list,
during early boot, integrating with systemd.
Each pool has an indpendent cache of the command
zfs list -H -oname,mountpoint,canmount -tfilesystem -r $pool
which is kept synchronized by the ZEDLET
history_event-zfs-list-cacher.sh
Datasets not in the cache will be loaded later in the boot process by
zfs-mount.service, including pools without a cache.
Among other things, this allows for complex mount hierarchies.
Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com>
Closes#7329
Currently, "zfs mount -a" will print a warning and fail to mount
any encrypted datasets that do not have a key loaded. This patch
makes the behavior of this failure consistent with other failure
modes ("zfs mount -a" will silently continue, explict "zfs mount"
will print a message and return an error code.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7382
Fix a bunch of (mostly) sprintf/snprintf truncation compiler
warnings that show up on Fedora 28 (GCC 8.0.1).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#7361Closes#7368
The changes piggyback JSON output support on top of channel programs
(#6558). This way the JSON output support is targeted to scripting
use cases and is easily maintainable since it really only touches
one function (zfs_do_channel_program()).
This patch ports Joyent's JSON nvlist library from illumos to enable
easy JSON printing of channel program output nvlist. To keep the
delta small I also took advantage of the fact that printing in
zfs_do_channel_program() was almost always done before exiting
the program.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes#7281
When the pool is suspended, record whether it was due to an I/O error or
due to MMP writes failing to succeed within the required time.
Change spa_suspended from uint8_t to zio_suspend_reason_t to store the
reason.
When userspace queries pool status via spa_tryimport(), report the
reason the pool was suspended in a new key,
ZPOOL_CONFIG_SUSPENDED_REASON.
In libzfs, when interpreting the returned config nvlist, report
suspension due to MMP with a new pool status enum value,
ZPOOL_STATUS_IO_FAILURE_MMP.
In status_callback(), which generates and emits the message when 'zpool
status' is executed, add a case to print an appropriate message for the
new pool status enum value.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes#7296
Change zfs destroy logic so destroying begins before
the entire list of snapshots is built.
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes#7271
get_format_prompt_string() and zpool_state_to_name() return
a string literal which is read-only, thus they should return
`const char*`.
zpool_get_prop_string() returns a non-const string after
successful nv-lookup, and returns a string literal otherwise.
Since this function is designed to be used for read-only purpose,
the return type should also be `const char*`.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes#7285
Some usage patterns like send/recv of replication streams can
produce a large number of events. In such a case, the current
all-syslog.sh zedlet will hold up to its name, and flood the
logs with mostly redundant information. Two mitigate this
situation, this changeset introduces to new variables
ZED_SYSLOG_SUBCLASS_INCLUDE and ZED_SYSLOG_SUBCLASS_EXCLUDE
to zed.rc that give more control over which event classes end
up in the syslog.
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Daniel Kobras <d.kobras@science-computing.de>
Closes#6886Closes#7260
1) The Coverity Scan reports some issues for the project
quota patch, including:
1.1) zfs_prop_get_userquota() directly uses the const quota
type value as the condition check by wrong.
1.2) dmu_objset_userquota_get_ids() may cause dnode::dn_newgid
to be overwritten by dnode::dn->dn_oldprojid.
2) This patch fixes related issues. It also enhances the logic
for zfs_project_item_alloc() to avoid buffer overflow.
3) Skip project quota ability check if does not change project
quota related things (id or flag). Otherwise, it will cause
chattr (for other non project quota flags) operation failed
if project quota disabled.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Fan Yong <fan.yong@intel.com>
Closes#7251Closes#7265
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: George Melikov <mail@gmelikov.ru>
Signed-off-by: John Eismeier <john.eismeier@gmail.com>
Closes#7237
Add new script arc_summary3.py as a complete rewrite of the
arc_summary.py tool (see issue #6873)
Add new options:
-g/--graph - Display crude graphic representation
of ARC status and quit
-r/--raw - Print all available information as
minimally formatted list (for grep)
-s/--section - Print a single section. This
replaces -p/--page, which is kept for
backwards use but marked as
depreciated
Add new sections with information on ZIL and SPL. Notify user
if sections L2ARC and VDEV are skipped instead of failing
silently. Add warning that -p/--page option is depreciated.
Developed for Python 3.5.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Scot W. Stevenson <scot.stevenson@gmail.com>
Closes#6873Closes#6892
Add in SMART self-test results to zpool status|iostat -c. This
works for both SAS and SATA drives.
Also, add plumbing to allow the 'smart' script to take smartctl
output from a directory of output text files instead of running
it against the vdevs.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#7178
When invoked with wrong parameters 'zfs bookmark' fails to gracefully
validate user input and crashes.
This is a regression accidentally introduced in 587e228; this commit
adds additional tests to the ZFS Test Suite to exercise this codepath.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: KireinaHoro <i@jsteward.moe>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#7228Closes#7229
* Add a zed script to kick off a scrub after a resilver. The script is
disabled by default.
* Add a optional $PATH (-P) option to zed to allow it to use a custom
$PATH for its zedlets. This is needed when you're running zed under
the ZTS in a local workspace.
* Update test scripts to not copy in all-debug.sh and all-syslog.sh by
default. They can be optionally copied in as part of zed_setup().
These scripts slow down zed considerably under heavy events loads and
can cause events to be dropped or their delivery delayed. This was
causing some sporadic failures in the 'fault' tests.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#4662Closes#7086
This adds the SMART attributes required to probe Samsung SSD and NVMe
(and possibly others) disks when using the "zpool status -c" command.
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes#7183Closes#7193
This change implements 'zfs send -b' which can be used to send only
received property values whether or not they are overridden by local
settings.
This can be very useful during "restore" operations from a backup pool
because it allows to send only the property values originally sent
from the backup source, even though they were later modified on the
destination either by a 'zfs set' operation, explicit 'zfs inherit' or
overridden during the receive process via 'zfs receive -o|-x'.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#7156
CID 173243, 173245: Memory - corruptions (OVERRUN)
Added size argument to lcompat_sprintf() to avoid use of INT_MAX
CID 173244: Integer handling issues (OVERFLOW_BEFORE_WIDEN)
Added cast to uint64_t to avoid a 32 bit overflow warning
CID 173242: Integer handling issues (CONSTANT_EXPRESSION_RESULT)
Conditionally removed unused luai_numisnan() floating point check
CID 173241: Resource leaks (RESOURCE_LEAK)
Added missing close(fd) on error path
CID 173240: (UNINIT)
Fixed uninitialized variable in get_special_prop()
CID 147560: Null pointer dereferences (NULL_RETURNS)
Cleaned up bad code merge in dsl_dataset_promote_check()
CID 28475: Memory - illegal accesses (OVERRUN)
Fixed lcompat_sprintf() to use a size paramater
CID 28418, 28422: Error handling issues (CHECKED_RETURN)
Added function result cast to (void) to avoid warning
CID 23935, 28411, 28412: Memory - corruptions (ARRAY_VS_SINGLETON)
Added casts to avoid exposing result as an array
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes#7181
Project quota is a new ZFS system space/object usage accounting
and enforcement mechanism. Similar as user/group quota, project
quota is another dimension of system quota. It bases on the new
object attribute - project ID.
Project ID is a numerical value to indicate to which project an
object belongs. An object only can belong to one project though
you (the object owner or privileged user) can change the object
project ID via 'chattr -p' or 'zfs project [-s] -p' explicitly.
The object also can inherit the project ID from its parent when
created if the parent has the project inherit flag (that can be
set via 'chattr +P' or 'zfs project -s [-p]').
By accounting the spaces/objects belong to the same project, we
can know how many spaces/objects used by the project. And if we
set the upper limit then we can control the spaces/objects that
are consumed by such project. It is useful when multiple groups
and users cooperate for the same project, or a user/group needs
to participate in multiple projects.
Support the following commands and functionalities:
zfs set projectquota@project
zfs set projectobjquota@project
zfs get projectquota@project
zfs get projectobjquota@project
zfs get projectused@project
zfs get projectobjused@project
zfs projectspace
zfs allow projectquota
zfs allow projectobjquota
zfs allow projectused
zfs allow projectobjused
zfs unallow projectquota
zfs unallow projectobjquota
zfs unallow projectused
zfs unallow projectobjused
chattr +/-P
chattr -p project_id
lsattr -p
This patch also supports tree quota based on the project quota via
"zfs project" commands set as following:
zfs project [-d|-r] <file|directory ...>
zfs project -C [-k] [-r] <file|directory ...>
zfs project -c [-0] [-d|-r] [-p id] <file|directory ...>
zfs project [-p id] [-r] [-s] <file|directory ...>
For "df [-i] $DIR" command, if we set INHERIT (project ID) flag on
the $DIR, then the proejct [obj]quota and [obj]used values for the
$DIR's project ID will be shown as the total/free (avail) resource.
Keep the same behavior as EXT4/XFS does.
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by Ned Bass <bass6@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Fan Yong <fan.yong@intel.com>
TEST_ZIMPORT_POOLS="zol-0.6.1 zol-0.6.2 master"
Change-Id: Ib4f0544602e03fb61fd46a849d7ba51a6005693c
Closes#6290
Receiving an incremental stream after an interrupted "zfs receive -s"
fails with the message "dataset is busy": this is because we still have
the hidden clone ../%recv from the resumable receive.
Improve the error message suggesting the existence of a partially
complete resumable stream from "zfs receive -s" which can be either
aborted ("zfs receive -A") or resumed ("zfs send -t").
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#7129Closes#7154
zdb -ed on objset for exported pool would failed with:
failed to own dataset 'qq/fs0': No such file or directory
The reason is that zdb pass objset name to spa_import, it uses that
name to create a spa. Later, when dmu_objset_own tries to lookup the spa
using real pool name, it can't find one.
We fix this by make sure we pass pool name rather than objset name to
spa_import.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes#7099Closes#6464
SPA_MAXBLOCKSIZE is too large for stack.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes#7099
There are some issues in the zdb -R decompression implementation.
The first is that ZLE can easily decompress non-ZLE streams. So we add
ZDB_NO_ZLE env to make zdb skip ZLE.
The second is the random bytes appended to pabd, pbuf2 stuff. This serve
no purpose at all, those bytes shouldn't be read during decompression
anyway. Instead, we randomize lbuf2, so that we can make sure
decompression fill exactly to lsize by bcmp lbuf and lbuf2.
The last one is the condition to detect fail is wrong.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes#7099Closes#4984
zcb_haderrors will be modified in zdb_blkptr_done, which is
asynchronous. So we must move this assignment after zio_wait.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes#7099
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Don Brady <don.brady@delphix.com>
We want to be able to run channel programs outside of synching
context. This would greatly improve performance for channel programs
that just gather information, as they won't have to wait for synching
context anymore.
=== What is implemented?
This feature introduces the following:
- A new command line flag in "zfs program" to specify our intention
to run in open context. (The -n option)
- A new flag/option within the channel program ioctl which selects
the context.
- Appropriate error handling whenever we try a channel program in
open-context that contains zfs.sync* expressions.
- Documentation for the new feature in the manual pages.
=== How do we handle zfs.sync functions in open context?
When such a function is found by the interpreter and we are running
in open context we abort the script and we spit out a descriptive
runtime error. For example, given the script below ...
arg = ...
fs = arg["argv"][1]
err = zfs.sync.destroy(fs)
msg = "destroying " .. fs .. " err=" .. err
return msg
if we run it in open context, we will get back the following error:
Channel program execution failed:
[string "channel program"]:3: running functions from the zfs.sync
submodule requires passing sync=TRUE to lzc_channel_program()
(i.e. do not specify the "-n" command line argument)
stack traceback:
[C]: in function 'destroy'
[string "channel program"]:3: in main chunk
=== What about testing?
We've introduced new wrappers for all channel program tests that
run each channel program as both (startard & open-context) and
expect the appropriate behavior depending on the program using
the zfs.sync module.
OpenZFS-issue: https://www.illumos.org/issues/8677
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/17a49e15Closes#6558
Authored by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Don Brady <don.brady@delphix.com>
Ported-by: John Kennedy <john.kennedy@delphix.com>
OpenZFS-issue: https://www.illumos.org/issues/7431
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/dfc11533
Porting Notes:
* The CLI long option arguments for '-t' and '-m' don't parse on linux
* Switched from kmem_alloc to vmem_alloc in zcp_lua_alloc
* Lua implementation is built as its own module (zlua.ko)
* Lua headers consumed directly by zfs code moved to 'include/sys/lua/'
* There is no native setjmp/longjump available in stock Linux kernel.
Brought over implementations from illumos and FreeBSD
* The get_temporary_prop() was adapted due to VFS platform differences
* Use of inline functions in lua parser to reduce stack usage per C call
* Skip some ZFS Test Suite ZCP tests on sparc64 to avoid stack overflow
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: DHE <git@dehacked.net>
Closes#7133
In order to reliably detect deadlocks in the create and import
path ztest should set the failure mode property. This ensures
that the pool is always using the correct failmode behavior.
Removed insidious use of local variable in MAXFAULTS macro.
Converted VERIFY() to VERIFY0() where appropriate.
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7111
Currently, when a raw zfs send file includes a DRR_OBJECT record
that would decrease the number of levels of an existing object,
the object is reallocated with dmu_object_reclaim() which
creates the new dnode using the old object's nlevels. For non-raw
sends this doesn't really matter, but raw sends require that
nlevels on the receive side match that of the send side so that
the checksum-of-MAC tree can be properly maintained. This patch
corrects the issue by freeing the object completely before
allocating it again in this case.
This patch also corrects several issues with dnode_hold_impl()
and related functions that prevented dnodes (particularly
multi-slot dnodes) from being reallocated properly due to
the fact that existing dnodes were not being fully cleaned up
when they were freed.
This patch adds a test to make sure that zfs recv functions
properly with incremental streams containing dnodes of different
sizes.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#6821Closes#6864
The on-disk format for encrypted datasets protects not only
the encrypted and authenticated blocks themselves, but also
the order and interpretation of these blocks. In order to
make this work while maintaining the ability to do raw
sends, the indirect bps maintain a secure checksum of all
the MACs in the block below it along with a few other
fields that determine how the data is interpreted.
Unfortunately, the current on-disk format erroneously
includes some fields which are not portable and thus cannot
support raw sends. It is not possible to easily work around
this issue due to a separate and much smaller bug which
causes indirect blocks for encrypted dnodes to not be
compressed, which conflicts with the previous bug. In
addition, the current code generates incompatible on-disk
formats on big endian and little endian systems due to an
issue with how block pointers are authenticated. Finally,
raw send streams do not currently include dn_maxblkid when
sending both the metadnode and normal dnodes which are
needed in order to ensure that we are correctly maintaining
the portable objset MAC.
This patch zero's out the offending fields when computing
the bp MAC and ensures that these MACs are always
calculated in little endian order (regardless of the host
system's byte order). This patch also registers an errata
for the old on-disk format, which we detect by adding a
"version" field to newly created DSL Crypto Keys. We allow
datasets without a version (version 0) to only be mounted
for read so that they can easily be migrated. We also now
include dn_maxblkid in raw send streams to ensure the MAC
can be maintained correctly.
This patch also contains minor bug fixes and cleanups.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#6845Closes#6864Closes#7052
* Remove 'zfs snap' from zfs help message (OpenZFS sync)
* Update zfs(8) to suggest 'snap' can be used as an alias for 'snapshot'
* Enforce 80 columns limit in help messages
* Remove zfs_disable_dup_eviction from zfs-module-parameters(5)
* Expose zfs_scan_max_ext_gap as a kernel module parameter.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#7087
Introduce kstats about the dbuf hash and dbuf cache
to make it easier to inspect state. This should help
with debugging and understanding of these portions
of the codebase.
Correct format of dbuf kstat file.
Introduce a dbc column to dbufs kstat to indicate if
a dbuf is in the dbuf cache.
Introduce field filtering in the dbufstat python script.
Introduce a no header option to the dbufstat python script.
Introduce a test case to test basic mru->mfu list movement
in the ARC.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes#6906
The zdb(8) command may not terminate in the case where the pool
gets suspended and there is a caller in zio_wait() blocking on
an outstanding read I/O that will never complete. This can in
turn cause ztest(1) to block indefinitely despite the deadman.
Resolve the issue by setting the default failure mode for zdb(8)
to panic. In user space we always want the command to terminate
when forward progress is no longer possible.
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#6999
The intent of this patch is extend the existing deadman code
such that it's flexible enough to be used by both ztest and
on production systems. The proposed changes include:
* Added a new `zfs_deadman_failmode` module option which is
used to dynamically control the behavior of the deadman. It's
loosely modeled after, but independant from, the pool failmode
property. It can be set to wait, continue, or panic.
* wait - Wait for the "hung" I/O (default)
* continue - Attempt to recover from a "hung" I/O
* panic - Panic the system
* Added a new `zfs_deadman_ziotime_ms` module option which is
analogous to `zfs_deadman_synctime_ms` except instead of
applying to a pool TXG sync it applies to zio_wait(). A
default value of 300s is used to define a "hung" zio.
* The ztest deadman thread has been re-enabled by default,
aligned with the upstream OpenZFS code, and then extended
to terminate the process when it takes significantly longer
to complete than expected.
* The -G option was added to ztest to print the internal debug
log when a fatal error is encountered. This same option was
previously added to zdb in commit fa603f82. Update zloop.sh
to unconditionally pass -G to obtain additional debugging.
* The FM_EREPORT_ZFS_DELAY event which was previously posted
when the deadman detect a "hung" pool has been replaced by
a new dedicated FM_EREPORT_ZFS_DEADMAN event.
* The proposed recovery logic attempts to restart a "hung"
zio by calling zio_interrupt() on any outstanding leaf zios.
We may want to further restrict this to zios in either the
ZIO_STAGE_VDEV_IO_START or ZIO_STAGE_VDEV_IO_DONE stages.
Calling zio_interrupt() is expected to only be useful for
cases when an IO has been submitted to the physical device
but for some reasonable the completion callback hasn't been
called by the lower layers. This shouldn't be possible but
has been observed and may be caused by kernel/driver bugs.
* The 'zfs_deadman_synctime_ms' default value was reduced from
1000s to 600s.
* Depending on how ztest fails there may be no cache file to
move. This should not be considered fatal, collect the logs
which are available and carry on.
* Add deadman test cases for spa_deadman() and zio_wait().
* Increase default zfs_deadman_checktime_ms to 60s.
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#6999
usr/src/uts/common/sys/fs/zfs.h
Change ZPROP_INVAL and ZPROP_CONT from macros to enum values. Clang
and GCC both prefer to use unsigned ints to store enums. That was
causing tautological comparison warnings (and likely eliminating
error handling code at compile time) whenever a zfs_prop_t or
zpool_prop_t was compared to ZPROP_INVAL or ZPROP_CONT. Making the
error flags be explicity enum values forces the enum types to be
signed.
ZPROP_INVAL was also compared against two different enum types. I
had to change its name to ZPOOL_PROP_INVAL whenever its compared to
a zpool_prop_t. There are still some places where ZPROP_INVAL or
ZPROP_CONT is compared to a plain int, in code that doesn't know
whether the int is storing a zfs_prop_t or a zpool_prop_t.
usr/src/uts/common/fs/zfs/spa.c
s/ZPROP_INVAL/ZPOOL_PROP_INVAL/
Authored by: Alan Somers <asomers@gmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: George Melikov <mail@gmelikov.ru>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/8652
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c2de80dc74Closes#7061
Resolve new warnings reported after upgrading to shellcheck
version 0.4.6. This patch contains no functional changes.
* egrep is non-standard and deprecated. Use grep -E instead. [SC2196]
* Check exit code directly with e.g. 'if mycmd;', not indirectly
with $?. [SC2181] Suppressed.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7040
For ztest, which is solely for testing, using a pseudo random
is entirely reasonable. Using /dev/urandom ensures the system
entropy pool doesn't get depleted thus stalling the testing.
This is a particular problem when testing in VMs.
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7017Closes#7036
When --enable-asan is provided to configure then build all user
space components with fsanitize=address. For kernel support
use the Linux KASAN feature instead.
https://github.com/google/sanitizers/wiki/AddressSanitizer
When using gcc version 4.8 any test case which intentionally
generates a core dump will fail when using --enable-asan.
The default behavior is to disable core dumps and only newer
versions allow this behavior to be controled at run time with
the ASAN_OPTIONS environment variable.
Additionally, this patch includes some build system cleanup.
* Rules.am updated to set the minimum AM_CFLAGS, AM_CPPFLAGS,
and AM_LDFLAGS. Any additional flags should be added on a
per-Makefile basic. The --enable-debug and --enable-asan
options apply to all user space binaries and libraries.
* Compiler checks consolidated in always-compiler-options.m4
and renamed for consistency.
* -fstack-check compiler flag was removed, this functionality
is provided by asan when configured with --enable-asan.
* Split DEBUG_CFLAGS in to DEBUG_CFLAGS, DEBUG_CPPFLAGS, and
DEBUG_LDFLAGS.
* Moved default kernel build flags in to module/Makefile.in and
split in to ZFS_MODULE_CFLAGS and ZFS_MODULE_CPPFLAGS. These
flags are set with the standard ccflags-y kbuild mechanism.
* -Wframe-larger-than checks applied only to binaries or
libraries which include source files which are built in
both user space and kernel space. This restriction is
relaxed for user space only utilities.
* -Wno-unused-but-set-variable applied only to libzfs and
libzpool. The remaining warnings are the result of an
ASSERT using a variable when is always declared.
* -D_POSIX_PTHREAD_SEMANTICS and -D__EXTENSIONS__ dropped
because they are Solaris specific and thus not needed.
* Ensure $GDB is defined as gdb by default in zloop.sh.
Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7027
Resolved unused variable warnings observed after restricting
-Wno-unused-but-set-variable to only libzfs and libzpool.
Reviewed-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#6941
In ztest_verify_dnode_bt the ztest_object_lock must be held in
order to safely verify the unused bonus space.
Reviewed-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#6941
This fixes zhack's command processing on ARM. On ARM char
is unsigned, and so, in promotion to an int, it will never
compare equal to -1.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Nathaniel Wesley Filardo <nwf@cs.jhu.edu>
Closes#7016
When replacing a faulted device which was previously handled by a spare
multiple levels of nested interior VDEVs will be present in the pool
configuration; the following example illustrates one of the possible
situations:
NAME STATE READ WRITE CKSUM
testpool DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
spare-0 DEGRADED 0 0 0
replacing-0 DEGRADED 0 0 0
/var/tmp/fault-dev UNAVAIL 0 0 0 cannot open
/var/tmp/replace-dev ONLINE 0 0 0
/var/tmp/spare-dev1 ONLINE 0 0 0
/var/tmp/safe-dev ONLINE 0 0 0
spares
/var/tmp/spare-dev1 INUSE currently in use
This is safe and allowed, but get_replication() needs to handle this
situation gracefully to let zpool add new devices to the pool.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#6678Closes#6996
This extends vdev_id to support a new slot type, ses, for SCSI Enclosure
Services. With slot type ses, the disk slot numbers are determined by
using the device slot number reported by sg_ses for the device with
matching SAS address, found by querying all available enclosures.
This is primarily of use on systems with a deficient driver omitting
support for bay_identifier in /sys/devices. In my testing, I found that
the existing slot types of port and id were not stable across disk
replacement, so an alternative was required.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Simon Guest <simon.guest@tesujimath.org>
Closes#6956
Using a command similar to 'arc_summary.py | head' causes
a broken pipe exception. Gracefully exit in the case of a
broken pipe in arc_summary.py.
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes#6965Closes#6969
If an invalid option is provided to arc_summary.py we handle any error
thrown from the getopt Python module and print the usage help message.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#6983
The function that fills the uberblock ring buffer on every device label
has been reworked to avoid occasional failures caused by a race
condition that prevents 'zpool sync' from writing some uberblock
sequentially: this happens when the pool sync ioctl dispatch code calls
txg_wait_synced() while we're already waiting for a TXG to sync.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#6924Closes#6977
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#6962
* Teach ZED to handle spares usingi the configured ashift: if the zpool
'ashift' property is set then ZED should use its value when kicking
in a hotspare; with this change 512e disks can be used as spares
for VDEVs that were created with ashift=9, even if ZFS natively
detects them as 4K block devices.
* Introduce an additional auto_spare test case which verifies that in
the face of multiple device failures an appropiate number of spares
are kicked in.
* Fix zed_stop() in "libtest.shlib" which did not correctly wait the
target pid.
* Fix ZED crashing on startup caused by a race condition in libzfs
when used in multi-threaded context.
* Convert ZED over to using the tpool library which is already present
in the Illumos FMA code.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#2562Closes#6858
Fix a segfault when running 'zpool iostat -v 1' while adding
a VDEV.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#6748Closes#6872
Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Prakash Surya <prakash.surya@delphix.com>
Problem
=======
The current implementation of zil_commit() can introduce significant
latency, beyond what is inherent due to the latency of the underlying
storage. The additional latency comes from two main problems:
1. When there's outstanding ZIL blocks being written (i.e. there's
already a "writer thread" in progress), then any new calls to
zil_commit() will block waiting for the currently oustanding ZIL
blocks to complete. The blocks written for each "writer thread" is
coined a "batch", and there can only ever be a single "batch" being
written at a time. When a batch is being written, any new ZIL
transactions will have to wait for the next batch to be written,
which won't occur until the current batch finishes.
As a result, the underlying storage may not be used as efficiently
as possible. While "new" threads enter zil_commit() and are blocked
waiting for the next batch, it's possible that the underlying
storage isn't fully utilized by the current batch of ZIL blocks. In
that case, it'd be better to allow these new threads to generate
(and issue) a new ZIL block, such that it could be serviced by the
underlying storage concurrently with the other ZIL blocks that are
being serviced.
2. Any call to zil_commit() must wait for all ZIL blocks in its "batch"
to complete, prior to zil_commit() returning. The size of any given
batch is proportional to the number of ZIL transaction in the queue
at the time that the batch starts processing the queue; which
doesn't occur until the previous batch completes. Thus, if there's a
lot of transactions in the queue, the batch could be composed of
many ZIL blocks, and each call to zil_commit() will have to wait for
all of these writes to complete (even if the thread calling
zil_commit() only cared about one of the transactions in the batch).
To further complicate the situation, these two issues result in the
following side effect:
3. If a given batch takes longer to complete than normal, this results
in larger batch sizes, which then take longer to complete and
further drive up the latency of zil_commit(). This can occur for a
number of reasons, including (but not limited to): transient changes
in the workload, and storage latency irregularites.
Solution
========
The solution attempted by this change has the following goals:
1. no on-disk changes; maintain current on-disk format.
2. modify the "batch size" to be equal to the "ZIL block size".
3. allow new batches to be generated and issued to disk, while there's
already batches being serviced by the disk.
4. allow zil_commit() to wait for as few ZIL blocks as possible.
5. use as few ZIL blocks as possible, for the same amount of ZIL
transactions, without introducing significant latency to any
individual ZIL transaction. i.e. use fewer, but larger, ZIL blocks.
In theory, with these goals met, the new allgorithm will allow the
following improvements:
1. new ZIL blocks can be generated and issued, while there's already
oustanding ZIL blocks being serviced by the storage.
2. the latency of zil_commit() should be proportional to the underlying
storage latency, rather than the incoming synchronous workload.
Porting Notes
=============
Due to the changes made in commit 119a394ab0, the lifetime of an itx
structure differs than in OpenZFS. Specifically, the itx structure is
kept around until the data associated with the itx is considered to be
safe on disk; this is so that the itx's callback can be called after the
data is committed to stable storage. Since OpenZFS doesn't have this itx
callback mechanism, it's able to destroy the itx structure immediately
after the itx is committed to an lwb (before the lwb is written to
disk).
To support this difference, and to ensure the itx's callbacks can still
be called after the itx's data is on disk, a few changes had to be made:
* A list of itxs was added to the lwb structure. This list contains
all of the itxs that have been committed to the lwb, such that the
callbacks for these itxs can be called from zil_lwb_flush_vdevs_done(),
after the data for the itxs is committed to disk.
* A list of itxs was added on the stack of the zil_process_commit_list()
function; the "nolwb_itxs" list. In some circumstances, an itx may
not be committed to an lwb (e.g. if allocating the "next" ZIL block
on disk fails), so this list is used to keep track of which itxs
fall into this state, such that their callbacks can be called after
the ZIL's writer pipeline is "stalled".
* The logic to actually call the itx's callback was moved into the
zil_itx_destroy() function. Since all consumers of zil_itx_destroy()
were effectively performing the same logic (i.e. if callback is
non-null, call the callback), it seemed like useful code cleanup to
consolidate this logic into a single function.
Additionally, the existing Linux tracepoint infrastructure dealing with
the ZIL's probes and structures had to be updated to reflect these code
changes. Specifically:
* The "zil__cw1" and "zil__cw2" probes were removed, so they had to be
removed from "trace_zil.h" as well.
* Some of the zilog structure's fields were removed, which affected
the tracepoint definitions of the structure.
* New tracepoints had to be added for the following 3 new probes:
* zil__process__commit__itx
* zil__process__normal__itx
* zil__commit__io__error
OpenZFS-issue: https://www.illumos.org/issues/8585
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5d95a3aCloses#6566
When the pool configuration contains a hole due to a previous device
removal ignore this top level vdev. Failure to do so will result in
the current configuration being assessed to have a non-uniform
replication level and the expected warning will be disabled.
The zpool_add_010_pos test case was extended to cover this scenario.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#6907Closes#6911
Display correct data from kstat arcstats for evict_skips,
which is currently repeating the data from mutex_misses.
Fixes#6882
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Scot W. Stevenson <scot.stevenson@gmail.com>
Closes#6882Closes#6883
Currently, scrubs and resilvers can take an extremely
long time to complete. This is largely due to the fact
that zfs scans process pools in logical order, as
determined by each block's bookmark. This makes sense
from a simplicity perspective, but blocks in zfs are
often scattered randomly across disks, particularly
due to zfs's copy-on-write mechanisms.
This patch improves performance by splitting scrubs
and resilvers into a metadata scanning phase and an IO
issuing phase. The metadata scan reads through the
structure of the pool and gathers an in-memory queue
of I/Os, sorted by size and offset on disk. The issuing
phase will then issue the scrub I/Os as sequentially as
possible, greatly improving performance.
This patch also updates and cleans up some of the scan
code which has not been updated in several years.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Authored-by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Authored-by: Alek Pinchuk <apinchuk@datto.com>
Authored-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#3625Closes#6256
Remove unused library re and associated variable kstat_pobj. Add note
to documentation at start of program about required support for old
versions of Python. Change variable "format" (which is a built-in
function) to "fmt".
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Scot W. Stevenson <scot.stevenson@gmail.com>
Closes#6869
Prevents arc_summary.py crashing when called with parameter -d or
long form --description with Python3.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Scot W. Stevenson <scot.stevenson@gmail.com>
Closes#6849Closes#6850
Sort list of tunables printed by _tunable_summary()
alphabetically
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Scot W. Stevenson <scot.stevenson@gmail.com>
Closes#6828
Include docstrings (PEP8, PEP257) for module and all functions.
Separately, remove outdated section in comment at start of
module. Separately, remove unused global constant "usetunable".
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Scot W. Stevenson <scot.stevenson@gmail.com>
Closes#6818
Complete rewrite of fHits(). Move units from non-standard English
abbreviations to SI units, thereby avoiding confusion because of
"long scale" and "short scale" numbers. Remove unused parameter
"Decimal". Add function string. Aim to confirm to PEP8.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Scot W. Stevenson <scot.stevenson@gmail.com>
Closes#6815
Simplify and inline single-use function div1(); inline twice-used
function div2(); add function comment to zfs_header(); replace
variable "unused" in get_Kstat() with "_" following convention.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Scot W. Stevenson <scot.stevenson@gmail.com>
Closes#6802
Porting Notes:
- The OpenZFS patch added nicenum_scale() and nicenum() to a
library not used by ZFS. Rather than pull in a new dependency
the version of nicenum in lib/libzpool/util.c was simply
replaced with the new one.
Reviewed by: Sebastian Wiedenroth <wiedi@frubar.net>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Yuri Pankov <yuripv@gmx.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Authored by: Jason King <jason.brian.king@gmail.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/640
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0a055120Closes#6796
Replace if-elif-else construction with shorter loop;
remove unused parameter "Decimal"; centralize format
string; add function documentation string; conform to
PEP8.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Scot W. Stevenson <scot.stevenson@gmail.com>
Closes#6784
Fix compiler warnings in zdb. With these changes, FreeBSD can compile
zdb with all compiler warnings enabled save -Wunused-parameter.
usr/src/cmd/zdb/zdb.c
usr/src/cmd/zdb/zdb_il.c
usr/src/uts/common/fs/zfs/sys/sa.h
usr/src/uts/common/fs/zfs/sys/spa.h
Fix numerous warnings, including:
* const-correctness
* shadowing global definitions
* signed vs unsigned comparisons
* missing prototypes, or missing static declarations
* unused variables and functions
* Unreadable array initializations
* Missing struct initializers
usr/src/cmd/zdb/zdb.h
Add a header file to declare common symbols
usr/src/lib/libzpool/common/sys/zfs_context.h
usr/src/uts/common/fs/zfs/arc.c
usr/src/uts/common/fs/zfs/dbuf.c
usr/src/uts/common/fs/zfs/spa.c
usr/src/uts/common/fs/zfs/txg.c
Add a function prototype for zk_thread_create, and ensure that every
callback supplied to this function actually matches the prototype.
usr/src/cmd/ztest/ztest.c
usr/src/uts/common/fs/zfs/sys/zil.h
usr/src/uts/common/fs/zfs/zfs_replay.c
usr/src/uts/common/fs/zfs/zvol.c
Add a function prototype for zil_replay_func_t, and ensure that
every function of this type actually matches the prototype.
usr/src/uts/common/fs/zfs/sys/refcount.h
Change FTAG so it discards any constness of __func__, necessary
since existing APIs expect it passed as void *.
Porting Notes:
- Many of these fixes have already been applied to Linux. For
consistency the OpenZFS version of a change was applied if the
warning was addressed in an equivalent but different fashion.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Authored by: Alan Somers <asomers@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/8081
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/843abe1b8aCloses#6787
Additionally add four new tests:
* zpool_events_clear: verify 'zpool events -c' functionality
* zpool_events_cliargs: verify command line options and arguments
* zpool_events_follow: verify 'zpool events -f'
* zpool_events_poolname: verify events filtering by pool name
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#3285Closes#6762
Added -n flag to zpool reopen that allows a running scrub
operation to continue if there is a device with Dirty Time Log.
By default if a component device has a DTL and zpool reopen
is executed all running scan operations will be restarted.
Added functional tests for `zpool reopen`
Tests covers following scenarios:
* `zpool reopen` without arguments,
* `zpool reopen` with pool name as argument,
* `zpool reopen` while scrubbing,
* `zpool reopen -n` while scrubbing,
* `zpool reopen -n` while resilvering,
* `zpool reopen` with bad arguments.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Arkadiusz Bubała <arkadiusz.bubala@open-e.com>
Closes#6076Closes#6746
Otherwise, if arcstat gets interrupted before the desired number of
iterations is reached, the output file will be empty (both if set via
'-o' or via shell redirection).
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Closes#6775
Use mfu_size and mru_size pulled from the arcstats
kstat file to calculate the mfu and mru percentages
for arc size breakdown.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: AndCycle <andcycle@andcycle.idv.tw>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes#5526Closes#6770
Fix new flake8 errors related to bare excepts and ambiguous
variable names due to a STYLE builder update.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes#6776
Currently the 480GB models of this disk do not use ashift=12 by
default. SSDSC2BW48 is also optimized for 4k blocks.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: adisbladis <adis@blad.is>
Closes#6774
CID 161388: Resource Leak (REASOURCE_LEAK)
Jump to errout so that file descriptor gets closed before returning
from function.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tobin C. Harding <me@tobin.cc>
Closes#6755
CID 147480: Logically dead code (DEADCODE)
Remove non-null check and subsequent function call. Add ASSERT to future
proof the code.
usage label is only jumped to before `zhp` is initialized.
CID 147584: Out-of-bounds access (OVERRUN)
Subtract length of current string from buffer length for `size` argument
to `snprintf`.
Starting address for the write is the start of the buffer + the current
string length. We need to subtract this string length else risk a buffer
overflow.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tobin C. Harding <me@tobin.cc>
Closes#6745
Several issues were uncovered by running stress tests with zfs
encryption and raw sends in particular. The issues and their
associated fixes are as follows:
* arc_read_done() has the ability to chain several requests for
the same block of data via the arc_callback_t struct. In these
cases, the ARC would only use the first request's dsobj from
the bookmark to decrypt the data. This is problematic because
the first request might be a prefetch zio which is able to
handle the key not being loaded, while the second might use a
different key that it is sure will work. The fix here is to
pass the dsobj with each individual arc_callback_t so that each
request can attempt to decrypt the data separately.
* DRR_FREE and DRR_FREEOBJECT records in a send file were not
having their transactions properly tagged as raw during raw
sends, which caused a panic when the dbuf code attempted to
decrypt these blocks.
* traverse_prefetch_metadata() did not properly set
ZIO_FLAG_SPECULATIVE when issuing prefetch IOs.
* Added a few asserts and code cleanups to ensure these issues
are more detectable in the future.
Signed-off-by: Tom Caputi <tcaputi@datto.com>
* PBKDF2 implementation changed to OpenSSL implementation.
* HKDF implementation moved to its own file and tests
added to ensure correctness.
* Removed libzfs's now unnecessary dependency on libzpool
and libicp.
* Ztest can now create and test encrypted datasets. This is
currently disabled until issue #6526 is resolved, but
otherwise functions as advertised.
* Several small bug fixes discovered after enabling ztest
to run on encrypted datasets.
* Fixed coverity defects added by the encryption patch.
* Updated man pages for encrypted send / receive behavior.
* Fixed a bug where encrypted datasets could receive
DRR_WRITE_EMBEDDED records.
* Minor code cleanups / consolidation.
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Make two instances of the same change. Change bitwise AND (&) to logical
AND (&&).
Currently the code uses a bitwise AND between two boolean values.
In the first instance;
The first operand is a flag that has been bitwise combined with a bit
mask to get a boolean value as to whether a file has group write
permissions set.
The second operand used is a struct member that is intended as a
boolean flag not a bit mask.
In the second instance the argument is the same except with world write
permissions instead of group write (S_IWOTH, S_IWGRP).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Tobin C. Harding <me@tobin.cc>
Closes#6684Closes#6722
On systems with SCSI rather than SAS disk topology, this change enables
the vdev_id script to match against the block device path, and therefore
create a vdev alias in /dev/disk/by-vdev.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Simon Guest <simon.guest@tesujimath.org>
Closes#6592
Quote "$MMP_IMPORT_MSG" when it is passed as an argument, as it is a
multi-word string. Some tests were passing when they should not have,
because the grep was only testing for the first word.
Correct the message expected when no hostid is set and the test attempts
to enable multihost. It did not match the actual output in that
situation.
Disable ztest_reguid() when ztest is invoked with the -M option. If
ztest performs a reguid, a concurrent import attempt may fail with the
error "one or more devices is currently unavailable" if the guid sum is
calculated on the original device guids but compared against the guid
sum ztest wrote based on the new device guids.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes#6666
FRU and LIBTOPO support are illumos only features that will not be ported to
Linux and make the code more complicated than necessary. This commit
makes way for further cleanups of the zed/FMA code.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: David Quigley <david.quigley@intel.com>
Closes#6641
The build for ztest always enabled debug information but does not enable
asserts unless --enable-debug is used. This will always enable asserts
in the ztest code.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: David Quigley <david.quigley@intel.com>
Closes#6640
This leverages the functionality introduced in cf7684b to expose
verbose, dry-run and parsable 'zfs send' options for bookmarks.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#3666Closes#6601
Refactor dmu_object_alloc_dnsize() and dnode_hold_impl() to simplify the
code, fix errors introduced by commit dbeb879 (PR #6117) interacting
badly with large dnodes, and improve performance.
* When allocating a new dnode in dmu_object_alloc_dnsize(), update the
percpu object ID for the core's metadnode chunk immediately. This
eliminates most lock contention when taking the hold and creating the
dnode.
* Correct detection of the chunk boundary to work properly with large
dnodes.
* Separate the dmu_hold_impl() code for the FREE case from the code for
the ALLOCATED case to make it easier to read.
* Fully populate the dnode handle array immediately after reading a
block of the metadnode from disk. Subsequently the dnode handle array
provides enough information to determine which dnode slots are in use
and which are free.
* Add several kstats to allow the behavior of the code to be examined.
* Verify dnode packing in large_dnode_008_pos.ksh. Since the test is
purely creates, it should leave very few holes in the metadnode.
* Add test large_dnode_009_pos.ksh, which performs concurrent creates
and deletes, to complement existing test which does only creates.
With the above fixes, there is very little contention in a test of about
200,000 racing dnode allocations produced by tests 'large_dnode_008_pos'
and 'large_dnode_009_pos'.
name type data
dnode_hold_dbuf_hold 4 0
dnode_hold_dbuf_read 4 0
dnode_hold_alloc_hits 4 3804690
dnode_hold_alloc_misses 4 216
dnode_hold_alloc_interior 4 3
dnode_hold_alloc_lock_retry 4 0
dnode_hold_alloc_lock_misses 4 0
dnode_hold_alloc_type_none 4 0
dnode_hold_free_hits 4 203105
dnode_hold_free_misses 4 4
dnode_hold_free_lock_misses 4 0
dnode_hold_free_lock_retry 4 0
dnode_hold_free_overflow 4 0
dnode_hold_free_refcount 4 57
dnode_hold_free_txg 4 0
dnode_allocate 4 203154
dnode_reallocate 4 0
dnode_buf_evict 4 23918
dnode_alloc_next_chunk 4 4887
dnode_alloc_race 4 0
dnode_alloc_next_block 4 18
The performance is slightly improved for concurrent creates with
16+ threads, and unchanged for low thread counts.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes#5396Closes#6522Closes#6414Closes#6564
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#6535
Since OpenZFS 7578 (1b7c1e5) if we have a ZVOL with logbias=throughput
we will force WR_INDIRECT itxs in zvol_log_write() setting itx->itx_lr
offset and length to the offset and length of the BIO from
zvol_write()->zvol_log_write(): these offset and length are later used
to take a range lock in zillog->zl_get_data function: zvol_get_data().
Now suppose we have a ZVOL with blocksize=8K and push 4K writes to
offset 0: we will only be range-locking 0-4096. This means the
ASSERTion we make in dbuf_unoverride() is no longer valid because now
dmu_sync() is called from zilog's get_data functions holding a partial
lock on the dbuf.
Fix this by taking a range lock on the whole block in zvol_get_data().
Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#6238Closes#6315Closes#6356Closes#6477
* Removed zpios kmod, utility, headers and man page.
* Removed unused scripts zpios-profile/*, zpios-test/*,
zpool-config/*, smb.sh, zpios-sanity.sh, zpios-survey.sh,
zpios.sh, and zpool-create.sh.
* Removed zfs-script-config.sh.in. When building 'make' generates
a common.sh with in-tree path information from the common.sh.in
template. This file and sourced by the test scripts and used
for in-tree testing, it is not included in the packages. When
building packages 'make install' uses the same template to
create a new common.sh which is appropriate for the packaging.
* Removed unused functions/variables from scripts/common.sh.in.
Only minimal path information and configuration environment
variables remain.
* Removed unused scripts from scripts/ directory.
* Remaining shell scripts in the scripts directory updated to
cleanly pass shellcheck and added to checked scripts.
* Renamed tests/test-runner/cmd/ to tests/test-runner/bin/ to
match install location name.
* Removed last traces of the --enable-debug-dmu-tx configure
options which was retired some time ago.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#6509
With HPE hardware and hpsa-driven SAS adapters, only a single phy is
reported, but no individual per-port phys (ie. no phy* entry below
port_dir), which breaks topology detection in the current sas_handler
code. Instead, slot information can be derived directly from the port
number. This change implements a new slot keyword "port" similar to
"id" and "lun", and assumes a default phy/port of 0 if no individual
phy entry can be found. It allows to use the "sas_direct" topology with
current HPE Dxxxx and Apollo 45xx JBODs.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Daniel Kobras <d.kobras@science-computing.de>
Closes#6484
Added a 'corrupt' error option that will flip a bit in the data
after a read operation. This is useful for generating checksum
errors at the device layer (in a mirror config for example). It
is also used to validate the diagnosis of checksum errors from
the zfs diagnosis engine.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@intel.com>
Closes#6345
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#494Closes#5769
* Simplify threads, mutexs, cvs and rwlocks
* Update the zk_thread_create() function to use the same trick
as Illumos. Specifically, cast the new pthread_t to a void
pointer and return that as the kthread_t *. This avoids the
issues associated with managing a wrapper structure and is
safe as long as the callers never attempt to dereference it.
* Update all function prototypes passed to pthread_create() to
match the expected prototype. We were getting away this with
before since the function were explicitly cast.
* Replaced direct zk_thread_create() calls with thread_create()
for code consistency. All consumers of libzpool now use the
proper wrappers.
* The mutex_held() calls were converted to MUTEX_HELD().
* Removed all mutex_owner() calls and retired the interface.
Instead use MUTEX_HELD() which provides the same information
and allows the implementation details to be hidden. In this
case the use of the pthread_equals() function.
* The kthread_t, kmutex_t, krwlock_t, and krwlock_t types had
any non essential fields removed. In the case of kthread_t
and kcondvar_t they could be directly typedef'd to pthread_t
and pthread_cond_t respectively.
* Removed all extra ASSERTS from the thread, mutex, rwlock, and
cv wrapper functions. In practice, pthreads already provides
the vast majority of checks as long as we check the return
code. Removing this code from our wrappers help readability.
* Added TS_JOINABLE state flag to pass to request a joinable rather
than detached thread. This isn't a standard thread_create() state
but it's the least invasive way to pass this information and is
only used by ztest.
TEST_ZTEST_TIMEOUT=3600
Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4547Closes#5503Closes#5523Closes#6377Closes#6495
OpenZFS provides a library called tpool which implements thread
pools for user space applications. Porting this library means
the zpool utility no longer needs to borrow the kernel mutex and
taskq interfaces from libzpool. This code was updated to use
the tpool library which behaves in a very similar fashion.
Porting libtpool was relatively straight forward and minimal
modifications were needed. The core changes were:
* Fully convert the library to use pthreads.
* Updated signal handling.
* lmalloc/lfree converted to calloc/free
* Implemented portable pthread_attr_clone() function.
Finally, update the build system such that libzpool.so is no
longer linked in to zfs(8), zpool(8), etc. All that is required
is libzfs to which the zcommon soures were added (which is the way
it always should have been). Removing the libzpool dependency
resulted in several build issues which needed to be resolved.
* Moved zfeature support to module/zcommon/zfeature_common.c
* Moved ratelimiting to to module/zfs/zfs_ratelimit.c
* Moved get_system_hostid() to lib/libspl/gethostid.c
* Removed use of cmn_err() in zcommon source
* Removed dprintf_setup() call from zpool_main.c and zfs_main.c
* Removed highbit() and lowbit()
* Removed unnecessary library dependencies from Makefiles
* Removed fletcher-4 kstat in user space
* Added sha2 support explicitly to libzfs
* Added highbit64() and lowbit64() to zpool_util.c
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#6442
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Sen Haerens <sen@senhaerens.be>
Closes#6444Closes#6445
Check in the DMU whether an object record in a send stream being
received contains an unsupported dnode slot count, and return an
error if it does. Failure to catch an unsupported dnode slot count
would result in a panic when the SPA attempts to increment the
reference count for the large_dnode feature and the pool has the
feature disabled. This is not normally an issue for a well-formed
send stream which would have the DMU_BACKUP_FEATURE_LARGE_DNODE flag
set if it contains large dnodes, so it will be rejected as
unsupported if the required feature is disabled. This change adds a
missing object record field validation.
Add missing stream feature flag checks in
dmu_recv_resume_begin_check().
Consolidate repetitive comment blocks in dmu_recv_begin_check().
Update zstreamdump to print the dnode slot count (dn_slots) for an
object record when running in verbose mode.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#6396
Add a callback to wake all running mmp threads when
zfs_multihost_interval is changed.
This is necessary when the interval is changed from a very large value
to a significantly lower one, while pools are imported that have the
multihost property enabled.
Without this commit, the mmp thread does not wake up and detect the new
interval until after it has waited the old multihost interval time. A
user monitoring mmp writes via the provided kstat would be led to
believe that the changed setting did not work.
Added a test in the ZTS under mmp to verify the new functionality is
working.
Added a test to ztest which starts and stops mmp threads, and calls into
the code to signal sleeping mmp threads, to test for deadlocks or
similar locking issues.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes#6387
"zhack feature stat" performs a read-only import, so the MMP activity
check is not necessary.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes#6388Closes#6389
Turning the multihost property on requires that a hostid be set to allow
ZFS to determine when a foreign system is attemping to import a pool.
The error message instructing the user to set a hostid refers to
genhostid(1).
Genhostid(1) is not available on SUSE Linux. This commit adds a script
modeled after genhostid(1) for those users.
Zgenhostid checks for an /etc/hostid file; if it does not exist, it
creates one and stores a value. If the user has provided a hostid as an
argument, that value is used. Otherwise, a random hostid is generated
and stored.
This differs from the CENTOS 6/7 versions of genhostid, which overwrite
the /etc/hostid file even though their manpages state otherwise.
A man page for zgenhostid is added. The one for genhostid is in (1), but
I put zgenhostid in (8) because I believe it's more appropriate.
The mmp tests are modified to use zgenhostid to set the hostid instead
of using the spl_hostid module parameter. zgenhostid will not replace
an existing /etc/hostid file, so new mmp_clear_hostid calls are
required.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes#6358Closes#6379
When replacing a disk with non-wholedisk spare, we shouldn't zero_label
it. The wholedisk case already skip it. In fact, zero_label function
will fail saying device busy because it's already opened exclusively,
but since there's no error checking, the replace command will succeed,
causing great confusion.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#6369
zpool iostat/status -c is supposed to be restricted
by its search path, but currently isn't. To prevent
arbitrary scripts from being executed, disallow '/'
from commands.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes#6353Closes#6359
CID 165757: Control flow issues (MISSING_BREAK)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes#6348
Add multihost=on|off pool property to control MMP. When enabled
a new thread writes uberblocks to the last slot in each label, at a
set frequency, to indicate to other hosts the pool is actively imported.
These uberblocks are the last synced uberblock with an updated
timestamp. Property defaults to off.
During tryimport, find the "best" uberblock (newest txg and timestamp)
repeatedly, checking for change in the found uberblock. Include the
results of the activity test in the config returned by tryimport.
These results are reported to user in "zpool import".
Allow the user to control the period between MMP writes, and the
duration of the activity test on import, via a new module parameter
zfs_multihost_interval. The period is specified in milliseconds. The
activity test duration is calculated from this value, and from the
mmp_delay in the "best" uberblock found initially.
Add a kstat interface to export statistics about Multiple Modifier
Protection (MMP) updates. Include the last synced txg number, the
timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV
label that received the last MMP update, and the VDEV path. Abbreviated
output below.
$ cat /proc/spl/kstat/zfs/mypool/multihost
31 0 0x01 10 880 105092382393521 105144180101111
txg timestamp mmp_delay vdev_guid vdev_label vdev_path
20468 261337 250274925 68396651780 3 /dev/sda
20468 261339 252023374 6267402363293 1 /dev/sdc
20468 261340 252000858 6698080955233 1 /dev/sdx
20468 261341 251980635 783892869810 2 /dev/sdy
20468 261342 253385953 8923255792467 3 /dev/sdd
20468 261344 253336622 042125143176 0 /dev/sdab
20468 261345 253310522 1200778101278 2 /dev/sde
20468 261346 253286429 0950576198362 2 /dev/sdt
20468 261347 253261545 96209817917 3 /dev/sds
20468 261349 253238188 8555725937673 3 /dev/sdb
Add a new tunable zfs_multihost_history to specify the number of MMP
updates to store history for. By default it is set to zero meaning that
no MMP statistics are stored.
When using ztest to generate activity, for automated tests of the MMP
function, some test functions interfere with the test. For example, the
pool is exported to run zdb and then imported again. Add a new ztest
function, "-M", to alter ztest behavior to prevent this.
Add new tests to verify the new functionality. Tests provided by
Giuseppe Di Natale.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes#745Closes#6279
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/8067
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8173085Closes#6319
Currently, there is no way to pause a scrub. Pausing may
be useful when the pool is busy with other I/O to preserve
bandwidth.
This patch adds the ability to pause and resume scrubbing.
This is achieved by maintaining a persistent on-disk scrub state.
While the state is 'paused' we do not scrub any more blocks.
We do however perform regular scan housekeeping such as
freeing async destroyed and deadlist blocks while paused.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed-by: Serapheim Dimitropoulos <serapheimd@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes#6167
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>
The problem is that zfs_get_data() supplies a stale zgd_bp to
dmu_sync(), which we then nopwrite against.
zfs_get_data() doesn't hold any DMU-related locks, so after it
copies db_blkptr to zgd_bp, dbuf_write_ready() could change
db_blkptr, and dbuf_write_done() could remove the dirty record.
dmu_sync() then sees the stale BP and that the dbuf it not dirty,
so it is eligible for nop-writing.
The fix is for dmu_sync() to copy db_blkptr to zgd_bp after
acquiring the db_mtx. We could still see a stale db_blkptr,
but if it is stale then the dirty record will still exist and
thus we won't attempt to nopwrite.
OpenZFS-issue: https://www.illumos.org/issues/8378
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3127742Closes#6293
Resolves issues discovered when porting to OpenZFS.
* Lint warnings.
* Made dnode_move_impl() large dnode aware. This
functionality is currently unused on Linux.
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#6262
GCC 7.1 with will warn when we're not checking the snprintf()
return code in cases where the buffer could be truncated. This
patch either checks the snprintf return code (where applicable),
or simply disables the warnings (ztest.c).
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#6253
This prints dashes instead of zeros for zero latency values in
'zpool iostat -p'. You'll get zero latencies reported when the
disk is idle, but technically a zero latency is invalid, since you
can't measure the latency of doing nothing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#6210
In the original form of device error injection, it was an all or nothing
situation. To help simulate intermittent error conditions, you can now
specify a real number percentage value. This is also very useful for our
ZFS fault diagnosis testing and for injecting intermittent errors during
load testing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@intel.com>
Closes#6227
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: DHE <git@dehacked.net>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Jack Draak <jackdraak@gmail.com>
Signed-off-by: Kash Pande <kash@tripleback.net>
Closes#6203
- After some ZIL changes 6 years ago zil_slog_limit got partially broken
due to zl_itx_list_sz not updated when async itx'es upgraded to sync.
Actually because of other changes about that time zl_itx_list_sz is not
really required to implement the functionality, so this patch removes
some unneeded broken code and variables.
- Original idea of zil_slog_limit was to reduce chance of SLOG abuse by
single heavy logger, that increased latency for other (more latency critical)
loggers, by pushing heavy log out into the main pool instead of SLOG. Beside
huge latency increase for heavy writers, this implementation caused double
write of all data, since the log records were explicitly prepared for SLOG.
Since we now have I/O scheduler, I've found it can be much more efficient
to reduce priority of heavy logger SLOG writes from ZIO_PRIORITY_SYNC_WRITE
to ZIO_PRIORITY_ASYNC_WRITE, while still leave them on SLOG.
- Existing ZIL implementation had problem with space efficiency when it
has to write large chunks of data into log blocks of limited size. In some
cases efficiency stopped to almost as low as 50%. In case of ZIL stored on
spinning rust, that also reduced log write speed in half, since head had to
uselessly fly over allocated but not written areas. This change improves
the situation by offloading problematic operations from z*_log_write() to
zil_lwb_commit(), which knows real situation of log blocks allocation and
can split large requests into pieces much more efficiently. Also as side
effect it removes one of two data copy operations done by ZIL code WR_COPIED
case.
- While there, untangle and unify code of z*_log_write() functions.
Also zfs_log_write() alike to zvol_log_write() can now handle writes crossing
block boundary, that may also improve efficiency if ZPL is made to do that.
Sponsored by: iXsystems, Inc.
Authored by: Alexander Motin <mav@FreeBSD.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Steven Hartland <steven.hartland@multiplay.co.uk>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/7578
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/aeb13acCloses#6191
Commit torvalds/linux@9e8925b6 allowed for kernels to be built
without support for mandatory locking (MS_MANDLOCK). This will
result in 'zfs mount' failing when the nbmand=on property is set
if the kernel is built without CONFIG_MANDATORY_FILE_LOCKING.
Unfortunately we can not reliably detect prior to the mount(2) system
call if the kernel was built with this support. The best we can do
is check if the mount failed with EPERM and if we passed 'mand'
as a mount option and then print a more useful error message. e.g.
filesystem 'tank/fs' has the 'nbmand=on' property set, this mount
option may be disabled in your kernel. Use 'zfs set nbmand=off'
to disable this option and try to mount the filesystem again.
Additionally, switch the default error message case to use
strerror() to produce a more human readable message.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4729Closes#6199
Allow new members to be added to a pool mixing raidz and mirror vdevs
without giving -f, as long as they have matching redundancy. This case
was missed in #5915, which only handled zpool create.
Add zfstest zpool_add_010_pos.ksh, with test of zpool create
followed by zpool add of mixed raidz and mirror vdevs.
Add some more mixed raidz and mirror cases to zpool_create_006_pos.ksh.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Haakan Johansson <f96hajo@chalmers.se>
Issue #5915Closes#6181
Users can now provide their own scripts to be run
with 'zpool iostat/status -c'. User scripts should be
placed in ~/.zpool.d to be included in zpool's
default search path.
Provide a script which can be used with
'zpool iostat|status -c' that will return the type of
device (hdd, sdd, file).
Provide a script to get various values from smartctl
when using 'zpool iostat/status -c'.
Allow users to define the ZPOOL_SCRIPTS_PATH
environment variable which can be used to override
the default 'zpool iostat/status -c' search path.
Allow the ZPOOL_SCRIPTS_ENABLED environment
variable to enable or disable 'zpool status/iostat -c'
functionality.
Use the new smart script to provide the serial command.
Install /etc/sudoers.d/zfs file which contains the sudoer
rule for smartctl as a sample.
Allow 'zpool iostat/status -c' tests to run in tree.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes#6121Closes#6153
Enable most of the remaining test cases which were previously
disabled. The required fixes are as follows:
* cache_001_pos - No changes required.
* cache_010_neg - Updated to use losetup under Linux. Loopback
cache devices are allowed, ZVOLs as cache devices are not.
Disabled until all the builders pass reliably.
* cachefile_001_pos, cachefile_002_pos, cachefile_003_pos,
cachefile_004_pos - Set set_device_dir path in cachefile.cfg,
updated CPATH1 and CPATH2 to reference unique files.
* zfs_clone_005_pos - Wait for udev to create volumes.
* zfs_mount_007_pos - Updated mount options to expected Linux names.
* zfs_mount_009_neg, zfs_mount_all_001_pos - No changes required.
* zfs_unmount_005_pos, zfs_unmount_009_pos, zfs_unmount_all_001_pos -
Updated to expect -f to not unmount busy mount points under Linux.
* rsend_019_pos - Observed to occasionally take a long time on both
32-bit systems and the kmemleak builder.
* zfs_written_property_001_pos - Switched sync(1) to sync_pool.
* devices_001_pos, devices_002_neg - Updated create_dev_file() helper
for Linux.
* exec_002_neg.ksh - Fixed mmap_exec.c to preserve errno. Updated
test case to expect EPERM from Linux as described by mmap(2).
* grow_pool_001_pos - Adding missing setup.ksh and cleanup.ksh
scripts from OpenZFS.
* grow_replicas_001_pos.ksh - Added missing $SLICE_* variables.
* history_004_pos, history_006_neg, history_008_pos - Fixed by
previous commits and were not enabled. No changes required.
* zfs_allow_010_pos - Added missing spaces after assorted zfs
commands in delegate_common.kshlib.
* inuse_* - Illumos dump device tests skipped. Remaining test
cases updated to correctly create required partitions.
* large_files_001_pos - Fixed largest_file.c to accept EINVAL
as well as EFBIG as described in write(2).
* link_count_001 - Added nproc to required commands.
* umountall_001 - Updated to use umount -a.
* online_offline_001_* - Pull in OpenZFS change to file_trunc.c
to make the '-c 0' option run the test in a loop. Included
online_offline.cfg file in all test cases.
* rename_dirs_001_pos - Updated to use the rename_dir test binary,
pkill restricted to exact matches and total runtime reduced.
* slog_013_neg, write_dirs_002_pos - No changes required.
* slog_013_pos.ksh - Updated to use losetup under Linux.
* slog_014_pos.ksh - ZED will not be running, manually degrade
the damaged vdev as expected.
* nopwrite_varying_compression, nopwrite_volume - Forced pool
sync with sync_pool to ensure up to date property values.
* Fixed typos in ZED log messages. Refactored zed_* helper
functions to resolve all-syslog exit=1 errors in zedlog.
* zfs_copies_005_neg, zfs_get_004_pos, zpool_add_004_pos,
zpool_destroy_001_pos, largest_pool_001_pos, clone_001_pos.ksh,
clone_001_pos, - Skip until layering pools on zvols is solid.
* largest_pool_001_pos - Limited to 7eb pool, maximum
supported size in 8eb-1 on Linux.
* zpool_expand_001_pos, zpool_expand_003_neg - Requires
additional support from the ZED, updated skip reason.
* zfs_rollback_001_pos, zfs_rollback_002_pos - Properly cleanup
busy mount points under Linux between test loops.
* privilege_001_pos, privilege_003_pos, rollback_003_pos,
threadsappend_001_pos - Skip with log_unsupported.
* snapshot_016_pos - No changes required.
* snapshot_008_pos - Increased LIMIT from 512K to 2M and added
sync_pool to avoid false positives.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#6128
This addition will enable us to sync an open TXG to the main pool
on demand. The functionality is similar to 'sync(2)' but 'zpool sync'
will return when data has hit the main storage instead of potentially
just the ZIL as is the case with the 'sync(2)' cmd.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes#6122
This patch adds a '-f' option to 'zpool offline' to fault a vdev
instead of bringing it offline. Unlike the OFFLINE state, the
FAULTED state will trigger the FMA code, allowing for things like
autoreplace and triggering the slot fault LED. The -f faults
persist across imports, unless they were set with the temporary
(-t) flag. Both persistent and temporary faults can be cleared
with zpool clear.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#6094
Enable additional test cases, in most cases this required a few
minor modifications to the test scripts. In a few cases a real
bug was uncovered and fixed. And in a handful of cases where pools
are layered on pools the test case will be skipped until this is
supported. Details below for each test case.
* zpool_add_004_pos - Skip test on Linux until adding zvols to pools
is fully supported and deadlock free.
* zpool_add_005_pos.ksh - Skip dumpadm portion of the test which isn't
relevant for Linux. The find_vfstab_dev, find_mnttab_dev, and
save_dump_dev functions were updated accordingly for Linux. Add
O_EXCL to the in-use check to prevent the -f (force) option from
working for mounted filesystems and improve the resulting error.
* zpool_add_006_pos - Update test case such that it doesn't depend
on nested pools. Switch to truncate from mkfile to reduce space
requirements and speed up the test case.
* zpool_clear_001_pos - Speed up test case by filling filesystem to
25% capacity.
* zpool_create_002_pos, zpool_create_004_pos - Use sparse files for
file vdevs in order to avoid increasing the partition size.
* zpool_create_006_pos - 6ba1ce9 allows raidz+mirror configs with
similar redundancy. Updating the valid_args and forced_args cases.
* zpool_create_008_pos - Disable overlapping partition portion.
* zpool_create_011_neg - Fix to correctly create the extra partition.
Modified zpool_vdev.c to use fstat64_blk() wrapper which includes
the st_size even for block devices.
* zpool_create_012_neg - Updated to properly find swap devices.
* zpool_create_014_neg, zpool_create_015_neg - Updated to use
swap_setup() and swap_cleanup() wrappers which do the right thing
on Linux and Illumos. Removed '-n' option which succeeds under
Linux due to differences in the in-use checks.
* zpool_create_016_pos.ksh - Skipped test case isn't useful.
* zpool_create_020_pos - Added missing / to cleanup() function.
Remove cache file prior to test to ensure a clean environment
and avoid false positives.
* zpool_destroy_001_pos - Removed test case which creates a pool on
a zvol. This is more likely to deadlock under Linux and has never
been completely supported on any platform.
* zpool_destroy_002_pos - 'zpool destroy -f' is unsupported on Linux.
Mount point must not be busy in order to unmount them.
* zfs_destroy_001_pos - Handle EBUSY error which can occur with
volumes when racing with udev.
* zpool_expand_001_pos, zpool_expand_003_neg - Skip test on Linux
until adding zvols to pools is fully supported and deadlock free.
The test could be modified to use loop-back devices but it would
be preferable to use the test case as is for improved coverage.
* zpool_export_004_pos - Updated test case to such that it doesn't
depend on nested pools. Normal file vdev under /var/tmp are fine.
* zpool_import_all_001_pos - Updated to skip partition 1, which is
known as slice 2, on Illumos. This prevents overwriting the
default TESTPOOL which was causing the failure.
* zpool_import_002_pos, zpool_import_012_pos - No changes needed.
* zpool_remove_003_pos - No changes needed
* zpool_upgrade_002_pos, zpool_upgrade_004_pos - Root cause addressed
by upstream OpenZFS commit 3b7f360.
* zpool_upgrade_007_pos - Disabled in test case due to known failure.
Opened issue https://github.com/zfsonlinux/zfs/issues/6112
* zvol_misc_002_pos - Updated to to use ext2.
* zvol_misc_001_neg, zvol_misc_003_neg, zvol_misc_004_pos,
zvol_misc_005_neg, zvol_misc_006_pos - Moved to skip list, these
test case could be updated to use Linux's crash dump facility.
* zvol_swap_* - Updated to use swap_setup/swap_cleanup helpers.
File creation switched from /tmp to /var/tmp. Enabled minimal
useful tests for Linux, skip test cases which aren't applicable.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3484
Issue #5634
Issue #2437
Issue #5202
Issue #4034Closes#6095
This allows users to specify "-o property=value" to override and
"-x property" to exclude properties when receiving a zfs send stream.
Both native and user properties can be specified.
This is useful when using zfs send/receive for periodic
backup/replication because it lets users change properties such as
canmount, mountpoint, or compression without modifying the source.
References:
https://www.illumos.org/issues/2745https://www.illumos.org/issues/3753
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#1350Closes#5349
CID 161638: Resource leak (RESOURCE_LEAK)
Ensure the string array in print_zpool_script_help
is freed in cases when there is an error.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes#6111
The proposed debugging enhancements in zfsonlinux/spl#587
identified the following missing *_destroy/*_fini calls.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes#5428
This commit allow higher ashift values (up to 16) in 'zpool create'
The ashift value was previously limited to 13 (8K block) in b41c990
because the limited number of uberblocks we could fit in the
statically sized (128K) vdev label ring buffer could prevent the
ability the safely roll back a pool to recover it.
Since b02fe35 the largest uberblock size we support is 8K: this
allow us to store a minimum number of 16 uberblocks in the vdev
label, even with higher ashift values.
Additionally change 'ashift' pool property behaviour: if set it will
be used as the default hint value in subsequent vdev operations
('zpool add', 'attach' and 'replace'). A custom ashift value can still
be specified from the command line, if desired.
Finally, fix a bug in add-o_ashift.ksh caused by a missing variable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#2024Closes#4205Closes#4740Closes#5763
* Add zfs_nicebytes() to print human-readable sizes
Some 'zfs', 'zpool' and 'zdb' output strings can be confusing to the
user when no units are specified. This add a new zfs_nicenum_format
"ZFS_NICENUM_BYTES" used to print bytes in their human-readable form.
Additionally, update some test cases to use machine-parsable 'zfs get'.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#2414Closes#3185Closes#3594Closes#6032
OpenZFS 7252 - compressed zfs send / receive
OpenZFS 7628 - create long versions of ZFS send / receive options
Authored by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: David Quigley <dpquigl@davequigley.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed by: David Quigley <dpquigl@davequigley.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Ported-by: bunder2015 <omfgbunder@gmail.com>
Ported-by: Don Brady <don.brady@intel.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Porting Notes:
- Most of 7252 was already picked up during ABD work. This
commit represents the gap from the final commit to openzfs.
- Fixed split_large_blocks check in do_dump()
- An alternate version of the write_compressible() function was
implemented for Linux which does not depend on fio. The behavior
of fio differs significantly based on the exact version.
- mkholes was replaced with truncate for Linux.
OpenZFS-issue: https://www.illumos.org/issues/7252
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5602294Closes#6067
This patch updates the "zpool status/iostat -c" commands to only run
"pre-baked" scripts from the /etc/zfs/zpool.d directory (or wherever
you install to). The scripts can only be run from -c as an unprivileged
user (unless the ZPOOL_SCRIPTS_AS_ROOT environment var is
set by root). This was done to encourage scripts to be written is such
a way that normal users can use them, and to be cautious. If your
script needs to run a privileged command, consider adding the
appropriate line in /etc/sudoers. See zpool(8) for an example of how
to do this.
The patch also allows the scripts to output custom column names. If
the script outputs a line like:
name=value
then "name" is used for the column name, and "value" is its value.
Multiple columns can be specified by outputting multiple lines. Column
names and values can have spaces. If the value is empty, a dash (-) is
printed instead.
After all the "name=value" lines are read (if any), zpool will take the
next the next line of output (if any) and print it without a column
header. After that, no more lines will be processed. This can be
useful for printing errors.
Lastly, this patch also disables the -c option with the latency and
request size histograms, since it produced awkward output and made the
code harder to maintain.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#5852
* bookmarks are not supported when sending all intermediary snaps (-I)
* add missing compressed (-c) option to the 'zfs' help and manpage
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#6028
Udev may fail to create the expected symbolic links in
/dev/disk/by-vdev on systems with the
device-mapper-multipath-0.4.9-100.el6 package installed. This affects
RHEL 6.9 and possibly other downstream distributions.
That version of the multipath command may incorrectly list a drive
state as "unkown" instead of "running". The issue was introduced
in the patch for https://bugzilla.redhat.com/show_bug.cgi?id=1401769
The vdev_id udev helper uses the state reported by "multipath -l" to
detect an online component disk of a multipath device in order to
resolve its physical slot and enclosure. Changing the command
invocation to "multipath -ll" works around the above issue by causing
multipath to consult additional sources of information to determine
the drive state.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#6039
This lets users create a bookmark from the command line by its name
only, without the need to specify the dataset path which is extacted
from the snapshot parameter.
These commands are now equivalent:
zfs bookmark poolname/fs@snap poolname/fs#bookmark
zfs bookmark @snap poolname/fs#bookmark
zfs bookmark poolname/fs@snap \#bookmark
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#3665Closes#6027
Authored by: Richard Yao <ryao@gentoo.org>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Porting Notes:
This was already implemented in ZFS on Linux. This patch
is to resolved the deltas present in our version.
OpenZFS-issue: https://www.illumos.org/issues/6392
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9bb97deCloses#6020
Authored by: Alan Somers <asomers@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>
There are two reasons:
1) Finding a znode's path is slower than printing any other znode
information at verbosity < 5.
2) On a corrupted pool like the one mentioned below, zdb will crash when it
tries to determine the znode's path. But with this patch, zdb can still
extract useful information from such pools.
OpenZFS-issue: https://www.illumos.org/issues/7900
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/2b0dee1Closes#6016
Authored by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Porting Notes:
- Replaced zdb.8 with upstream mdoc zdb.1m version. Updated to
include Linux specific features: -V verbatium imports and
improved label printing (-u, and -l).
- Minor changes to `zdb -h` output to honor 80 character limit.
OpenZFS-issue: https://www.illumos.org/issues/6410
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ed61ec1Closes#6006
Be sure to invalidate a vdev's cache before performing
a zpool labelclear. There are cases where the cache is
stale because we did some operation that bypassed it,
and since we are doing an open with only O_RDWR, we
should invalidate it to be safe.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes#6009
GCC 4.9.4 complains about implicit function declarations when building
against musl on Gentoo.
zed_log.c: In function ‘zed_log_pipe_open’:
zed_log.c:69:7: warning: implicit declaration of function ‘getpid’
(int)getpid());
^
zed_log.c:71:2: warning: implicit declaration of function ‘pipe’
if (pipe(_ctx.pipe_fd) < 0)
^
zed_log.c: In function ‘zed_log_pipe_close_reads’:
zed_log.c:90:2: warning: implicit declaration of function ‘close’
if (close(_ctx.pipe_fd[0]) < 0)
^
zed_log.c: In function ‘zed_log_pipe_wait’:
zed_log.c:141:3: warning: implicit declaration of function ‘read’
n = read(_ctx.pipe_fd[0], &c, sizeof (c));
The [-Wimplicit-function-declaration] at the end of each warning has
been removed to meet comment style requirements.
The man pages say to include <sys/types.h> and <unistd.h>. Doing that
silences the warnings.
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#5993
Authored by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Porting Notes:
- Updated 'zpool labelclear' and 'zdb -l' such that they attempt
to find a vdev given solely its short name. This behavior is
consistent with the upstream OpenZFS code and the test cases
depend on it. The actual implementation differs slightly due
to device naming conventions on Linux.
- auto_online_001_pos, auto_replace_001_pos and add-o_ashift
test cases updated to expect failure when no label exists.
- read_efi_label() and zpool_label_disk_check() are read-only
operations and should use O_RDONLY at open time to enforce this.
- zpool_label_disk() and zpool_relabel_disk() write the partition
information using O_DIRECT an fsync() and page cache invalidation
to ensure a consistent view of the device.
- dump_label() in zdb should invalidate the page cache in order
to get the authoritative label from disk.
OpenZFS-issue: https://www.illumos.org/issues/6865
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c95076cCloses#5981
CID 161288: Null pointer dereferences (REVERSE_INULL)
Ensure physpath != NULL before the strcmp.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes#5974
Also included are updates to auto-online test
Automated auto-replace test to go along with ZED FMA integration
(PR 4673) auto-replace_001.pos works using a scsi_debug device
(the only usable virtual device currently due to whole_disk var
needing to be set)
Functionality for automated FMA auto-replace test to work with
scsi_debug devs: Some functionality/exceptions needed to be
added for automation of auto-replace to work correctly.
In the test an alias vdev_id rule is added for any scsi_debug
device which sets the phys_path="scsidebug" after a udevadm
trigger command.
A symlink is created for the vdev_id.conf file (in /etc/zfs/ by
default) to be used in-tree for the test suite
(/var/tmp/zfs/vdev_id.conf). "./scripts/zfs-helpers.sh -i" needs
to be run before fault tests in the ZTS (to use udev rules in-tree)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: David Quigley <david.quigley@intel.com>
Signed-off-by: Sydney Vanda <sydney.m.vanda@intel.com>
Closes#5944
Allow a pool to be created with both raidz and mirror members,
without giving -f, as long as they have matching redundancy.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Haakan T Johansson <f96hajo@chalmers.se>
Closes#5915
CID 161264: Uninitialized variables (UNINIT)
In _zed_event_add_nvpair, when handling DATA_TYPE_UINT64,
we should be using i64 throughout the entire case.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@intel.com>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes#5964
* Add ZPOOL pool state to zfs_post_common to
allow differentiation between export and destroy
by zedlets.
* Add pool name as standard export This ensures
pool name is exported to zedlets.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Nathaniel Clark <nathaniel.l.clark@intel.com>
Closes#5942
Wherever possible it's best to avoid depending on a linear ABD.
Update the code accordingly in the following areas.
- vdev_raidz
- zio, zio_checksum
- zfs_fm
- change abd_alloc_for_io() to use abd_alloc()
Reviewed-by: David Quigley <david.quigley@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes#5668
df83110 added the ability to specify a custom "ashift" value from the command
line in 'zpool add' and 'zpool attach'. This commit adds additional checks to
the provided ashift to prevent invalid values from being used, which could
result in disastrous consequences for the whole pool.
Additionally provide ASHIFT_MAX and ASHIFT_MIN definitions in spa.h.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#5878
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: SenH <sen@senhaerens.be>
Closes#5933
Add trivial libzfs_fru_compare() function which can be used when
HAVE_LIBTOPO is not defined. The only caller is find_vdev() and
this function should never be reached because search_fru must be
NULL unless HAVE_LIBTOPO is defined.
Rename _HAS_FMD_TOPO to existing HAVE_LIBTOPO which was
originally added for this purpose. This macro will never be defined.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#5402Closes#5909
When a pool is suspended it's impossible to read the list
of damaged files from disk. This would result in a generic
misleading "insufficient permissions" error message.
Update zpool_get_errlog() to use the standard zpool error
logging functions to generate a useful error message. In
this case:
errors: List of errors unavailable: pool I/O is currently suspended
This patch does not address the related issue of potentially
not being able to resume a suspend pool when the underlying
device names have changed.
Additionally, remove the error handling from zfs_alloc()
in zpool_get_errlog() for readability since this function
can never fail.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4031Closes#5731Closes#5907
Several functions were renamed when ZFS was originally ported to
Linux. Revert the code to the original names to minimize the
delta with upstream OpenZFS.
zfs_sb_teardown -> zfsvfs_teardown
zfs_sb_create -> zfsvfs_create
zfs_sb_setup -> zfsvfs_setup
zfs_sb_free -> zfsvfs_free
get_zfs_sb -> getzfsvfs
zfs_sb_hold -> zfsvfs_hold
zfs_sb_rele -> zfsvfs_rele
zfs_sb_prune_aliases -> zfs_prune_aliases (Linux-only)
zfs_sb_prune -> zfs_prune (Linux only)
Align the zfs_vnops.h and zfs_vfsops.h with upstream as much
as possible. Several prototypes were removed and those that
remain were reordered.
Move the EXPORT_SYMBOL lines to the end of the source files
for consistency with the other source files.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
arc_summary and dbufstat should have two spaces
after their last function definitions.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes#5881
Enable shellcheck to run on zed scripts,
paxcheck.sh, zfs-tests.sh, zfs.sh, and zloop.sh.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes#5812
For zdb -l, detect when the configuration nvlist in some label l (l>0)
is the same as a configuration already dumped. If so, do not dump it.
Make a similar check when dumping Uberblocks for zdb -lu. Check whether
a label already dumped contains an identical Uberblock. If so, do not
dump the Uberblock.
When dumping a configuration or Uberblock, state which labels it is
found in (0-3), for example: labels = 1 2 3
Detecting redundant uberblocks or configurations is accomplished by
calculating checksums of the uberblocks and the packed nvlists
containing the configuration.
If there is nothing unique to be dumped for a label (ie the
configuration and uberblocks have checksums matching those already
dumped) print nothing for that label.
With additional l's or u's, increase verbosity as follows:
-l Dump each unique configuration only once.
Indicate which labels it appears in.
-ll In addition, dump label space usage stats.
-lll Dump every configuration, unique or not.
-u Dump each unique, valid, uberblock only once.
Indicate which labels it appears in.
-uu In addition, state which slots are invalid.
-uuu Dump every uberblock, unique or not.
-uuuu Dump the uberblock blockpointer (used to be -uuu)
Make exit values conform to the manual page. Failing to unpack a
configuration nvlist is considered an error, as well as failing to open
or read from the device.
Add three tests, zdb_00{3,4,5}_pos to verify the above functionality.
An example of the output:
------------------------------------
LABEL 0
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 880
< ... redacted ... >
features_for_read:
com.delphix:hole_birth
com.delphix:embedded_data
labels = 0
Uberblock[0]
magic = 0000000000bab10c
version = 5000
txg = 0
guid_sum = 3038694082047428541
timestamp = 1487715500 UTC = Tue Feb 21 14:18:20 2017
labels = 0 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 772
guid_sum = 9045970794941528051
timestamp = 1487727291 UTC = Tue Feb 21 17:34:51 2017
labels = 0
< ... redacted ... >
------------------------------------
LABEL 1
------------------------------------
version: 5000
name: 'pool'
state: 1
txg: 14
< ... redacted ... >
com.delphix:embedded_data
labels = 1 2 3
Uberblock[4]
magic = 0000000000bab10c
version = 5000
txg = 4
guid_sum = 7793930272573252584
timestamp = 1487727521 UTC = Tue Feb 21 17:38:41 2017
labels = 1 2 3
< ... redacted ... >
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes#5738
If the LED is being accessed by another process when we try to update
it, the update will be lost. Add a retry loop which will read the state
of the LED and update it until the LED is in the correct state. The
number of times this will occur is limited to ensure that the ZEDlet
won't hang ZED.
Refactor to remove duplication so setting of the LED occurs in only one
place.
Cleanup a couple of the warnings generated by shellcheck which weren't
the result of specific choices by the author. Several notes and warnings
are still present but removing them would make the code less clear or
require adding lines to tell shellcheck to ignore the warning.
Remove ",i" from the documentation at the top of the file which appears
to be a typographic error.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Christopher Voltz <christopher.voltz@hpe.com>
Closes#5795
Google moved their style guides to GitHub. Update the shell style guide
URL to the new location.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Christopher Voltz <christopher.voltz@hpe.com>
Closes#5797
- Pass $VDEV_ENC_SYSFS_PATH to 'zpool [iostat|status] -c' to include
enclosure LED sysfs path.
- Set LEDs correctly after import. This includes clearing any erroniously
set LEDs prior to the import, and setting the LED for any UNAVAIL drives.
- Include symlink for vdev_attach-led.sh in Makefile.am.
- Print the VDEV path in all-syslog.sh, and fix it so the pool GUID actually
prints.
Reviewed-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#5716Closes#5751
CID 155928: Integer handling issues (DIVIDE_BY_ZERO)
In the current vdev label, the leaf count is always non-zero
but it doesn't hurt to check the count for future proofing.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@intel.com>
Closes#5749
Authored by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Ported-by: Don Brady <don.brady@intel.com>
OpenZFS-issue: https://www.illumos.org/issues/6866
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3e4fae5Closes#5730
Porting Notes:
- Omitted the illumos-specific `/dev/dsk` and `/dev/rdsk`
path conversions since they don't apply on linux.
When dumping the ZFS vdev label with 'zdb -ll', also dump
some nvlist stats to help analyze the current footprint.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Don Brady <don.brady@intel.com>
Closes#5724
This is effectively dead code for the Linux implementation which can
be removed to improve readability. We want to linter to check the
real production/debug build as much as possible.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes#5722
This patch adds the necessary infrastructure for ABD to make use
of the vectorized fletcher 4 routines.
- export ABD compatible interface from fletcher_4
- add ABD fletcher_4 tests for data and metadata ABD types.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Original-patch-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: David Quigley <david.quigley@intel.com>
Closes#5589
There are two cases:
1. if an invalid flag is passed, and
2. if a valid flag is not given a parameter.
In the case of (1), the flag is either short or long. For short flags,
optopt contains the character of the flag. For long, it contains zero,
and we can access the long flag using argv and optind.
In the case of (2), if the flag is short, optopt contains the character
of the flag. If the flag is long, the value in the 4th column of the
long_options table, for that flag, is returned.
We could case over all those values, or we could simply use argv and
optind again.
Note that in the case of something like `--resume`, which is also `-t`,
"t" will be returned if an argument is not provided; so the error
message will say `'t': argument not provided` or similar. This could be
fixed by making it so long and short options don't use the same
character flag, and then combining them in the switch/case statement,
but I didn't think the ugliness of the code would be worth the small
usability enhancement.
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>
OpenZFS-issue: https://www.illumos.org/issues/7742
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/6d69b40Closes#5702
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Porting Notes: Moved reference_tracking_enable and
reference_history outside of ZFS_DEBUG.
OpenZFS-issue: https://www.illumos.org/issues/7545
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4dd77f9Closes#5701
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>
OpenZFS-issue: https://www.illumos.org/issues/7502
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3c65d1Closes#5677
Porting notes:
- 'zfs_dbgmsg_print()' reintroduced to userspace.
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>
OpenZFS-issue: https://www.illumos.org/issues/7277
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/29bdd2fCloses#5684
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>
OpenZFS-issue: https://www.illumos.org/issues/7163
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f34284dCloses#4484Closes#5661
Authored by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: - Matthew Ahrens <mahrens@delphix.com>
Reviewed by: - Paul Dagnelie <pcd@delphix.com>
Approved by: - Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>
OpenZFS-issue: https://www.illumos.org/issues/7253
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/754998cCloses#5660
Porting notes:
- Many changes were already made in ZoL (for ex. in d4ed66734).
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>
OpenZFS-issue: https://www.illumos.org/issues/6872
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f83b46bCloses#5640
Porting notes:
- Several direct callers of zk_thread_create() are passing TS_RUN for the
length. The `len` and `state` were inverted,this commit fixes them.
Authored by: Eli Rosenthal <eli.rosenthal@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov mail@gmelikov.ru
OpenZFS-issue: https://www.illumos.org/issues/6871
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8fc9228Closes#5621
The local mg variable is unused in non-debug builds.
Wrap the variable in ASSERTV() so that it's only present
in the debug build. Introduced by OpenZFS 7303.
Reviewed-by: Don Brady <don.brady@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#5616
Porting Notes:
- Many of the fixes proposed by this patch were already applied.
In the cases where a different but equivalent fix was made the
code was updated with the OpenZFS version to minimize differences.
Authored by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/6550
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c16bcc4Closes#5591
Porting Notes:
- Many of the fixes proposed by this patch were already applied.
In the cases where a different but equivalent fix was made the
code was updated with the OpenZFS version to minimize differences.
- The zpool_get_vdev_by_name() function was previously removed
by commit 235db0a.
Authored by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Haakan T Johansson <f96hajo@chalmers.se>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/6551
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b327cd3Closes#5590
This change introduces a new weighting algorithm to improve
metaslab selection. The new weighting algorithm relies on the
SPACEMAP_HISTOGRAM feature. As a result, the metaslab weight
now encodes the type of weighting algorithm used (size-based
vs segment-based).
Porting Notes: The metaslab allocation tracing code is conditionally
removed on linux (dependent on mdb debugger).
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Chris Siden <christopher.siden@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Pavel Zakharov pavel.zakharov@delphix.com
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Don Brady <don.brady@intel.com>
OpenZFS-issue: https://www.illumos.org/issues/7303
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d5190931bdCloses#5404
Authored by: David Schwartz <dschwartz783@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>
I find that this is a lot easier to read. "not don't close" is somewhat tough on the eyes.
OpenZFS-issue: https://www.illumos.org/issues/6637
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d189620Closes#5572
Running arc_summary.py with a l2arc cache device around produces
the following error:
Traceback (most recent call last):
File "/usr/bin/arc_summary.py", line 1148, in <module>
main()
File "/usr/bin/arc_summary.py", line 1144, in main
page(Kstat)
File "/usr/bin/arc_summary.py", line 724, in _l2arc_summary
arc["l2_arc_evicts"]["reading"] > 0:
TypeError: unorderable types: str() > int()
This is due to arc["l2_arc_evicts"]['lock_retries'] and
arc["l2_arc_evicts"]["reading"] both being strings, returned
from fHits() earlier. Rather than adding them up and checking
if the result is > 0, this checks if either string is != '0'.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes#5538
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Haakan T Johansson <f96hajo@chalmers.se>
Closes#5547Closes#5543
CID 147587: Out-of-bounds read
Future changes may cause an array overrun of 4096 bytes at byte
offset 4096 by dereferencing pointer dstp. Adding this additional
check ensures correctness.
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: GeLiXin <ge.lixin@zte.com.cn>
Closes#5297
zpool iostat allows you to specify only certain vdevs to display.
Currently, if you run 'zpool iostat -c CMD vdev1 vdev2 ...'
on specific vdevs, it will actually run the command on *all* vdevs,
and just display the results for the vdevs you specify. This patch
corrects the behavior to only run the command on the specified vdevs,
and also enables the zpool_iostat_005_pos.ksh tests.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#5443
CID 147534: Negative array index read
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes#5467
Enable picky cstyle checks and resolve the new warnings. The vast
majority of the changes needed were to handle minor issues with
whitespace formatting. This patch contains no functional changes.
Non-whitespace changes are as follows:
* 8 times ; to { } in for/while loop
* fix missing ; in cmd/zed/agents/zfs_diagnosis.c
* comment (confim -> confirm)
* change endline , to ; in cmd/zpool/zpool_main.c
* a number of /* BEGIN CSTYLED */ /* END CSTYLED */ blocks
* /* CSTYLED */ markers
* change == 0 to !
* ulong to unsigned long in module/zfs/dsl_scan.c
* rearrangement of module_param lines in module/zfs/metaslab.c
* add { } block around statement after for_each_online_node
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Håkan Johansson <f96hajo@chalmers.se>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#5465
Do not force VDEV_NAME_TYPE_ID in max_width(), instead add it
in the relevant calls to max_width().
The first location of max_width() where VDEV_NAME_TYPE_ID is
now added in show_import() is followed by print_import_config() and
print_logs(). Both these print children vdev names that have been
retrieved using an explicit VDEV_NAME_TYPE_ID added.
The second location is in status_callback(). This is followed by
print_status_config(), print_logs(), print_l2cache(), and
print_spares(). For l2cache and spares it should not matter as there
are no mirror-X or raidz-X involved. print_status_config() as above
retrieves the name using explicit VDEV_NAME_TYPE_ID before
calling itself to print children.
The call of max_width() in get_namewidth() is not changed, as this is
used by zpool_do_iostat(), followed by print_iostat(), which does not
add VDEV_NAME_TYPE_ID.
Overall, we should consider adding VDEV_NAME_TYPE_ID to the
relevant name_flags / cb_name_flags fields, and remove the explicit
adding in called routines.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Haakan T Johansson <f96hajo@chalmers.se>
Closes#5401
ZFS currently uses ARC buffers which are backed by virtual memory.
While functional, there are some major problems with this approach
which can be observed on all OpenZFS platforms. ABD was designed
to address these issues and includes contributions from OpenZFS
developers from multiple platforms.
While all OpenZFS platforms will benefit from ABD this functionality
is critical for Linux. Unlike the other OpenZFS platforms the Linux
kernel discourages extensive use of virtual memory. The provided
interfaces are not optimized for frequent allocations from the virtual
address space. To maintain good performance a kmem cache is
used which contains relatively long lived slabs backed by virtual
memory. The downside to the approach is that those slabs can
become highly fragmented resulting in an inefficient use of memory.
Another issue is that on 32-bit systems the available virtual
address space in the kernel is only a small fraction of total
system memory. This means the ARC size is highly constrained
which hurts performance and make allocating memory difficult
and OOMs more likely.
ABD is designed to address these issues by using scatter lists
of pages for data buffers. This removes the need for slabs
which resolves the fragmentation issue. It also allows high
memory pages to be allocated which alleviates the virtual
address space pressure on 32-bit systems.
For metadata buffers, which are small, linear ABDs are allocated
from the slab. This is preferable because there are many places
in the code which expect to be able to read from a given offset
in the buffer. Using linear ABDs means none of that code needs
to be modified. The majority of these buffers are allocated with
kmalloc so there's minimal impact of the virtual address space.
Tested-by: Kash Pande <kash@tripleback.net>
Tested-by: kernelOfTruth <kerneloftruth@gmail.com>
Tested-by: RageLtMan <rageltman@sempervictus>
Tested-by: DHE <git@dehacked.net>
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: David Quigley <david.quigley@intel.com>
Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Isaac Huang <he.huang@intel.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3441Closes#5135
Enable vectorized raidz code on ABD buffers. The avx512f,
avx512bw, neon and aarch64_neonx2 are disabled in this commit.
With the exception of avx512bw these implementations are
updated for ABD in the subsequent commits.
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Otherwise, the checksum function pointer isn't initialized.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#5411
This patch adds a command (-c) option to zpool status and zpool iostat. The
-c option allows you to run an arbitrary command on each vdev and display
the first line of output in zpool status/iostat. The environment vars
VDEV_PATH and VDEV_UPATH are set to the vdev's path and "underlying path"
before running the command. For device mapper, multipath, or partitioned
vdevs, VDEV_UPATH is the actual underlying /dev/sd* disk. This can be useful
if the command you're running requires a /dev/sd* device.
The patch also uses /sys/block/<dev>/slaves/ to lookup the underlying device
instead of using libdevmapper. This not only removes the libdevmapper
requirement at build time, but also allows you to resolve device mapper
devices without being root. This means that UDEV_UPATH get set correctly
when running zpool status/iostat as an unprivileged user.
Example:
$ zpool status -c 'echo I am $VDEV_PATH, $VDEV_UPATH'
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
mpatha ONLINE 0 0 0 I am /dev/mapper/mpatha, /dev/sdc
sdb ONLINE 0 0 0 I am /dev/sdb1, /dev/sdb
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#5368
Allow `zfs unshare <protocol> -a` command to share or unshare all datasets
of a given protocol, nfs or smb.
Additionally, enable most of ZFS Test Suite zfs_share/zfs_unshare test cases.
To work around some Illumos-specific functionalities ($SHARE/$UNSHARE) some
function wrappers were added around them.
Finally, fix and issue in smb_is_share_active() that would leave SMB shares
exported when invoking 'zfs unshare -a'
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#3238Closes#5367
Now that ZED has internal fault diagnosis and the statechange event
is generated for faulted states, we can replace the io-notify and
checksum-notify zedlets with one based on statechange.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@intel.com>
Closes#5383
These were named in the zed/Makefile.am as vdev_clear-blinkled.sh
and statechange-blinkled.sh causing bad symlinks to be created.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes#5384
The phase 2 work primarily entails the Diagnosis Engine and
the Retire Agent modules. It also includes infrastructure
to support a crude FMD environment to host these modules.
The Diagnosis Engine consumes I/O and checksum ereports and
feeds them into a SERD engine which will generate a corres-
ponding fault diagnosis when the SERD engine fires. All the
diagnosis state data is collected into cases, one case per
vdev being tracked.
The Retire Agent responds to diagnosed faults by isolating
the faulty VDEV. It will notify the ZFS kernel module of
the new VDEV state (degraded or faulted). This agent is
also responsible for managing hot spares across pools.
When it encounters a device fault or a device removal it
replaces the device with an appropriate spare if available.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@intel.com>
Closes#5343
The previous autoreplace code assumed that if you were using autoreplace, then
you also had the enclosure SES driver loaded. This could lead to autoreplace
not working if the SES driver wasn't loaded, or if it wasn't creating the
proper enclosure_device symlinks (which has happened). This patch removes
that assumption.
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#5363
avx512f should work on all AVX512 hardware, since it only uses
Foundation instructions.
avx512bw should be faster on hardware supporting the AVW512BW
extension. We can use full-width pshufb (instead of relying on the 256
bits AVX2 pshufb). As a side-effect, the code is also unrolled more.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.github@dolbeau.name>
Closes#5219
Sometimes it is desirable to specifically disable one or several
features directly on the 'zpool create' command line.
$ zpool create -o feature@<feature>=disabled ...
Original-patch-by: Turbo Fredriksson <turbo@bayour.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#3460Closes#5142Closes#5324
- Fix autoreplace behaviour on statechange-led.sh script.
ZED sends the following events on an auto-replace:
1. statechange: Disk goes UNAVAIL->ONLINE
2. statechange: Disk goes ONLINE->UNAVAIL
3. vdev_attach: Disk goes ONLINE
Events 1-2 happen when ZED first attempts to do an auto-online. When that
fails, ZED then tries an auto-replace, generating the vdev_attach event in #3.
In the previous code, statechange-led was only looking at the UNAVAIL->ONLINE
transition to turn off the LED. It ignored the #2 ONLINE->UNAVAIL transition,
assuming it was just the "old" VDEV going offline. This is problematic, as
a drive can go from ONLINE->UNAVAIL when it's malfunctioning, and we don't want
to ignore that.
This new patch correctly turns on the fault LED every time a drive becomes
UNAVAIL. It also monitors vdev_attach events to trigger turning off the LED
when an auto-replaced disk comes online.
- Remove unnecessary libdevmapper warning with --with-config=kernel
This fixes an unnecessary libdevmapper warning when building
--with-config=kernel. Kernel code does not use libdevmapper, so the warning
is not needed.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#2375Closes#5312Closes#5331
Previously when a drive faulted, the statechange-led.sh script would lookup
the drive's LED sysfs entry in /sys/block/sd*/device/enclosure_device, and
turn it on. During testing we noticed that if you pulled out a drive, or if
the drive was so badly broken that it no longer appeared to Linux, that the
/sys/block/sd* path would be removed, and the script could not lookup the
LED entry.
To fix this, this patch looks up the disks's more persistent
"/sys/class/enclosure/X:X:X:X/Slot N" LED sysfs path at pool import. It then
passes that path to the statechange-led script to use, rather than having the
script look it up on the fly. This allows the script to turn on/off the slot
LEDs even when the drive is missing.
Closes#5309Closes#2375
1. Enable multipath autoreplace support for FMA.
This extends FMA autoreplace to work with multipath disks. This
requires libdevmapper to be installed at build time.
2. Turn on/off fault LEDs when VDEVs become degraded/faulted/online
Set ZED_USE_ENCLOSURE_LEDS=1 in zed.rc to have ZED turn on/off the enclosure
LED for a drive when a drive becomes FAULTED/DEGRADED. Your enclosure must
be supported by the Linux SES driver for this to work. The enclosure LED
scripts work for multipath devices as well. The scripts will clear the LED
when the fault is cleared.
3. Rate limit ZIO delay and checksum events so as not to flood ZED
ZIO delay and checksum events are rate limited to 5/sec in the zfs module.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#2449Closes#3017Closes#5159
CID 147643: Type: String not null terminated
- make sure that the string is null terminated before strlen
and fprintf.
CID 152204: Type: Copy into fixed size buffer
- since strlcpy isn't availabe here, use strncpy and terminate
the string manually.
CID 49339: Type: Buffer not null terminated
- since strlcpy isn't availabe here, terminate the string
manually before fprintf.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: GeLiXin <ge.lixin@zte.com.cn>
Closes#5283
First rename spare_cbdata_t cb -> spare_cb in print_status_config(),
to free up cb.
Using the structure removes the explicit parameters namewidth
and name_flags from several functions. Also use status_cbdata_t
for print_import_config(). This simplifies print_logs().
Remove the parameter 'verbose' for print_logs(). It does not really
mean verbose, it selected between the print_status_config and
print_import_config() paths. This selection is now done by
cb_print_config of spare_cbdata_t.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Håkan Johansson <f96hajo@chalmers.se>
Closes#5259
When array is passed as a parameter it degenerates into a
pointer so the sizeof(path) in is_shorthand_path() and always
get return value of 8, instead of the string length we want.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: GeLiXin <ge.lixin@zte.com.cn>
Closes#5198
The following new test cases need to have execute permissions set:
userquota/groupspace_003_pos.ksh
userquota/userquota_013_pos.ksh
userquota/userspace_003_pos.ksh
upgrade/upgrade_userobj_001_pos.ksh
upgrade/setup.ksh
upgrade/cleanup.ksh
The following source files accidentally were marked executable:
lib/libzpool/kernel.c
lib/libshare/nfs.c
lib/libzfs/libzfs_dataset.c
lib/libzfs/libzfs_util.c
tests/zfs-tests/cmd/rm_lnkcnt_zero_file/rm_lnkcnt_zero_file.c
tests/zfs-tests/cmd/dir_rd_update/dir_rd_update.c
cmd/zed/zed_exec.c
module/icp/core/kcf_sched.c
module/zfs/dsl_pool.c
module/zfs/arc.c
module/nvpair/nvpair.c
man/man5/zfs-module-parameters.5
Reviewed-by: GeLiXin <ge.lixin@zte.com.cn>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#5241
Fixes ABI issues with fletcher4 code, adds support for
incremental updates, and adds ztest method for testing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes#5164
Introduce a make recipe for flake8 to enable python
style checking. Ensure all python scripts pass flake8.
Return an error code of 0 for arcstat.py -v and
dbufstat.py -v. Add test cases for python scripts.
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ian Lee <IanLee1521@gmail.com>
Closes#5230
This patch tracks dnode usage for each user/group in the
DMU_USER/GROUPUSED_OBJECT ZAPs. ZAP entries dedicated to dnode
accounting have the key prefixed with "obj-" followed by the UID/GID
in string format (as done for the block accounting).
A new SPA feature has been added for dnode accounting as well as
a new ZPL version. The SPA feature must be enabled in the pool
before upgrading the zfs filesystem. During the zfs version upgrade,
a "quotacheck" will be executed by marking all dnode as dirty.
ZoL-bug-id: https://github.com/zfsonlinux/zfs/issues/3500
Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: Johann Lombardi <johann.lombardi@intel.com>
Both scripts were returning an error code of 1
when using the -v argument. -v should exit with
an error code of 0.
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Combine incrementally computed fletcher4 checksums. Checksums are combined
a posteriori, allowing for parallel computation on chunks to be implemented if
required. The algorithm is general, and does not add changes in each SIMD
implementation.
New test in ztest verifies incremental fletcher computations.
Checksum combining matrix for two buffers `a` and `b`, where `Ca` and `Cb` are
respective fletcher4 checksums, `Cab` is combined checksum, `s` is size of buffer
`b` (divided by sizeof(uint32_t)) is:
Cab[A] = Cb[A] + Ca[A]
Cab[B] = Cb[B] + Ca[B] + s * Ca[A]
Cab[C] = Cb[C] + Ca[C] + s * Ca[B] + s(s+1)/2 * Ca[A]
Cab[D] = Cb[D] + Ca[D] + s * Ca[C] + s(s+1)/2 * Ca[B] + s(s+1)(s+2)/6 * Ca[A]
NOTE: this calculation overflows for larger buffers. Thus, internally, the calculation
is performed on 8MiB chunks.
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported by: Tony Hutter <hutter2@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/4185
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45818ee
Porting Notes:
This code is ported on top of the Illumos Crypto Framework code:
b5e030c8db
The list of porting changes includes:
- Copied module/icp/include/sha2/sha2.h directly from illumos
- Removed from module/icp/algs/sha2/sha2.c:
#pragma inline(SHA256Init, SHA384Init, SHA512Init)
- Added 'ctx' to lib/libzfs/libzfs_sendrecv.c:zio_checksum_SHA256() since
it now takes in an extra parameter.
- Added CTASSERT() to assert.h from for module/zfs/edonr_zfs.c
- Added skein & edonr to libicp/Makefile.am
- Added sha512.S. It was generated from sha512-x86_64.pl in Illumos.
- Updated ztest.c with new fletcher_4_*() args; used NULL for new CTX argument.
- In icp/algs/edonr/edonr_byteorder.h, Removed the #if defined(__linux) section
to not #include the non-existant endian.h.
- In skein_test.c, renane NULL to 0 in "no test vector" array entries to get
around a compiler warning.
- Fixup test files:
- Rename <sys/varargs.h> -> <varargs.h>, <strings.h> -> <string.h>,
- Remove <note.h> and define NOTE() as NOP.
- Define u_longlong_t
- Rename "#!/usr/bin/ksh" -> "#!/bin/ksh -p"
- Rename NULL to 0 in "no test vector" array entries to get around a
compiler warning.
- Remove "for isa in $($ISAINFO); do" stuff
- Add/update Makefiles
- Add some userspace headers like stdio.h/stdlib.h in places of
sys/types.h.
- EXPORT_SYMBOL *_Init/*_Update/*_Final... routines in ICP modules.
- Update scripts/zfs2zol-patch.sed
- include <sys/sha2.h> in sha2_impl.h
- Add sha2.h to include/sys/Makefile.am
- Add skein and edonr dirs to icp Makefile
- Add new checksums to zpool_get.cfg
- Move checksum switch block from zfs_secpolicy_setprop() to
zfs_check_settable()
- Fix -Wuninitialized error in edonr_byteorder.h on PPC
- Fix stack frame size errors on ARM32
- Don't unroll loops in Skein on 32-bit to save stack space
- Add memory barriers in sha2.c on 32-bit to save stack space
- Add filetest_001_pos.ksh checksum sanity test
- Add option to write psudorandom data in file_write utility
This re-use the framework established for SSE2, SSSE3 and
AVX2. However, GCC is using FP registers on Aarch64, so
unlike SSE/AVX2 we can't rely on the registers being left alone
between ASM statements. So instead, the NEON code uses
C variables and GCC extended ASM syntax. Note that since
the kernel explicitly disable vector registers, they
have to be locally re-enabled explicitly.
As we use the variable's number to define the symbolic
name, and GCC won't allow duplicate symbolic names,
numbers have to be unique. Even when the code is not
going to be used (e.g. the case for 4 registers when
using the macro with only 2). Only the actually used
variables should be declared, otherwise the build
will fails in debug mode.
This requires the replacement of the XOR(X,X) syntax
by a new ZERO(X) macro, which does the same thing but
without repeating the argument. And perhaps someday
there will be a machine where there is a more efficient
way to zero a register than XOR with itself. This affects
scalar, SSE2, SSSE3 and AVX2 as they need the new macro.
It's possible to write faster implementations (different
scheduling, different unrolling, interleaving NEON and
scalar, ...) for various cores, but this one has the
advantage of fitting in the current state of the code,
and thus is likely easier to review/check/merge.
The only difference between aarch64-neon and aarch64-neonx2
is that aarch64-neonx2 unroll some functions some more.
Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.dolbeau@atos.net>
Closes#4801
coverity scan CID:147536, type: Argument cannot be negative
- may write or close fd which is negative
coverity scan CID:147537, type: Argument cannot be negative
- may call dup2 with a negative fd
coverity scan CID:147538, type: Argument cannot be negative
- may read or fchown with a negative fd
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: GeLiXin <ge.lixin@zte.com.cn>
Closes#5185
When timeout is specified (-t), stop worker threads in the middle of work units.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Issue #5180Closes#5190
1.coverity scan CID:147445 function zfs_do_send in zfs_main.c
Buffer not null terminated (BUFFER_SIZE_WARNING)
2.coverity scan CID:147443 function zfs_do_bookmark in zfs_main.c
Buffer not null terminated (BUFFER_SIZE_WARNING)
3.coverity scan CID:147660 function main in zinject.c
Passing string argv[0] of unknown size to strcpy
By the way, the leak of g_zfs is fixed.
4.coverity scan CID: 147442 function make_disks in zpool_vdev.c
Buffer not null terminated (BUFFER_SIZE_WARNING)
5.coverity scan CID: 147661 function main in dir_rd_update.c
passing string cp1 of unknown size to strcpy
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Closes#5130
Fix misleading error message:
"The /dev/zfs device is missing and must be created.", if /etc/mtab is missing.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Eric Desrochers <eric.desrochers@canonical.com>
Closes#4680Closes#5029
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported by: David Quigley <david.quigley@intel.com>
This review covers the reading and writing of compressed arc headers, sharing
data between the arc_hdr_t and the arc_buf_t, and the implementation of a new
dbuf cache to keep frequently access data uncompressed.
I've added a new member to l1 arc hdr called b_pdata. The b_pdata always hangs
off the arc_buf_hdr_t (if an L1 hdr is in use) and points to the physical block
for that DVA. The physical block may or may not be compressed. If compressed
arc is enabled and the block on-disk is compressed, then the b_pdata will match
the block on-disk and remain compressed in memory. If the block on disk is not
compressed, then neither will the b_pdata. Lastly, if compressed arc is
disabled, then b_pdata will always be an uncompressed version of the on-disk
block.
Typically the arc will cache only the arc_buf_hdr_t and will aggressively evict
any arc_buf_t's that are no longer referenced. This means that the arc will
primarily have compressed blocks as the arc_buf_t's are considered overhead and
are always uncompressed. When a consumer reads a block we first look to see if
the arc_buf_hdr_t is cached. If the hdr is cached then we allocate a new
arc_buf_t and decompress the b_pdata contents into the arc_buf_t's b_data. If
the hdr already has a arc_buf_t, then we will allocate an additional arc_buf_t
and bcopy the uncompressed contents from the first arc_buf_t to the new one.
Writing to the compressed arc requires that we first discard the b_pdata since
the physical block is about to be rewritten. The new data contents will be
passed in via an arc_buf_t (uncompressed) and during the I/O pipeline stages we
will copy the physical block contents to a newly allocated b_pdata.
When an l2arc is inuse it will also take advantage of the b_pdata. Now the
l2arc will always write the contents of b_pdata to the l2arc. This means that
when compressed arc is enabled that the l2arc blocks are identical to those
stored in the main data pool. This provides a significant advantage since we
can leverage the bp's checksum when reading from the l2arc to determine if the
contents are valid. If the compressed arc is disabled, then we must first
transform the read block to look like the physical block in the main data pool
before comparing the checksum and determining it's valid.
OpenZFS-issue: https://www.illumos.org/issues/6950
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7fc10f0
Issue #5078
This first phase brings over the ZFS SLM module, zfs_mod.c, to handle
auto operations in response to disk events. Disk event monitoring is
provided from libudev and generates the expected payload schema for
zfs_mod. This work leverages the recently added devid and phys_path
strings in the vdev label.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@intel.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#4673
The argument processing is zhack makes the assumption that getopt()
will not permute argv. This isn't true for the GNU implementation of
getopt() unless the optstring is prefixed with a '+'. In which case
this is equivalent to setting the POSIXLY_CORRECT environment variable
In addition, update the usage() and optstrings to reflect the existing
supported options.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: liaoyuxiangqin <guo.yong33@zte.com.cn>
Closes#5047
Erroneous access detected by gcc UndefinedBehaviorSanitizer:
`zdb.c:2424:7: runtime error: index 112 out of bounds for type 'uint64_t [112]'`
Fix: increase histogram size by 1 to accommodate all possible sizes.
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4934
Issue #4883
Authored by: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Signed-off-by: Don Brady <don.brady@intel.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/5997
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1437283
Porting Notes:
In addition to the OpenZFS changes this patch realigns the events
with those found in OpenZFS.
Events which would be logged as sysevents on illumos have been
been mapped to the 'sysevent' class for Linux. In addition, several
subclass names have been changed to match what is used in OpenZFS.
In all cases this means a '.' was changed to an '_' in the subclass.
The scripts provided by ZoL have been updated, however users which
provide scripts for any of the following events will need to rename
them based on the new subclass names.
ereport.fs.zfs.config.sync sysevent.fs.zfs.config_sync
ereport.fs.zfs.zpool.destroy sysevent.fs.zfs.pool_destroy
ereport.fs.zfs.zpool.reguid sysevent.fs.zfs.pool_reguid
ereport.fs.zfs.vdev.remove sysevent.fs.zfs.vdev_remove
ereport.fs.zfs.vdev.clear sysevent.fs.zfs.vdev_clear
ereport.fs.zfs.vdev.check sysevent.fs.zfs.vdev_check
ereport.fs.zfs.vdev.spare sysevent.fs.zfs.vdev_spare
ereport.fs.zfs.vdev.autoexpand sysevent.fs.zfs.vdev_autoexpand
ereport.fs.zfs.resilver.start sysevent.fs.zfs.resilver_start
ereport.fs.zfs.resilver.finish sysevent.fs.zfs.resilver_finish
ereport.fs.zfs.scrub.start sysevent.fs.zfs.scrub_start
ereport.fs.zfs.scrub.finish sysevent.fs.zfs.scrub_finish
ereport.fs.zfs.bootfs.vdev.attach sysevent.fs.zfs.bootfs_vdev_attach
Also print decompress progress to stderr so it wouldn't pollute raw output
with r flag.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4956
As of gcc 6.1.1 20160621 (Red Hat 6.1.1-3) an array bounds warnings
is detected in the zdb the dump_object() function. The analysis is
correct but difficult to interpret because this is implemented as a
macro. Rework the ZDB_OT_NAME in to a function and remove the case
detected by gcc which is a side effect of the DMU_OT_IS_VALID() macro.
zdb.c: In function ‘dump_object’:
zdb.c:1931:288: error: array subscript is outside array bounds
[-Werror=array-bounds]
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes#4907
Leaks reported by using AddressSanitizer, GCC 6.1.0
Direct leak of 4097 byte(s) in 1 object(s) allocated from:
#1 0x414f73 in process_options cmd/ztest/ztest.c:721
Direct leak of 5440 byte(s) in 17 object(s) allocated from:
#1 0x41bfd5 in umem_alloc ../../lib/libspl/include/umem.h:88
#2 0x41bfd5 in ztest_zap_parallel cmd/ztest/ztest.c:4659
#3 0x4163a8 in ztest_execute cmd/ztest/ztest.c:5907
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4896
Here's the problem - on 4K native devices in userland on
Linux using O_DIRECT, buffers must be 4K aligned or I/O
will fail with EINVAL, causing zdb (and others) to coredump.
Since userland probably doesn't need optimized buffer caches,
we just force 4K alignment on everything.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes#4479
Commit 519129f added support to multi-thread 'zpool import' for
the case where block devices are scanned for under /dev/. This
commit generalizes that logic and applies it to the case where
device names are acquired from libblkid.
The zpool_find_import_scan() and zpool_find_import_blkid()
functions create an AVL tree containing each device name. Each
entry in this tree is dispatched to a taskq where the function
zpool_open_func() validates the device by opening it and reading
the label. This may result in additional entries being added
to the tree and those device paths being verified.
This is largely how the upstream OpenZFS code behaves but due to
significant differences the non-Linux code has been dropped for
readability. Additionally, this code makes use of taskqs and
kmutexs which are normally not available to the command line tools.
Special care has been taken to allow their use in the import
functions.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes#4794
- Implementation lock replaced with atomic variable
- Trailing whitespace is removed from user specified parameter, to enhance
experience when using commands that add newline, e.g. `echo`
- raidz_test: remove dependency on `getrusage()` and RUSAGE_THREAD, Issue #4813
- silence `cppcheck` in vdev_raidz, partial solution of Issue #1392
- Minor fixes and cleanups
- Enable use of original parity methods in [fastest] configuration.
New opaque original ops structure, representing native methods, is added
to supported raidz methods. Original parity methods are executed if selected
implementation has NULL fn pointer.
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4813
Issue #1392
Commit 7f60329 removed several kstats which arc_summary.py read.
Remove these kstats from arc_summary.py in the same way this was
handled in FreeNAS.
FreeNAS-commit: https://github.com/freenas/freenas/commit/3901f73
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4695
2605 want to resume interrupted zfs send
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Xin Li <delphij@freebsd.org>
Reviewed by: Arne Jansen <sensille@gmx.net>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: kernelOfTruth <kerneloftruth@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/2605
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9c3fd12
6980 6902 causes zfs send to break due to 32-bit/64-bit struct mismatch
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/6980
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ea4a67f
Porting notes:
- All rsend and snapshop tests enabled and updated for Linux.
- Fix misuse of input argument in traverse_visitbp().
- Fix ISO C90 warnings and errors.
- Fix gcc 'missing braces around initializer' in
'struct send_thread_arg to_arg =' warning.
- Replace 4 argument fletcher_4_native() with 3 argument version,
this change was made in OpenZFS 4185 which has not been ported.
- Part of the sections for 'zfs receive' and 'zfs send' was
rewritten and reordered to approximate upstream.
- Fix mktree xattr creation, 'user.' prefix required.
- Minor fixes to newly enabled test cases
- Long holds for volumes allowed during receive for minor registration.
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3542
This reverts commit d0de2e82df which
introduced a new test case to ztest which is failing occasionally
during automated testing. The change is being reverted until
the issue can be fully investigated.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4754
This is a new implementation of RAIDZ1/2/3 routines using x86_64
scalar, SSE, and AVX2 instruction sets. Included are 3 parity
generation routines (P, PQ, and PQR) and 7 reconstruction routines,
for all RAIDZ level. On module load, a quick benchmark of supported
routines will select the fastest for each operation and they will
be used at runtime. Original implementation is still present and
can be selected via module parameter.
Patch contains:
- specialized gen/rec routines for all RAIDZ levels,
- new scalar raidz implementation (unrolled),
- two x86_64 SIMD implementations (SSE and AVX2 instructions sets),
- fastest routines selected on module load (benchmark).
- cmd/raidz_test - verify and benchmark all implementations
- added raidz_test to the ZFS Test Suite
New zfs module parameters:
- zfs_vdev_raidz_impl (str): selects the implementation to use. On
module load, the parameter will only accept first 3 options, and
the other implementations can be set once module is finished
loading. Possible values for this option are:
"fastest" - use the fastest math available
"original" - use the original raidz code
"scalar" - new scalar impl
"sse" - new SSE impl if available
"avx2" - new AVX2 impl if available
See contents of `/sys/module/zfs/parameters/zfs_vdev_raidz_impl` to
get the list of supported values. If an implementation is not supported
on the system, it will not be shown. Currently selected option is
enclosed in `[]`.
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4328
ZFS allows for specific permissions to be delegated to normal users
with the `zfs allow` and `zfs unallow` commands. In addition, non-
privileged users should be able to run all of the following commands:
* zpool [list | iostat | status | get]
* zfs [list | get]
Historically this functionality was not available on Linux. In order
to add it the secpolicy_* functions needed to be implemented and mapped
to the equivalent Linux capability. Only then could the permissions on
the `/dev/zfs` be relaxed and the internal ZFS permission checks used.
Even with this change some limitations remain. Under Linux only the
root user is allowed to modify the namespace (unless it's a private
namespace). This means the mount, mountpoint, canmount, unmount,
and remount delegations cannot be supported with the existing code. It
may be possible to add this functionality in the future.
This functionality was validated with the cli_user and delegation test
cases from the ZFS Test Suite. These tests exhaustively verify each
of the supported permissions which can be delegated and ensures only
an authorized user can perform it.
Two minor bug fixes were required for test-running.py. First, the
Timer() object cannot be safely created in a `try:` block when there
is an unconditional `finally` block which references it. Second,
when running as a normal user also check for scripts using the
both the .ksh and .sh suffixes.
Finally, existing users who are simulating delegations by setting
group permissions on the /dev/zfs device should revert that
customization when updating to a version with this change.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#362Closes#434Closes#4100Closes#4394Closes#4410Closes#4487
New functionality:
- Preserves existing scalar implementation.
- Adds AVX2 optimized Fletcher-4 computation.
- Fastest routines selected on module load (benchmark).
- Test case for Fletcher-4 added to ztest.
New zcommon module parameters:
- zfs_fletcher_4_impl (str): selects the implementation to use.
"fastest" - use the fastest version available
"cycle" - cycle trough all available impl for ztest
"scalar" - use the original version
"avx2" - new AVX2 implementation if available
Performance comparison (Intel i7 CPU, 1MB data buffers):
- Scalar: 4216 MB/s
- AVX2: 14499 MB/s
See contents of `/sys/module/zcommon/parameters/zfs_fletcher_4_impl`
to get list of supported values. If an implementation is not supported
on the system, it will not be shown. Currently selected option is
enclosed in `[]`.
Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4330
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/6531
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/97e8130
Porting notes:
- Added new IO delay tracepoints, and moved common ZIO tracepoint macros
to a new trace_common.h file.
- Used zio_delay_taskq() in place of OpenZFS's timeout_generic() function.
- Updated zinject man page
- Updated zpool_scrub test files
This is a purely cosmetical change, to consistently prefer one of
two (both acceptable) choises for the word parsable in documentation and
code. I don't really care which to use, but acording to wiktionary
https://en.wiktionary.org/wiki/parsable#English parsable is preferred.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4682
Add argument format to print_one_column(), and use it to call
zfs_nicenum_format with, instead of just zfs_nicenum. Don't print "%"
for fragmentation or capacity percent values.
The calls to print_one_colum is made with ZFS_NICENUM_RAW if
cb->cb_literal (zpool list called with -p), and ZFS_NICENUM_1024 if not.
Also zpool_get_prop is modified to don't add "%" or "x" if literal.
Signed-off-by: Christer Ekholm <che@chrekh.se>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov
Closes#4657
This reverts commit 8302528617 and
ebecfcd699 which broke the build.
While these patches do apply cleanly and passed previous test
runs they need to be updated to account for the changes made in
commit 241b541574.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3878
Add a force option to allow zhack to add features which are
part of the known set of supported features. By default
this is disabled.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3878
The zfs range lock interface no longer tightly depends on a
znode_t and therefore can be used in ztest. This allows the
previous ztest specific implementation to be removed, and for
additional test coverage of the shared version.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4023
Issue #4024
3993 zpool(1M) and zfs(1M) should support -p for "list" and "get"
4700 "zpool get" doesn't support -H or -o options
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/3993
OpenZFS-issue: https://www.illumos.org/issues/4700
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c58b352
Porting notes:
I removed ZoL's zpool_get_prop_literal() in favor of
zpool_get_prop(..., boolean_t literal) since that's what OpenZFS
uses. The functionality is the same.
5669 altroot not set in zpool create when specified with -o
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/5669
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c423721Closes#4594
6659 nvlist_free(NULL) is a no-op
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Marcel Telka <marcel@telka.sk>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/issues/6659https://github.com/illumos/illumos-gate/commit/aab83bb
Ported-by: David Quigley <dpquigl@davequigley.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4566
When ZFS partitions a block device it must wait for udev to create
both a device node and all the device symlinks. This process takes
a variable length of time and depends on factors such how many links
must be created, the complexity of the rules, etc. Complicating
the situation further it is not uncommon for udev to create and
then remove a link multiple times while processing the udev rules.
Given the above, the existing scheme of waiting for an expected
partition to appear by name isn't 100% reliable. At this point
udev may still remove and recreate think link resulting in the
kernel modules being unable to open the device.
In order to address this the zpool_label_disk_wait() function
has been updated to use libudev. Until the registered system
device acknowledges that it in fully initialized the function
will wait. Once fully initialized all device links are checked
and allowed to settle for 50ms. This makes it far more likely
that all the device nodes will exist when the kernel modules
need to open them.
For systems without libudev an alternate zpool_label_disk_wait()
was updated to include a settle time. In addition, the kernel
modules were updated to include retry logic for this ENOENT case.
Due to the improved checks in the utilities it is unlikely this
logic will be invoked. However, if the rare event it is needed
it will prevent a failure.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes#4523Closes#3708Closes#4077Closes#4144Closes#4214Closes#4517
When partitioning a device a name may be specified for each partition.
Internally zfs doesn't use this partition name for anything so it
has always just been set to "zfs".
However this isn't optimal because udev will create symlinks using
this name in /dev/disk/by-partlabel/. If the name isn't unique
then all the links cannot be created.
Therefore a random 64-bit value has been added to the partition
label, i.e "zfs-1234567890abcdef". Additional information could
be encoded here but since partitions may be reused that might
result in confusion and it was decided against.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes#4517
Also enable lazytime in mount.zfs
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4482
This is foundational work for ZED.
Updates a leaf vdev's persistent device strings on Linux platform
* only applies for a dedicated leaf vdev (aka whole disk)
* updated during pool create|add|attach|import
* used for matching device matching during auto-{online,expand,replace}
* stored in a leaf disk config label (i.e. alongside 'path' NVP)
* can opt-out using env var ZFS_VDEV_DEVID_OPT_OUT=YES
Some examples:
path: '/dev/sdb1'
devid: 'scsi-350000394a8ca4fbc-part1'
phys_path: 'pci-0000:04:00.0-sas-0x50000394a8ca4fbf-lun-0'
path: '/dev/mapper/mpatha'
devid: 'dm-uuid-mpath-35000c5006304de3f'
Signed-off-by: Don Brady <don.brady@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2856Closes#3978Closes#4416
The `zfs userspace` squashes all entries with unresolved numeric
values into a single output entry due to the comparsion always
made by the string name which is empty in case of unresolved IDs.
Fix this by falling to a numerical comparison when either one
of string values is not found. This then compares any numerical
values after all with a name resolved.
Signed-off-by: Pavel Boldin <boldin.pavel@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4440
This issue was caused by calling `thread_init()` and `thread_fini()`
multiple times resulting in `kthread_key` being invalid. To resolve
the issue the explicit calls to `thread_init()` and `thread_fini()`
required by the `zpool` command have been moved in to the command.
Consumers such as `zdb` and `zhack` perform the same initialized
through `kernel_init()` and `kernel_fini()`.
Resolving this issue allows multiple additional test cases to
be enabled.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#4331
I noticed during code review of zfsonlinux/zfs#4385 that the author of a
commit had peppered the various Makefile.am files with `$(TIRPC_LIBS)`
when putting it into `lib/libspl/Makefile.am` should have sufficed. Upon
further examination, it seems that he had copied what we do with
`$(ZLIB)`. We also have a bit of that with `-ldl` too. Unfortunately,
what we do is wrong, so lets fix it to set a good example for future
contributors.
In addition, we have multiple `-lz` and `-luuid` passed to the compiler
because each `AC_CHECK_LIB` adds it to `$LIBS`. That is somewhat
annoying to see, so we switch to `AC_SEARCH_LIBS` to avoid it. This is
consistent with the recommendation to use `AC_SEARCH_LIBS` over
`AC_CHECK_LIB` by autotools upstream:
https://www.gnu.org/software/autoconf/manual/autoconf-2.66/html_node/Libraries.html
In an ideal world, this would translate into improvements in ELF's
`DT_NEEDED` entries, but that is not the case because of a couple of
bugs in libtool.
The first bug causes libtool to overlink by using static link
dependencies for dynamic linking:
https://wiki.mageia.org/en/Overlinking_issues_in_packaging#libtool_issues
The workaround for this should be to pass `-Wl,--as-needed` in
`LDFLAGS`. That leads us to the second bug, where libtool passes
`LDFLAGS` after the libraries are specified and `ld` will only honor
`--as-needed` on libraries specified before it:
https://sigquit.wordpress.com/2011/02/16/why-asneeded-doesnt-work-as-expected-for-your-libraries-on-your-autotools-project/
There are a few possible workarounds for the second bug. One is to
either patch the compiler spec file to specify `-Wl,--as-needed` or pass
`-Wl,--as-needed` via `CC` like `CC='gcc -Wl,--as-needed'` so that it is
specified early. Another is to patch ltmain.sh like Gentoo does:
https://gitweb.gentoo.org/repo/gentoo.git/tree/eclass/ELT-patches/as-needed
Without one of those workarounds, this cleanup provides no benefit in
terms of `DT_NEEDED` entry generation. It should still be an improvement
because it nicely simplifies the code while encouraging good habits when
patching autotools scripts.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4426
When checking a whole disk to see if it can be safely added to
the pool a variety of checks are done. One of those checks is
to attempt to determine the partition information and scan all
the partitions for existing filesystems.
Since ZoL contains a EFI library this partition scanning is
easy to do for GPT partitioned disks. However, for non-GPT
partitioned disks (MBR/EBR) things are a bit harder. The lack of
a convenient library means non-GPT partitioned disks will not
have all their partitions checked. For this reason, the default
behavior was to require the force option. For example:
invalid vdev specification
use '-f' to override the following errors:
/dev/vdb does not contain an GPT label but it may contain partition
information in the MBR.
However in practice requiring the force option for this case is
counter-intuitively less safe. The reason is because only the first
error is returned. By passing the force option it will suppress
this first warning and potentially others you were not aware of.
Therefore this patch inverts the default behavior for non-GPT
formated disks (unformatted, MBR/EBR, etc). If no GPT table is
detected and there is no file system detected on the provided
block device. Then it will be assumed that block device is safe
to use.
Longer term it would be nice to see MBR/EBR scanning added to
the utilities. This should be fairly straight forward to do.
However these days it's somewhat less critical because Linux
defaults to GPT partition tables for devices 2TB or larger.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2660Closes#2274
Historically libblkid support was detected as part of configure
and optionally enabled. This was done because at the time support
for detecting ZFS pool vdevs had just be added to libblkid and
those updated packages were not yet part of many distributions.
This is no longer the case and any reasonably current distribution
will ship a version of libblkid which can detect ZFS pool vdevs.
This patch makes libblkid mandatory at build time and libblkid
the preferred method of scanning for ZFS pools. For distributions
which include a modern version of libblkid there is no change in
behavior. Explicitly scanning the default search paths is still
supported and can be enabled with the '-s' command line option.
Additionally making libblkid mandatory means that the 'zpool create'
command can reliably detect if a specified device has an existing
non-ZFS filesystem (ext4, xfs) and print a warning.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2448
In zed's _finish_daemonize(), /dev/null is open()d onto a temporary
file descriptor which is then dup()d onto stdin, stdout, and stderr.
But if file descriptors 0, 1, or 2 are not already open at the start
of this function, then the temporary file descriptor will fall within
this range and be inadvertently closed when the function cleans up.
This commit adds a check to prevent inadvertently closing this
(presumably temporary) file descriptor when it shouldn't.
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4384
Commit d2f3e29 introduced the -p option which outputs full paths
for vdevs to multiple zpool subcommands. When this was merged
there was no conflict for this flag letter. However it's certain
there will be a conflict with the -p (parsable) flag used by other
subcommands. Therefore, -p is being changed to -P to avoid this.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4368
The following options have been added to the zpool add, iostat,
list, status, and split subcommands. The default behavior was
not modified, from zfs(8).
-g Display vdev GUIDs instead of the normal short
device names. These GUIDs can be used in-place of
device names for the zpool detach/off‐
line/remove/replace commands.
-L Display real paths for vdevs resolving all symbolic
links. This can be used to lookup the current block
device name regardless of the /dev/disk/ path used
to open it.
-p Display full paths for vdevs instead of only the
last component of the path. This can be used in
conjunction with the -L flag.
This behavior may also be enabled using the following environment
variables.
ZPOOL_VDEV_NAME_GUID
ZPOOL_VDEV_NAME_FOLLOW_LINKS
ZPOOL_VDEV_NAME_PATH
This change is based on worked originally started by Richard Yao
to add a -g option. Then extended by @ilovezfs to add a -L option
for openzfsonosx. Those changes have been merged, re-factored,
a -p option added and extended to all relevant zpool subcommands.
Original-patch-by: Richard Yao <ryao@gentoo.org>
Extended-by: ilovezfs <ilovezfs@icloud.com>
Extended-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: ilovezfs <ilovezfs@icloud.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2011Closes#4341
5767 fix several problems with zfs test suite
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References:
https://www.illumos.org/issues/5767https://github.com/illumos/illumos-gate/commit/52244c0
Porting Notes:
- Only the updates to zpool_main.c were kept because the ZFS test
suite is not currently part of the ZoL source tree. The test
suite itself should be updated to include the latest versions
of the tests once we're running it for every commit
- Fixes `zpool list` output.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Currently, only the 'b' flag takes an argument which is an offset into
the block at which a blkptr should be decoded. The index into the flag
string needed to be updated after parsing an argument.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4304
mount.zfs is called by convention (and util-linux) with arguments
last, i.e.
% mount.zfs <dataset> <mountpoint> -o <options>
This is not a problem on glibc since GNU getopt(3) will reorder the
arguments. However, alternative libc such as musl libc (or glibc with
$POSIXLY_CORRECT set) will not permute argv and fail to parse the -o
<options>. Use getopt_long so musl will permute arguments.
Signed-off-by: Christian Neukirchen <chneukirchen@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4222
6298 zfs_create_008_neg and zpool_create_023_neg need to be updated
for large block support
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/issues/6298https://github.com/illumos/illumos-gate/commit/e9316f7
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4217
3557 dumpvp_size is not updated correctly when a dump zvol's size is changed
3558 setting the volsize on a dump device does not return back ENOSPC
3559 setting a volsize larger than the space available sometimes succeeds
3560 dumpadm should be able to remove a dump device
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Albert Lee <trisk@nexenta.com>
References:
https://www.illumos.org/issues/3559https://github.com/illumos/illumos-gate/commit/c61ea56
Porting notes:
- Internal zvol.c changes not applied due to implementation differences.
The external interface and behavior was already consistent with the
latest upstream code.
- Retired 2.6.28 HAVE_CHECK_DISK_SIZE_CHANGE configure check. All
supported kernels (2.6.32 and newer) provide this interface.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4217
5039 ztest should default to larger device sizes
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/5039https://github.com/illumos/illumos-gate/commit/539eed8
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
4891 want zdb option to dump all metadata
Reviewed by: Sonu Pillai <sonu.pillai@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Garrett D'Amore <garrett@damore.org>
We'd like a way for zdb to dump metadata in a machine-readable
format, so that we can bring that back from a customer site for
in-house diagnosis. Think of it as a crash dump for zpools,
which can be used for post-mortem analysis of a malfunctioning
pool
References:
https://www.illumos.org/issues/4891https://github.com/illumos/illumos-gate/commit/df15e41
Porting notes:
- [cmd/zdb/zdb.c]
- a5778ea zdb: Introduce -V for verbatim import
- In main() getopt 'opt' variable removed and the code was
brought back in line with illumos.
- [lib/libzpool/kernel.c]
- 1e33ac1 Fix Solaris thread dependency by using pthreads
- f0e324f Update utsname support
- 4d58b69 Fix vn_open/vn_rdwr error handling
- In vn_open() allocate 'dumppath' on heap instead of stack
- Properly handle 'dump_fd == -1' error path
- Free 'realpath' after added vn_dumpdir_code block
Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add missing getopt specifier for `zdb -V` verbatim option and
set flag with correct bitwise operator.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
To make arc_summary.py and dbufstat.py compatible with python3
some minor fixes were required, this was done automatically by
`2to3 -w arc_summary.py` and `2to3 -w dbufstat.py`.
Signed-off-by: Hajo Möller <dasjoe@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Reviewed-by: Richard Laager <rlaager@wiktel.com>
2077 lots of unreachable breaks in illumos gate
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/2077https://github.com/illumos/illumos-gate/commit/33f5ff1
Porting notes:
- Only one file of the original patch applied to ZFS
- Minor formating change to align copyright block with upstream
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
5745 zfs set allows only one dataset property to be set at a time
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Reviewed by: Richard PALO <richard@NetBSD.org>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Approved by: Rich Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/5745https://github.com/illumos/illumos-gate/commit/3092556
Porting notes:
- Fix the missing braces around initializer, zfs_cmd_t zc = {"\0"};
- Remove extra format argument in zfs_do_set()
- Declare at the top:
- zfs_prop_t prop;
- nvpair_t *elem;
- nvpair_t *next;
- int i;
- Additionally initialize:
- int added_resv = 0;
- zfs_prop_t prop = 0;
- Assign 0 install of NULL for uint64_t types.
- zc->zc_nvlist_conf = '\0';
- zc->zc_nvlist_src = '\0';
- zc->zc_nvlist_dst = '\0';
Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3574
Commit ca0bf58d to address arcs_mtx contention removed column "index"
from the output of kstats/dbuf.
dbufstat.py was not updated to reflect this, which causes it to crash
when run with -bx
This removes "index" from hardcoded lists of columns.
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4096
This reverts commit 2026196230.
It is no longer necessary now that we pass -DDEBUG unconditionally.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4095
Illumos unconditionally builds zdb and ztest with -DDEBUG. This helps
catch bugs and eliminates the need for commits like
2026196230, which changed ASSERTs to
VERIFYs. The following files in the illumos tree show this:
usr/src/cmd/zdb/Makefile.com
usr/src/cmd/ztest/Makefile.com
Given the usefulness of having early failure in these tools, we should
do it too.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4095
Running arcstat.py -x currently throws KeyError due to rmis being
absent, it was removed in commit ca0bf58.
Signed-off-by: cable2999 <cable2999@users.noreply.github.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3931
5959 clean up per-dataset feature count code
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/5959https://github.com/illumos/illumos-gate/commit/ca0cc39
Porting notes:
illumos code doesn't check for feature_get_refcount() returning
ENOTSUP (which means feature is disabled) in zdb. zfsonlinux added
a check in https://github.com/zfsonlinux/zfs/commit/784652c
due to #3468. The check was reintroduced here.
Ported-by: Witaut Bajaryn <vitaut.bayaryn@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3965
When dumping a block on a little endian system the data must be
byte swapped to display correctly. Example incorrect output:
$ echo 0123456789abcdef > aaa
$ zdb -eR pp 3:1ee00:200
3:1ee00:200
0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef
000000: 3736353433323130 6665646362613938 0123456789abcdef
000010: 000000000000000a 0000000000000000 ................
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4020
The current zdb calling behaviour is really fragile, and is guaranteed to
segfault if ztest is not installed in either /sbin or /usr/sbin. With this
patch, the ztest will try to call zdb in the following order.
1. Use environmental variable ZDB_PATH if provided.
2. If ztest resides in build tree, guess the in tree zdb path.
3. Just pass zdb to popen and let it search it in PATH.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3126
Avoid buffer overrun on all-zero bpobj subobjects by using signed
array index. Also fix the type cast on the printf() argument.
Signed-off-by: Tim Chase <tim@onlight.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3905
As part of the large block support effort, it makes sense to add
support for large blocks to **zpios(1)**. The specifying of a zfs
block size for zpios is optional and will default to 128K if the
block size is not specified.
`zpios ... -S size | --blocksize size ...`
This will use *size* ZFS blocks for each test, specified as a comma
delimited list with an optional unit suffix. The supported range is
powers of two from 128K through 16M. A range of block sizes can be
tested as follows: `-S 128K,256K,512K,1M`
Example run below
(non realistic results from a VM and output abbreviated for space)
```
--regioncount=750 --regionsize=8M --chunksize=1M --offset=4K
--threaddelay=0 --cleanup --human-readable --verbose --cleanup
--blocksize=128K,256K,512K,1M
th-cnt rg-cnt rg-sz ch-sz blksz wr-data wr-bw rd-data rd-bw
---------------------------------------------------------------------
4 750 8m 1m 128k 5g 90.06m 5g 93.37m
4 750 8m 1m 256k 5g 79.71m 5g 99.81m
4 750 8m 1m 512k 5g 42.20m 5g 93.14m
4 750 8m 1m 1m 5g 35.51m 5g 89.36m
8 750 8m 1m 128k 5g 85.49m 5g 90.81m
8 750 8m 1m 256k 5g 61.42m 5g 99.24m
8 750 8m 1m 512k 5g 49.09m 5g 108.78m
16 750 8m 1m 128k 5g 86.28m 5g 88.73m
16 750 8m 1m 256k 5g 64.34m 5g 93.47m
16 750 8m 1m 512k 5g 68.84m 5g 124.47m
16 750 8m 1m 1m 5g 53.97m 5g 97.20m
---------------------------------------------------------------------
```
Signed-off-by: Don Brady <don.brady@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3795Closes#2071
All other sections use a tab, which makes them easy to parse. Only the
"scan:" section had its continuation lines indented with four spaces.
This makes them consistent with the others.
Signed-off-by: Remy Blank <remy.blank@pobox.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#3769
Internally ZFS keeps a small log to facilitate debugging. By default
the log is disabled, to enable it set zfs_dbgmsg_enable=1. The contents
of the log can be accessed by reading the /proc/spl/kstat/zfs/dbgmsg file.
Writing 0 to this proc file clears the log.
$ echo 1 >/sys/module/zfs/parameters/zfs_dbgmsg_enable
$ echo 0 >/proc/spl/kstat/zfs/dbgmsg
$ zpool import tank
$ cat /proc/spl/kstat/zfs/dbgmsg
1 0 0x01 -1 0 2492357525542 2525836565501
timestamp message
1441141408 spa=tank async request task=1
1441141408 txg 70 open pool version 5000; software version 5000/5; ...
1441141409 spa=tank async request task=32
1441141409 txg 72 import pool version 5000; software version 5000/5; ...
1441141414 command: lt-zpool import tank
Note the zfs_dbgmsg() and dprintf() functions are both now mapped to
the same log. As mentioned above the kernel debug log can be accessed
though the /proc/spl/kstat/zfs/dbgmsg kstat. For user space consumers
log messages are immediately written to stdout after applying the
ZFS_DEBUG environment variable.
$ ZFS_DEBUG=on ./cmd/ztest/ztest -V
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#3728
Add the required kernel side infrastructure to parse arbitrary
mount options. This enables us to support temporary mount
options in largely the same way it is handled on other platforms.
See the 'Temporary Mount Point Properties' section of zfs(8)
for complete details.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#985Closes#3351
Add new keyword 'slot' to vdev_id.conf
This selects from where to get the slot number for a SAS/SATA disk
Needed to enable access to the physical position of a disk in a
Supermicro 2027R-AR24NV .
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#3693
At verbosity levels of 6 or greater, ztest_dsl_prop_set_uint64() attempts
to display the value of all properties as indexed values regardless of
whether the property is an indexed value or simply an un-indexed integer.
This patch causes the numeric value of the property to be displayed if
zfs_prop_index_to_string() fails.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3649
This commit reworks the zed_notify_email() function to allow
configuration of the mail executable and command-line arguments.
ZED_EMAIL_PROG specifies the name or path of the executable responsible
for sending notifications via email. This variable defaults to "mail".
ZED_EMAIL_OPTS specifies command-line options passed to ZED_EMAIL_PROG.
The following keyword substitutions are performed:
- @ADDRESS@ is replaced with the recipient email address(es)
- @SUBJECT@ is replaced with the notification subject
This variable defaults to "-s '@SUBJECT@' @ADDRESS@".
ZED_EMAIL_ADDR replaces ZED_EMAIL (although the latter is retained
for backward compatibility). This variable can contain multiple
addresses as long as they are delimited by whitespace.
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3634Closes#3631
This commit fixes the two adjacent spaces that appear in zed_log_err()
messages when ZEVENT_EID is undefined.
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Build products from an out of tree build should be written
relative to the build directory. Sources should be referred
to by their locations in the source directory.
This is accomplished by adding the 'src' and 'obj' variables
for the module Makefile.am, using relative paths to reference
source files, and by setting VPATH when source files are not
co-located with the Makefile. This enables the following:
$ mkdir build
$ cd build
$ ../configure \
--with-spl=$HOME/src/git/spl/ \
--with-spl-obj=$HOME/src/git/spl/build
$ make -s
This change also has the advantage of resolving the following
warning which is generated by modern versions of automake.
Makefile.am:00: warning: source file 'xxx' is in a subdirectory,
Makefile.am:00: but option 'subdir-objects' is disabled
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1082
5118 When verifying or creating a storage pool, error messages
only show one device
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Boris Protopopov <boris.protopopov@me.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://github.com/illumos/illumos-gate/commit/75fbdf9https://www.illumos.org/issues/5118
Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3567
4966 zpool list iterator does not update output
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
https://github.com/illumos/illumos-gate/commit/cd67d23https://www.illumos.org/issues/4966
Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3566
Allocating SPA_MAXBLOCKSIZE on the stack is a bad idea (even with the
old 128K size). Use malloc instead when allocating temporary block
buffer memory.
Signed-off-by: Don Brady <don.brady@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3522
Skip large blocks feature refcount checking if feature is disabled.
Signed-off-by: Don Brady <don.brady@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3468
Both the 'zfs create' and 'zfs clone' commands are expected to
automatically mount and share new filesystems. Since this is common
functionality it has been moved in to a shared helper function.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3459
sysstat's iostat omits the first report when the -y option is used.
This patch adds that functionality and omits the first report with
statistics since system boot.
Signed-off-by: Hajo Möller <dasjoe@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3439
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Porting notes and other significant code changes:
The illumos 5368 patch (ARC should cache more metadata), which
was never picked up by ZoL, is mostly reverted by this patch.
Since ZoL relies on the kernel asynchronously calling the shrinker to
actually reap memory, the shrinker wakes up arc_reclaim_waiters_cv every
time it runs.
The arc_adapt_thread() function no longer calls arc_do_user_evicts()
since the newly-added arc_user_evicts_thread() calls it periodically.
Notable conflicting ZoL commits which conflicted with this patch or
whose effects are either duplicated or un-done by this patch:
302f753 - Integrate ARC more tightly with Linux
39e055c - Adjust arc_p based on "bytes" in arc_shrink
f521ce1 - Allow "arc_p" to drop to zero or grow to "arc_c"
77765b5 - Remove "arc_meta_used" from arc_adjust calculation
94520ca - Prune metadata from ghost lists in arc_adjust_meta
Trace support for multilist_insert() and multilist_remove() has been
added and produces the following output:
fio-12498 [077] .... 112936.448324: zfs_multilist__insert: ml { offset 240 numsublists 80 sublistidx 63 }
fio-12498 [077] .... 112936.448347: zfs_multilist__remove: ml { offset 240 numsublists 80 sublistidx 29 }
The following arcstats have been removed:
recycle_miss - Used by arcstat.py and arc_summary.py, both of which
have been updated appropriately.
l2_writes_hdr_miss
The following arcstats have been added:
evict_not_enough - Number of times arc_evict_state() was unable to
evict enough buffers to reach its target amount.
evict_l2_skip - Number of times arc_evict_hdr() skipped eviction
because it was being written to the l2arc.
l2_writes_lock_retry - Replaces l2_writes_hdr_miss. Number of times
l2arc_write_done() failed to acquire hash_lock (and re-tries).
arc_meta_min - Shows the value of the zfs_arc_meta_min module
parameter (see below).
The "index" column of the "dbuf" kstat has been removed since it doesn't
have a direct analog in the new multilist scheme. Additional multilist-
related stats could be added in the future but would likely require
extensions to the mulilist API.
The following module parameters have been added:
zfs_arc_evict_batch_limit - Number of ARC headers to free per sub-list
before moving on to the next sub-list.
zfs_arc_meta_min - Enforce a floor on the amount of metadata in
the ARC.
zfs_arc_num_sublists_per_state - Number of multilist sub-lists per
ARC state.
zfs_arc_overflow_shift - Controls amount by which the ARC must exceed
the target size to be considered "overflowing".
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov
5408 managing ZFS cache devices requires lots of RAM
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Don Brady <dev.fs.zfs@gmail.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Porting notes:
Due to the restructuring of the ARC-related structures, this
patch conflicts with at least the following existing ZoL commits:
6e1d7276c9
Fix inaccurate arcstat_l2_hdr_size calculations
The ARC_SPACE_HDRS constant no longer exists and has been
somewhat equivalently replaced by HDR_L2ONLY_SIZE.
e0b0ca983d
Add visibility in to cached dbufs
The new layering of l{1,2}arc_buf_hdr_t within the arc_buf_hdr
struct requires additional structure member names to be used
when referencing the inner items. Also, the presence of L1 or L2
inner member is indicated by flags using the new HDR_HAS_L{1,2}HDR
macros.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
5369 arc flags should be an enum
5370 consistent arc_buf_hdr_t naming scheme
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Porting notes:
ZoL has moved some ARC definitions into arc_impl.h.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported by: Tim Chase <tim@chase2k.com>
5818 zfs {ref}compressratio is incorrect with 4k sector size
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Approved by: Albert Lee <trisk@omniti.com>
References:
https://www.illumos.org/issues/5818https://github.com/illumos/illumos-gate/commit/81cd5c5
Ported-by: Don Brady <don.brady@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3432
All fprintf() error messages are moved out of the libzfs_init()
library function where they never belonged in the first place. A
libzfs_error_init() function is added to provide useful error
messages for the most common causes of failure.
Additionally, in libzfs_run_process() the 'rc' variable was renamed
to 'error' for consistency with the rest of the code base.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
5243 zdb -b could be much faster
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5243https://github.com/illumos/illumos-gate/commit/f7950bf
Ported-by: Don Brady <don.brady@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3414
If the pool/dataset command-line argument is specified with a trailing
slash, for example, "tank/", it is interpreted as the root dataset.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3415
5810 zdb should print details of bpobj
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References:
https://www.illumos.org/issues/5810https://github.com/illumos/illumos-gate/commit/732885fc
Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3387
This commit updates the copyright boilerplate within the ZED subtree.
The instructions for appending a contributor copyright line have
been removed. Manually maintaining copyright notices in this
manner is error-prone, imprecise at a file-scope granularity, and
oftentimes inaccurate. These lines can become a pernicious source of
merge conflicts. A commit log is better suited to maintaining this
information. Consequently, a line has been added to the boilerplate
to refer to the git commit log for authoritative copyright attribution.
To account for the scenario where a file may become separated from
the codebase and commit history (i.e., it is copied somewhere else),
a line has been added to identify the file's origin.
http://softwarefreedom.org/resources/2012/ManagingCopyrightInformation.html
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3384
Improve the large block feature test coverage by extending ztest
to frequently change the recordsize. This is specificially designed
to catch corner cases which might otherwise go unnoticed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #354
5027 zfs large block support
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5027https://github.com/illumos/illumos-gate/commit/b515258
Porting Notes:
* Included in this patch is a tiny ISP2() cleanup in zio_init() from
Illumos 5255.
* Unlike the upstream Illumos commit this patch does not impose an
arbitrary 128K block size limit on volumes. Volumes, like filesystems,
are limited by the zfs_max_recordsize=1M module option.
* By default the maximum record size is limited to 1M by the module
option zfs_max_recordsize. This value may be safely increased up to
16M which is the largest block size supported by the on-disk format.
At the moment, 1M blocks clearly offer a significant performance
improvement but the benefits of going beyond this for the majority
of workloads are less clear.
* The illumos version of this patch increased DMU_MAX_ACCESS to 32M.
This was determined not to be large enough when using 16M blocks
because the zfs_make_xattrdir() function will fail (EFBIG) when
assigning a TX. This was immediately observed under Linux because
all newly created files must have a security xattr created and
that was failing. Therefore, we've set DMU_MAX_ACCESS to 64M.
* On 32-bit platforms a hard limit of 1M is set for blocks due
to the limited virtual address space. We should be able to relax
this one the ABD patches are merged.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#354
Support exporting all imported pools in one go, using 'zpool export -a'.
This is accomplished by moving the export parts from zpool_do_export()
in to the new function zpool_export_one(). The for_each_pool() function
is used to enumerate the list of pools to be exported. Passing an argc
of 0 implies the function should be called on all pools.
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #3203
4951 ZFS administrative commands should use reserved space, not with ENOSPC
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
https://www.illumos.org/issues/4373https://github.com/illumos/illumos-gate/commit/7d46dc6
Ported by: Brian Behlendorf <behlendorf1@llnl.gov>
3654 zdb should print number of ganged blocks
3656 remove unused function zap_cursor_move_to_key()
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
https://www.illumos.org/issues/3654https://www.illumos.org/issues/3656https://github.com/illumos/illumos-gate/commit/d5ee8a1
Porting Notes:
3655 and 3657 were part of this commit but those hunks were dropped
since they apply to mdb.
Ported by: Brian Behlendorf <behlendorf1@llnl.gov>
5134 if ZFS_DEBUG or debug= is set, libzpool should enable debug prints
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/projects/illumos-gate/issues/5134https://github.com/illumos/illumos-gate/commit/7fa49ea
Porting notes:
Added dprintf_setup() to main in zfs_main.c and zpool_main.c.
Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2669
The data-notify.sh ZEDLET serves a very similar purpose to
io-notify.sh, namely, to generate a notification in response to a
particular error event. Initially, data-notify.sh was separated from
io-notify.sh since the "data" zevent does not (as I understand it)
pertain to a specific vdev device. This stands in contrast to the
"checksum" and "io" zevents (both handled by io-notify.sh) that can
be attributed to a specific vdev. At the time, it seemed simpler to
handle these two cases in separate scripts.
This commit adds support for the "data" zevent to io-notify.sh, and
symlinks io-notify.sh to data-notify.sh. It also adds the counts
for vdev_read_errors, vdev_write_errors, and vdev_cksum_errors to
the notification message.
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
The io-spare.sh ZEDLET does not generate a notification when a failing
device is replaced with a hot spare. Maybe it should tell someone.
This commit adds a notification message to the io-spare.sh ZEDLET.
This notification is triggered when a failing device is successfully
replaced with a hot spare after encountering a checksum or io error.
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
This commit adds the zed_notify_pushbullet() function and hooks
it into zed_notify(), thereby integrating it with the existing
"notify" ZEDLETs. This enables ZED to push notifications to your
desktop computer and/or mobile device(s). It is configured with the
ZED_PUSHBULLET_ACCESS_TOKEN and ZED_PUSHBULLET_CHANNEL_TAG variables
in zed.rc.
https://www.pushbullet.com/
The Makefile install-data-local target has been replaced with
install-data-hook. With the "-local" target, there is no particular
guarantee of execution order. But with the zed.rc now potentially
containing sensitive information (i.e., the Pushbullet access token),
the recommended permissions have changed to 0600. The "-hook" target
is always executed after the main rule's work is done; thus, the
chmod will always take place after the zed.rc file has been installed.
https://www.gnu.org/software/automake/manual/automake.html#Extending
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Several ZEDLETs already exist for sending email in reponse to a
particular zevent. While email is ubiquitous, alternative methods may
be better suited for some configurations. Instead of duplicating the
"email" ZEDLETs for every future notification method, it is preferable
to abstract the notification method into a function. This has the
added benefit of reducing the amount of code duplicated between
ZEDLETs, and allowing related bugs to be fixed in a single location.
This commit replaces the existing "email" ZEDLETs with corresponding
"notify" ZEDLETs. In addition, the ZEDLET code for sending an
email message has been moved into the zed_notify_email() function.
And this zed_notify_email() has been added to a generic zed_notify()
function for sending notifications via all available methods that
have been configured.
This commit also changes a couple of related zed.rc variables.
ZED_EMAIL_INTERVAL_SECS is changed to ZED_NOTIFY_INTERVAL_SECS,
and ZED_EMAIL_VERBOSE is changed to ZED_NOTIFY_VERBOSE. Note that
ZED_EMAIL remains unchanged as its use is solely for the email
notification method.
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
This commit factors out several common ZEDLET code blocks into
zed-functions.sh. This shortens the length of the scripts, thereby
(hopefully) making them easier to understand and maintain.
In addition, this commit revamps the coding style used by the
scripts to be more consistent and (again, hopefully) maintainable.
It now mostly follows the Google Shell Style Guide. I've tried to
assimilate the following resources:
Google Shell Style Guide
https://google-styleguide.googlecode.com/svn/trunk/shell.xml
Dash as /bin/sh
https://wiki.ubuntu.com/DashAsBinSh
Filenames and Pathnames in Shell: How to do it Correctly
http://www.dwheeler.com/essays/filenames-in-shell.html
Common shell script mistakes
http://www.pixelbeat.org/programming/shell_script_mistakes.html
Finally, this commit updates the exit codes used by the ZEDLETs to be
more consistent with one another.
All scripts run cleanly through ShellCheck <http://www.shellcheck.net/>.
All scripts have been tested on bash and dash.
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
The newroot nvlist should be freed before returning.
Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3264
The "zpool status" output shows the full pathname for file-type vdevs,
but only the basename component for disk-type vdevs. In commit
bee6665, the "basename" command was dropped from altering the vdev
name used when searching the "zpool status" output. Consequently,
hot-disk sparing for disk vdevs broke since "zpool status" output
was now being searched for the full pathname to the disk vdev.
Parsing the "zpool status" output in this manner is rather brittle.
It would be preferable to search for the vdev based on its guid.
But until that happens, this commit adds back the "basename" command
to fix the vdev name breakage.
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3310
5695 dmu_sync'ed holes do not retain birth time
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5695https://github.com/illumos/illumos-gate/commit/70163ac
Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3229
Make the 'zpool import' command honor the overlay property to allow
filesystems to be mounted on a non-empty directory. As it stands now
this property is only checked by the 'zfs mount' command. Move the
check into 'zfs_mount()` in libzpool so the property is honored for all
callers.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3227
When using 'zpool import' to scan for available pools prefer vdev names
which reference vdevs with more valid labels. There should be two labels
at the start of the device and two labels at the end of the device. If
labels are missing then the device has been damaged or is in some other
way incomplete. Preferring names with fully intact labels helps weed out
bad paths and improves the likelihood of being able to import the pool.
This behavior only applies when scanning /dev/ for valid pools. If a
cache file exists the pools described by the cache file will be used.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Closes#3145Closes#2844Closes#3107
Add the arc_summary Makefile to the build system so the script is
properly included in the distribution tarball and installed.
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3147
Under Linux, at least, dladdr() doesn't reliably work for functions which
aren't in a DSO. Add the function name to ztest_info[].
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3130
In dry run mode, zpool should display more of the proposed pool
configuration for "zpool add". This commit adds support for displaying
cache devices.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1106
After 53698a4, the following error occurs when make deb.
CCLD zed
../../lib/libzfs/.libs/libzfs.so: undefined reference to `get_system_hostid'
Add libzpool.la to zed/Makefile.am to fix this
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3080
If spl_hostid is set via module parameter, it's likely different from
gethostid(). Therefore, the userspace tool should read it first before
falling back to gethostid().
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3034
When _zed_event_add_var() was updated to be the common routine
for adding zedlet environment variables, an additional snprintf()
was added to the processing of each nvpair. This commit changes
_zed_event_add_nvpair() to directly call _zed_event_add_var()
for nvpair non-array types, thereby removing a superfluous call to
snprintf(). For consistency, the helper functions for converting
nvpair array types are similarly adjusted to add variables.
The _zed_event_value_is_hex() and _zed_event_add_var() functions have
been moved up in the file since forward declarations are not used,
but no changes have been made to these functions.
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3042
The zed_strings container stores strings in an AVL, but does not
check for duplicate strings being added. Within the AVL, strings
are indexed by the string value itself. avl_add() requires the node
being added must not already exist in the tree, and will assert()
if this is not the case.
This should not cause problems in practice. ZED uses this container
in two places. In zed_conf.c, it is used to store the names of
enabled zedlets as zed scans the zedlet directory listing; duplicate
entries cannot occur here since duplicate names cannot occur within
a directory. In zed_event.c, it is used to store the environment
variables (as "NAME=VALUE" strings) that will be passed to zedlets;
duplicate strings here should never happen unless there is a bug
resulting in a duplicate nvpair or environment variable.
This commit protects against adding a duplicate to a zed_strings
container by first checking for the string being added, and removing
the previous entry should one exist. This implements a "last one
wins" policy.
This commit also changes the prototype for zed_strings_add() to allow
the string key (by which it is indexed in the AVL) to differ from
the string value. By adding zedlet environment variables using the
variable name as the key, multiple adds for the same variable name
will result in only the last value being stored.
Finally, this commit routes all additions of zedlet environment
variables through the updated _zed_event_add_var(). This ensures
all zedlet environment variable names are properly converted.
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3042
The original script displayed tunable parameters using sysctl calls.
This patch modifies this by displaying tunable parameters found in
/sys/modules/zfs/parameters/. modinfo calls are used to capture
descriptions.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kyle Blatter <kyleblatter@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Ensure this script conforms to the projects style guidelines
by limiting line length to 80 columns.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kyle Blatter <kyleblatter@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Add a basic help option and usage description which is consistent
with arcstat.py and dbufstat.py. This also adds support for long
opts.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kyle Blatter <kyleblatter@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
The -p option is used to specify a specific page of output to be
displayed. If omitted, all output pages would be displayed.
arc_summary, as it stood, had really kludgy processing code for
processing the -p option. It relied on a try-except block which was
treated as an if statement and in normal operation would fail any time a
user didn't specify the -p option on the command line. When the
exception was thrown, the script would then display all output pages.
This happened whether the -p option was omitted or malformed. Thus, in
the principle use case, an exception would be raised in order to run the
script as desired. The same except code would be called regardless of
the exception, however, and malformed -p arguments would also cause the
script to execute.
Additionally, this required the function which handles the case where
all output pages were to be displayed, _call_all, to be potentially
called from several locations within main.
This commit refactors the option processing code to simplify it and make
it easier to catch runtime errors in the script. This is done by
specializing the try-except block to only have an exception when the -p
argument is malformed. When the -p option is correctly selected by the
user, it calls a function in the unSub array directly, which will only
display one page of output.
Finally in the context of this refactoring the page breaks have been
removed. Pages seem to have been added into the output in the FreeNAS
version of the script. This patch removes pages from the output to more
closely resemble the freebsd version of the script.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kyle Blatter <kyleblatter@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
1) Comment out stat sections whose kstats are not currently available
2) Port most of arc_summary to use spl kstats
3) Enable l2arc stats
5) Include compressed l2size
4) Minor style fixes / cleanup
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cburroughs <chris.burroughs@gmail.com>
Signed-off-by: Kyle Blatter <kyleblatter@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
The arc_summary script is a useful utility for administrators on
other ZFS platforms. It provides a quick and easy way to get a high
level view of the current ARC state.
Historically this was a perl script but it was rewritten in python
for FreeNAS. We've decided to adopt the python version instead of
the perl version for a few reasons.
1) ZoL has no existing perl dependencies, but it does have a python
dependency for scripts such as arcstat.py and dbufstat.py. Using
python for arc_summary.py helps us minimize dependencies.
2) Most major Linux distributions already depend heavily on python
for their core infrastructure. This means it's very likely to
be available even very early in the boot process.
Original source:
https://github.com/freenas/freenas/blob/master/gui/tools/arc_summary.py
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cburroughs <chris.burroughs@gmail.com>
Signed-off-by: Kyle Blatter <kyleblatter@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Slot mapping in vdev_id doesn't work on systems using mawk as the 'awk'
alternative. A regular expression in map_slot() contains an unquoted
empty string following the alternation (|) operator, which results in an
"missing operand" error with mawk. The solution is to rearrange the
expression so the alternation has two operands.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closeszfsonlinux/pkg-zfs#136Closeszfsonlinux/zfs#2965
Commit c66989b accidentally introduced a cstyle issue which went
unnoticed. This tiny patch corrects that oversight.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Added a handler for SIGWINCH, so that one header
is printed per screen even when the terminal resizes.
Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2847
The field descriptions from arcstat.py -v for the demand accesses
are inaccurate. They all begin with "Demand Data" yet the fields
actually covered both demand data and demand meta-data accesses.
Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2842
Change the zvol helper program to replace any embedded spaces
in the pool or dataset names with '+' to ensure we have valid
symlinks.
The '+' character was choosen because it is not a valid character
for a dataset name but it is allowed by udev. This ensures that
all dataset names with an embedded space will be translated to
a unique /dev/zvol/ symlink.
Signed-off-by: Dan Swartzendruber <dswartz@druber.com>
Signed-off-by: Darik Horn <dajhorn@vanadac.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2834
Installing outside of the prefix is not permissible under Gentoo Prefix.
The package manager will cause the installation process to fail if/when
it sees this. I could script a workaround inside the ebuild, but it
seemed to make more sense to make this more configurable.
Signed-off-by: Richard Yao <richard.yao@clusterhq.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2641
5164 space_map_max_blksz causes panic, does not work
5165 zdb fails assertion when run on pool with recently-enabled
space map_histogram feature
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5164https://www.illumos.org/issues/5165https://github.com/illumos/illumos-gate/commit/b1be289
Porting Notes:
The metaslab_fragmentation() hunk was dropped from this patch
because it was already resolved by commit 8b0a084.
The comment modified in metaslab.c was updated to use the correct
variable name, space_map_blksz. The upstream commit incorrectly
used space_map_blksize.
Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2697
4958 zdb trips assert on pools with ashift >= 0xe
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
https://www.illumos.org/issues/4958https://github.com/illumos/illumos-gate/commit/2a104a5
Porting notes:
Keep the ZIO_FLAG_FASTWRITE define. This is for a feature present
in Linux but not yet in *BSD.
Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2697
On 32-bit systems setting 'zfs_arc_max = 256M' in zdb results in the
following segmentation fault. Rather than reverting 0ec0724 which
introduced this flaw this code is only used for 64-bit builds.
Segmentation fault (core dumped)
ztest: '/sbin/zdb -bcc -d -U /var/tmp/zpool.cache ztest' exit code 139
child exited with code 3
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
5169 zdb should limit its ARC size
5170 zdb -c should create more scrub i/os by default
5171 zdb should print status while loading metaslabs for leak detection
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Bayard Bell <Bayard.Bell@nexenta.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/issues/5169https://www.illumos.org/issues/5170https://www.illumos.org/issues/5171https://github.com/illumos/illumos-gate/commit/06be980
Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2707
In userland we need to switch over to the temporary name once the
pool has been created, otherwise the root dataset won't mount
and the error "cannot open 'the_real_name': dataset does not exist"
is printed.
Signed-off-by: ilovezfs <ilovezfs@icloud.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2760
Change the zpool program to skip its hostid mismatch check in the
same way that libzfs already does.
Invoked imports fail if the ZPOOL_CONFIG_HOSTID nvpair is missing in
the /etc/zfs/zpool.cache file, which can happen as of the /etc/hostid
deprecation in commit zfsonlinux/spl@acf0ade362.
Signed-off-by: Darik Horn <dajhorn@vanadac.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2794
Add signal handlers to print a backtrace if we crash or assert.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2788
5176 lock contention on godfather zio
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: Bayard Bell <Bayard.Bell@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
https://www.illumos.org/issues/5176https://github.com/illumos/illumos-gate/commit/6f834bc
Porting notes:
Under Linux max_ncpus is defined as num_possible_cpus(). This is
largest number of cpu ids which might be available during the life
time of the system boot. This value can be larger than the number
of present cpus if CONFIG_HOTPLUG_CPU is defined.
Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2711
Reset struct zed_conf file descriptors to -1 after close(),
and pointers to NULL after free().
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2756
ZED uses an advisory lock on its state file to protect against
multiple instances running concurrently. However, work is planned
to move this state information into the kernel, and ZED will still
need to protect against starting multiple instances.
This commit adds an advisory lock on the PID file to protect against
starting multiple instances. A lock failure can be overridden with
the "-f" (force) command-line option. The advisory lock on the state
file is being retained for as long as the state information is stored
in the state file.
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2756
Creating virtual machines that have their rootfs on ZFS on hosts that
have their rootfs on ZFS causes SPA namespace collisions when the
standard name rpool is used. The solution is either to give each guest
pool a name unique to the host, which is not always desireable, or boot
a VM environment containing an ISO image to install it, which is
cumbersome.
26b42f3f9d introduced `zpool import -t
...` to simplify situations where a host must access a guest's pool when
there is a SPA namespace conflict. We build upon that to introduce
`zpool import -t tname ...`. That allows us to create a pool whose
in-core name is tname, but whose on-disk name is the normal name
specified.
This simplifies the creation of machine images that use a rootfs on ZFS.
That benefits not only real world deployments, but also ZFSOnLinux
development by decreasing the time needed to perform rootfs on ZFS
experiments.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2417
zpool import's -t parameter is intended for use with -R when operating
on pools that belong to other systems. Like -R, pools imported in this
way should not update the cachefile unless explicitly requested. The
initial implementation allowed the cachefile to be updated when -R was
not used. This went uncaught during testing because -R had implicitly
disabled use of the cachefile.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2417
Adding to a property list only if there is no existing value is used
twice. Once by zpool create -R and again by zpool import -R. Now that
zpool create -t and zpool import -t also need it, lets refactor it into
a helper function to make the code more readable.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2417
The executables invoked by the ZED in response to a given zevent
have been generically referred to as "scripts". By convention,
these scripts have aimed to be /bin/sh compatible for reasons of
portability and comprehensibility. However, the ZED only requires
they be executable and (ideally) capable of reading environment
variables. As such, these scripts are now referred to as ZEDLETs
(ZFS Event Daemon Linkage for Executable Tasks).
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2735
When zed allocates memory via malloc(), it typically follows that
with a memset(). However, calloc() implementations can often perform
optimizations when zeroing memory:
https://stackoverflow.com/questions/2688466/why-mallocmemset-is-slower-than-calloc
This commit replaces zed's use of malloc() with calloc().
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2736
The zed's io-spare.sh script defines a vdev_status() function to query
the 'zpool status' output for obtaining the status of a specified vdev.
This function contains a small awk script that uses a parameter
expansion (${parameter/pattern/string}) supported in bash but not
in dash. Under dash, this fails with a "Bad substitution" error.
This commit replaces the awk script with a (hopefully more portable)
sed script that has been tested under both bash and dash.
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2536
The 'zpool list -v' command displays lots of info but excludes the
capacity of each disk. This should be added.
5147 zpool list -v should show individual disk capacity
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Matthew Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5147https://github.com/illumos/illumos-gate/commit/7a09f97
Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2688
Remove all occurrences of reverse indentation from zed comments for
consistency within the project code base.
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2695
Change mount code to diagnose filesystem versions that
are not supported by the current implementation.
Change upgrade code to do likewise and refuse to upgrade
a pool if any filesystems on it are a version which is
not supported by the current implementation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Dan Swartzendruber <dswartz@druber.com>
Closes: #2616
When the ZED_EMAIL_INTERVAL_SECS="3600" option is set in zed.rc
configuration file then notification emails should be rate limited.
Rate limiting is accomplished by maintaining a colon delimited state
file which includes the device name. Unfortunately there are valid
device names which include a colon and therefore prevent the rate
limiting for working properly. For this reason the delimiter has
been changed to a semi-colon.
Signed-off-by: louwrentius <louwrentius@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Closes#2645
This is a set of minor cleanup changes related to zed logging:
- Remove the program identity prefix from messages written to stderr
since systemd already prepends this output with the program name.
- Replace the copy of the program identity string with a ptr reference.
- Replace "pid" with "PID" for consistency in comments & strings.
- Rename the zed_log.c struct _ctx component "level" to "priority".
- Add the LOG_PID option for messages written to syslog.
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2252
When the zed is started as a forking daemon (by default),
a race-condition exists where the parent process can terminate before
the pidfile has been created by the grandchild process. When invoked
as a Type=forking systemd service, this can result in the following:
systemd[1]: Starting ZFS Event Daemon (zed)...
systemd[1]: PID file /var/run/zed.pid not readable (yet?) after start.
This commit adds a daemonize pipe to allow the grandchild process to
signal the parent process that initialization is complete (and the
pidfile has been created). The parent process will wait for this
notification before exiting.
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2252
4970 need controls on i/o issued by zpool import -XF
4971 zpool import -T should accept hex values
4972 zpool import -T implies extreme rewind, and thus a scrub
4973 spa_load_retry retries the same txg
4974 spa_load_verify() reads all data twice
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/issues/4970https://www.illumos.org/issues/4971https://www.illumos.org/issues/4972https://www.illumos.org/issues/4973https://www.illumos.org/issues/4974https://github.com/illumos/illumos-gate/commit/e42d205
Notes:
This set of patches adds a set of tunable parameters for the
"extreme rewind" mode of pool import which allows control over
the traversal performed during such an import.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2598
The Intel DC S3500 and Intel DC S3700 are optimized to handle 4KB
sectors well despite of their 8KB page sizes, so we move them to a new
category for enterprise drives where they will receive ashift=12. They
are joined by the Intel 730 series, which uses the same disk controller,
as well as a San Disk enterprise drive. The drive IDs for these two were
obtained by myself with the drive_id utility. The drive ID for the 240GB
Intel 730 model was extrapolated from the drive ID for the 480GB model.
Lastly, we also add some Western Digital mobile drives. ryuo in
\#zfsonlinux on freenode obtained "ATA WDC WD2500BEVT-0" from
running drive_id on his own hardware. The additional drives in that
family were extrapolated from that identifer.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2601
4976 zfs should only avoid writing to a failing non-redundant top-level vdev
4978 ztest fails in get_metaslab_refcount()
4979 extend free space histogram to device and pool
4980 metaslabs should have a fragmentation metric
4981 remove fragmented ops vector from block allocator
4982 space_map object should proactively upgrade when feature is enabled
4983 need to collect metaslab information via mdb
4984 device selection should use fragmentation metric
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
https://www.illumos.org/issues/4976https://www.illumos.org/issues/4978https://www.illumos.org/issues/4979https://www.illumos.org/issues/4980https://www.illumos.org/issues/4981https://www.illumos.org/issues/4982https://www.illumos.org/issues/4983https://www.illumos.org/issues/4984https://github.com/illumos/illumos-gate/commit/2e4c998
Notes:
The "zdb -M" option has been re-tasked to display the new metaslab
fragmentation metric and the new "zdb -I" option is used to control
the maximum number of in-flight I/Os.
The new fragmentation metric is derived from the space map histogram
which has been rolled up to the vdev and pool level and is presented
to the user via "zpool list".
Add a number of module parameters related to the new metaslab weighting
logic.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2595
Add a new 'overlay' property (default 'off') that controls whether the
filesystem should be mounted even if the mountpoint is busy or if it
should fail with a 'mountpoint not empty'.
Doing overlay mounts is the default mount behavior on Linux, but not
in ZFS. It have been decided that following the ZFS behavior should
be the default, but this overlay allows for site administrator to
override this decision on a per-dataset basis.
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #2503
4914 zfs on-disk bookmark structure should be named *_phys_t
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/issues/4914https://github.com/illumos/illumos-gate/commit/7802d7b
Porting notes:
There were a number of zfsonlinux-specific uses of zbookmark_t which
needed to be updated. This should reduce the likelihood of further
problems like issue #2094 from occurring.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2558
4757 ZFS embedded-data block pointers ("zero block compression")
4913 zfs release should not be subject to space checks
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/4757https://www.illumos.org/issues/4913https://github.com/illumos/illumos-gate/commit/5d7b4d4
Porting notes:
For compatibility with the fastpath code the zio_done() function
needed to be updated. Because embedded-data block pointers do
not require DVAs to be allocated the associated vdevs will not
be marked and therefore should not be unmarked.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2544
As of a recent group of Illumos/Delphix updates, zed needs libzfs_core
in order to resolve lzc_get_bookmarks() and likely other functions
going forward.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2534
4374 dn_free_ranges should use range_tree_t
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/4374https://github.com/illumos/illumos-gate/commit/bf16b11
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2531
4370 avoid transmitting holes during zfs send
4371 DMU code clean up
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Garrett D'Amore <garrett@damore.org>a
References:
https://www.illumos.org/issues/4370https://www.illumos.org/issues/4371https://github.com/illumos/illumos-gate/commit/43466aa
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2529
4171 clean up spa_feature_*() interfaces
4172 implement extensible_dataset feature for use by other zpool features
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Garrett D'Amore <garrett@damore.org>a
References:
https://www.illumos.org/issues/4171https://www.illumos.org/issues/4172https://github.com/illumos/illumos-gate/commit/2acef22
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2528
This functionality is already available in 'zfs get'. Providing
it for 'zpool get' is useful and good for consistency.
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #2522
4756 metaslab_group_preload() could deadlock
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>
The metaslab_group_preload() function grabs the mg_lock and then later
tries to grab the metaslab lock. This lock ordering may lead to a
deadlock since other consumers of the mg_lock will grab the metaslab
lock first.
References:
https://www.illumos.org/issues/4756https://github.com/illumos/illumos-gate/commit/30beaff
Ported-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2488
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101https://www.illumos.org/issues/4102https://www.illumos.org/issues/4103https://www.illumos.org/issues/4105https://www.illumos.org/issues/4106https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2488
When given a pool name via -e, zdb would attempt an import. If it
failed, then it would attempt a verbatim import. This behavior is
not always desirable so a -V switch is added to zdb to control the
behavior. When specified, a verbatim import is done. Otherwise,
the behavior is as it was previously, except no verbatim import
is done on failure.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2372
4168 ztest assertion failure in dbuf_undirty
4169 verbatim import causes zdb to segfault
4170 zhack leaves pool in ACTIVE state
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
https://www.illumos.org/issues/4168https://www.illumos.org/issues/4169https://www.illumos.org/issues/4170https://github.com/illumos/illumos-gate/commit/7fdd916
Porting notes:
Of particular interest when troubleshooting corrupted pools, the
commonly-used "zdb -e" operation may perform verbatim imports and
furthermore, it will soon have direct support for verbatim imports via
a new "-V" option. The 4169 fix eliminates a common segfault case in
which spa_history_log_version() tries to access an un-opened dsl_pool_t.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2451Closes#2283Closes#2467
This patch is a zdb extension of the '-b' option, producing a histogram
of the physical compressed block sizes per DMU object type on disk. The
'-bbbb' option to zdb will uncover this new feature; here's an example
usage on a new pool and snippet of the output it generates:
# zpool create tank /dev/vd{b,c,d}
# dd bs=1k if=/dev/urandom of=/tank/1kfile count=1
# dd bs=3k if=/dev/urandom of=/tank/3kfile count=1
# dd bs=64k if=/dev/urandom of=/tank/64kfile count=1
# zdb -bbbb tank
...
3 68.0K 68.0K 68.0K 22.7K 1.00 34.26 ZFS plain file
psize (in 512-byte sectors): number of blocks
2: 1 *
3: 0
4: 0
5: 0
6: 1 *
7: 0
...
127: 0
128: 1 *
...
The blocks are also broken down by their indirection level. Expanding on
the above example:
# zfs set recordsize=1k tank
# dd bs=1k if=/dev/urandom of=/tank/2x1kfile count=2
# zdb -bbbb tank
...
1 16K 1K 2K 2K 16.00 1.02 L1 ZFS plain file
psize (in 512-byte sectors): number of blocks
2: 1 *
5 70.0K 70.0K 70.0K 14.0K 1.00 35.71 L0 ZFS plain file
psize (in 512-byte sectors): number of blocks
2: 3 ***
3: 0
4: 0
5: 0
6: 1 *
7: 0
...
127: 0
128: 1 *
6 86.0K 71.0K 72.0K 12.0K 1.21 36.73 ZFS plain file
psize (in 512-byte sectors): number of blocks
2: 4 ****
3: 0
4: 0
5: 0
6: 1 *
7: 0
...
127: 0
128: 1 *
...
There's now a single 1K L1 block which is the indirect block needed for
the '2x1kfile' file just created, as well as two more 1K L0 blocks from
the same file.
This can be used to get a distribution of the block sizes used within
the pool, on a per object type basis.
References:
https://illumos.org/issues/3641https://github.com/illumos/illumos-gate/commit/490d05b
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Boris Protopopov <boris.protopopov@me.com>
Closes#2456
Users need to be aware that when replacing devices in an existing
pool they may need to override automatically detected ashift value.
This will all depend on the exact hardware they are using.
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2024
According to the man page, "When the noauto option is set, a dataset
can only be mounted and unmounted explicitly. The dataset is not
mounted automatically when the dataset is created or imported ...."
When cloning a dataset the canmount property was not being honored.
This patch adds the required check to achieve the behavior described
in the man page.
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2241
Clang's static analyzer reported that the value assigned to pcksum is
never used. That is because we initialize both zc and pcksum to {{ 0 }}
and then do `pcksum = zc;`. That is fairly pointless. However, it has
the effect of generating a false positive in Clang's static analyzer.
Since noise from false positives can obscure real issues, we fix it
anyway.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2330
Verbatim imports can cause hostid mismatches, but things otherwise work. `zpool
status` does not handle this and will fail when assertions are enabled:
```
zpool: ../../cmd/zpool/zpool_main.c:4418: status_callback: Assertion `reason == ZPOOL_STATUS_OK' failed.
Program received signal SIGABRT, Aborted.
```
Lets instead add a case to display an informative message such as this:
```
pool: rpool
state: ONLINE
status: Mismatch between pool hostid and system hostid on imported pool.
This pool was previously imported into a system with a different hostid,
and then was verbatim imported into this system.
action: Export this pool on all systems on which it is imported.
Then import it to correct the mismatch.
see: http://zfsonlinux.org/msg/ZFS-8000-EY
scan: scrub repaired 0 in 0h8m with 0 errors on Thu Apr 17 19:43:57 2014
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
sda ONLINE 0 0 0
errors: No known data errors
```
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2342
When fetching property values of snapshots, a check against the head
dataset type must be performed. Previously, this additional check was
performed only when fetching "version", "normalize", "utf8only" or "case".
This caused the ZPL properties "acltype", "exec", "devices", "nbmand",
"setuid" and "xattr" to be erroneously displayed with meaningless values
for snapshots of volumes. It also did not allow for the display of
"volsize" of a snapshot of a volume.
This patch adds the headcheck flag paramater to zfs_prop_valid_for_type()
and zprop_valid_for_type() to indicate the check is being done
against a head dataset's type in order that properties valid only for
snapshots are handled correctly. This allows the the head check in
get_numeric_property() to be performed when fetching a property for
a snapshot.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2265
ztest is intended to subject the ZFS code in userland to stress that it
should be able to withstand. Any failures that occur when running it are
failures that likely would occur inside the kernel. However, being in
userland, it is much easier to debug them. In practice, this prevents
a large number of problems from reaching production code.
A design decision was made by the original authors of ztest to make a
distinction between userland locking primitives and kernel locking
primitives. The ztest code itself calls userland locking primitives
while the kernel code being run in userland will call emulated kernel
locking primitives that wrap the userland locking primitives.
When ztest was first ported to Linux, a decision was made to use the
emulated kernel interfaces everywhere. In effect, the userland
rw_rdlock()/rw_wrlock() became the kernel rw_enter() and and the userland
rw_unlock() became the kernel rw_exit(). This caused a regression
because of an assertion in rw_enter() to catch recursive locking. That
is permitted in userland, but not in the kernel. Consequently, the ztest
code itself does recursive read locking. The use of the emulated kernel
interfaces consequently caused the following failure:
ztest: ../../lib/libzpool/kernel.c:384: Assertion `rwlp->rw_owner !=
zk_thread_current() (0x1c87150 != 0x1c87150)' failed.
That occurs because ztest_dmu_objset_create_destroy() will take a read
lock and call ztest_dmu_object_alloc_free(). That will call ztest_io(),
which will take a readlock only when asked to do ZTEST_IO_REWRITE. This
triggered the assertion.
The pthreads rwlock interface was based on the LWP rwlock interface
implemented in Illumos libc. Luckily enough, the subset used by ztest is
almost identical, so we can solve this problem by switching to the LWP
thread rwlock interface in ztest. This eliminates a point of divergence
with Illumos and should make code sharing slightly easier.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1970
zfsonlinux/zfs@1db7b9be75 should have
fixed this, but this particular string was overlooked.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2288
When processing directory components starting from the root dir,
zed_file_create_dirs() contained a bug in checking the return value of
mkdir(). A typo was made, and the test for (mkdir_errno != EEXIST) was
erroneously written as (mkdir_errno == EEXIST). If some of the leading
directory components already existed, this bug would cause the routine
to exit before creating the remaining directory components.
Instead of fixing the above mkdir_errno test, this commit replaces
zed_file_create_dirs() with mkdirp(). This cleanup was already
planned, and zed_file_create_dirs() only existed because I didn't
realize mkdirp() was already in tree at the time.
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2248
This is a continuation of fb5c53ea65:
When /etc/mtab is updated on Linux it's done atomically with
rename(2). A new mtab is written, the existing mtab is unlinked,
and the new mtab is renamed to /etc/mtab. This means that we
must close the old file and open the new file to get the updated
contents. Using rewind(3) will just move the file pointer back
to the start of the file, freopen(3) will close and open the file.
In this commit, a few more rewind(3) calls were replaced with freopen(3)
to allow updated mtab entries to be picked up immediately.
Signed-off-by: John M. Layman <jml@frijid.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2215
Issue #1611
zed supports a '-M' cmdline opt to lock all pages in memory via
mlockall(). The _POSIX_MEMLOCK define is checked to determine whether
this function is supported. The current test assumes mlockall()
is supported if _POSIX_MEMLOCK is non-zero. However, this test is
insufficient according to mlock(2) and sysconf(3). If _POSIX_MEMLOCK
is -1, mlockall() is not supported; but if _POSIX_MEMLOCK is 0,
availability must be checked at runtime.
This commit adds an autoconf check for mlockall() to user.m4. The zed
code block for mlockall() is now guarded with a test for HAVE_MLOCKALL.
If defined, mlockall() will be called and its runtime availability
checked via its return value.
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2
When a vdev starts getting I/O or checksum errors it is now
possible to automatically rebuild to a hot spare device.
To cleanly support this functionality in a shell script some
additional information was added to all zevent ereports which
include a vdev. This covers both io and checksum zevents but
may be used but other scripts.
In the Illumos FMA solution the same information is required
but it is retrieved through the libzfs library interface.
Specifically the following members were added:
vdev_spare_paths - List of vdev paths for all hot spares.
vdev_spare_guids - List of vdev guids for all hot spares.
vdev_read_errors - Read errors for the problematic vdev
vdev_write_errors - Write errors for the problematic vdev
vdev_cksum_errors - Checksum errors for the problematic vdev.
By default the required hot spare scripts are installed but this
functionality is disabled. To enable hot sparing uncomment the
ZED_SPARE_ON_IO_ERRORS and ZED_SPARE_ON_CHECKSUM_ERRORS in the
/etc/zfs/zed.d/zed.rc configuration file.
These scripts do no add support for the autoexpand property. At
a minimum this requires adding a new udev rule to detect when
a new device is added to the system. It also requires that the
autoexpand policy be ported from Illumos, see:
https://github.com/illumos/illumos-gate/blob/master/usr/src/cmd/syseventd/modules/zfs_mod/zfs_mod.c
Support for detecting the correct name of a vdev when it's not
a whole disk was added by Turbo Fredriksson.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Issue #2
This functionality has always been missing. But until now there
were no zevents which included an array of strings so it wasn't
missed. However, that's now changed so to ensure this information
is output correctly by 'zpool events -v' the DATA_TYPE_STRING_ARRAY
has been implemented.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2
Several of the zfs utilities allow you to pass a vdev's guid rather
than the device name. However, the utilities are not consistent in
how they parse that guid. For example, 'zinject' expects the guid
to be passed as a hex value while 'zpool replace' wants it as a
decimal. The user is forced to just know what format to use.
This patch improve things by making the parsing more tolerant.
When strtol(3) is called using 0 for the base, rather than say
10 or 16, it will then accept hex, decimal, or octal input based
on the prefix. From the man page.
If base is zero or 16, the string may then include a "0x"
prefix, and the number will be read in base 16; otherwise,
a zero base is taken as 10 (decimal) unless the next character
is '0', in which case it is taken as 8 (octal).
NOTE: There may be additional conversions not caught be this patch.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2
zed monitors ZFS events. When a zevent is posted, zed will run any
scripts that have been enabled for the corresponding zevent class.
Multiple scripts may be invoked for a given zevent. The zevent
nvpairs are passed to the scripts as environment variables.
Events are processed synchronously by the single thread, and there is
no maximum timeout for script execution. Consequently, a misbehaving
script can delay (or forever block) the processing of subsequent
zevents. Plans are to address this in future commits.
Initial scripts have been developed to log events to syslog
and send email in response to checksum/data/io errors and
resilver.finish/scrub.finish events. By default, email will only
be sent if the ZED_EMAIL variable is configured in zed.rc (which is
serving as a config file of sorts until a proper configuration file
is implemented).
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2
zpool_events_next() can be called in blocking mode by specifying a
non-zero value for the "block" parameter. However, the design of
the ZFS Event Daemon (zed) requires additional functionality from
zpool_events_next(). Instead of adding additional arguments to the
function, it makes more sense to use flags that can be bitwise-or'd
together.
This commit replaces the zpool_events_next() int "block" parameter with
an unsigned bitwise "flags" parameter. It also defines ZEVENT_NONE
to specify the default behavior. Since non-blocking mode can be
specified with the existing ZEVENT_NONBLOCK flag, the default behavior
becomes blocking mode. This, in effect, inverts the previous use
of the "block" parameter. Existing callers of zpool_events_next()
have been modified to check for the ZEVENT_NONBLOCK flag.
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2
Due to the very poorly chosen argument name 'cleanup_fd' it was
completely unclear that this file descriptor is used to track the
current cursor location. When the file descriptor is created by
opening ZFS_DEV a private cursor is created in the kernel for the
returned file descriptor. Subsequent calls to zpool_events_next()
and zpool_events_seek() then require the file descriptor as an
argument to reposition the cursor. When the file descriptor is
closed the kernel state tracking the cursor is destroyed.
This patch contains no functional change, it just changes a
few variable names and clarifies the documentation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2
3120 zinject hangs in zfsdev_ioctl() due to uninitialized zc
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/3120illumos/illumos-gate@f4c46b1eda
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2152
Originally, users had to handle spa namespace collisions by either
exporting the already imported pool or by specifying a new name for the
pool with a conflicting name. In the case of root pools from virtual
guests, neither approach to collision resolution is reasonable. This is
addressed by extending the new name syntax with a -t option to specify
that the new name is temporary. When specified, this sets an internal
flag that is passed into the kernel to tell it that all label updates
should refer to the name used in the original label. Consequently, the
original pool name will be retained on export.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2189
Stopping arcstat.py with ^C always ends up with error:
TypeError: sighandler() takes no arguments (2 given)
Since no special signal handling was done in sighandler(),
it's simpler to just set SIGINT handler to SIG_DFL, which
terminates the script.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Isaac Huang <he.huang@intel.com>
Closes#2179
This is just a small bit of cleanup to ensure ztest fails early
on systems where mmap(2) is not functioning. For the automated
testing which is the primary consumer of ztest there is no
functional change because debugging is always enabled.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2177
Valgrind complained about this and it's absolutely right. The
props nvlist was not being freed in ztest_init.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#2174
Stopping arcstat.py with ^C always ends up with error:
TypeError: sighandler() takes no arguments (2 given)
This patch corrects the error by updating the signal handler
to take the correct number of arguments.
Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2182
Today arcstat.py prints one header every hdr_intr (20 by default)
lines. It would be more consistent with out utilities like vmstat
if hdr_intr defaulted to terminal window size, i.e. one header
per screenful of outputs.
Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2183
ZoL commit 1421c89 unintentionally changed the disk format in a forward-
compatible, but not backward compatible way. This was accomplished by
adding an entry to zbookmark_t, which is included in a couple of
on-disk structures. That lead to the creation of pools with incorrect
dsl_scan_phys_t objects that could only be imported by versions of ZoL
containing that commit. Such pools cannot be imported by other versions
of ZFS or past versions of ZoL.
The additional field has been removed by the previous commit. However,
affected pools must be imported and scrubbed using a version of ZoL with
this commit applied. This will return the pools to a state in which they
may be imported by other implementations.
The 'zpool import' or 'zpool status' command can be used to determine if
a pool is impacted. A message similar to one of the following means your
pool must be scrubbed to restore compatibility.
$ zpool import
pool: zol-0.6.2-173
id: 1165955789558693437
state: ONLINE
status: Errata #1 detected.
action: The pool can be imported using its name or numeric identifier,
however there is a compatibility issue which should be corrected
by running 'zpool scrub'
see: http://zfsonlinux.org/msg/ZFS-8000-ER
config:
...
$ zpool status
pool: zol-0.6.2-173
state: ONLINE
scan: pool compatibility issue detected.
see: https://github.com/zfsonlinux/zfs/issues/2094
action: To correct the issue run 'zpool scrub'.
config:
...
If there was an async destroy in progress 'zpool import' will prevent
the pool from being imported. Further advice on how to proceed will be
provided by the error message as follows.
$ zpool import
pool: zol-0.6.2-173
id: 1165955789558693437
state: ONLINE
status: Errata #2 detected.
action: The pool can not be imported with this version of ZFS due to an
active asynchronous destroy. Revert to an earlier version and
allow the destroy to complete before updating.
see: http://zfsonlinux.org/msg/ZFS-8000-ER
config:
...
Pools affected by the damaged dsl_scan_phys_t can be detected prior to
an upgrade by running the following command as root:
zdb -dddd poolname 1 | grep -P '^\t\tscan = ' | sed -e 's;scan = ;;' | wc -w
Note that `poolname` must be replaced with the name of the pool you wish
to check. A value of 25 indicates the dsl_scan_phys_t has been damaged.
A value of 24 indicates that the dsl_scan_phys_t is normal. A value of 0
indicates that there has never been a scrub run on the pool.
The regression caused by the change to zbookmark_t never made it into a
tagged release, Gentoo backports, Ubuntu, Debian, Fedora, or EPEL
stable respositorys. Only those using the HEAD version directly from
Github after the 0.6.2 but before the 0.6.3 tag are affected.
This patch does have one limitation that should be mentioned. It will not
detect errata #2 on a pool unless errata #1 is also present. It expected
this will not be a significant problem because pools impacted by errata #2
have a high probably of being impacted by errata #1.
End users can ensure they do no hit this unlikely case by waiting for all
asynchronous destroy operations to complete before updating ZoL. The
presence of any background destroys on any imported pools can be checked
by running `zpool get freeing` as root. This will display a non-zero
value for any pool with an active asynchronous destroy.
Lastly, it is expected that no user data has been lost as a result of
this erratum.
Original-patch-by: Tim Chase <tim@chase2k.com>
Reworked-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2094
From time to time it may be necessary to inform the pool administrator
about an errata which impacts their pool. These errata will by shown
to the administrator through the 'zpool status' and 'zpool import'
output as appropriate. The errata must clearly describe the issue
detected, how the pool is impacted, and what action should be taken
to resolve the situation. Additional information for each errata will
be provided at http://zfsonlinux.org/msg/ZFS-8000-ER.
To accomplish the above this patch adds the required infrastructure to
allow the kernel modules to notify the utilities that an errata has
been detected. This is done through the ZPOOL_CONFIG_ERRATA uint64_t
which has been added to the pool configuration nvlist.
To add a new errata the following changes must be made:
* A new errata identifier must be assigned by adding a new enum value
to the zpool_errata_t type. New enums must be added to the end to
preserve the existing ordering.
* Code must be added to detect the issue. This does not strictly
need to be done at pool import time but doing so will make the
errata visible in 'zpool import' as well as 'zpool status'. Once
detected the spa->spa_errata member should be set to the new enum.
* If possible code should be added to clear the spa->spa_errata member
once the errata has been resolved.
* The show_import() and status_callback() functions must be updated
to include an informational message describing the errata. This
should include an action message describing what an administrator
should do to address the errata.
* The documentation at http://zfsonlinux.org/msg/ZFS-8000-ER must be
updated to describe the errata. This space can be used to provide
as much additional information as needed to fully describe the errata.
A link to this documentation will be automatically generated in the
output of 'zpool import' and 'zpool status'.
Original-idea-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.or
Issue #2094
Both the zpool_import_status() and zpool_get_status() functions
return the zpool_status_t enum. This explicit type should be used
rather than the more generic int type.
This patch makes no functional change and should only be considered
code cleanup. It happens to have been done in the context of #2094
because that's when I noticed this issue.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.or
Issue #2094
The chunksize must always be strictly smaller than the regionsize.
Signed-off-by: Andrew Uselton <andrew.c.uselton@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2072
The vdev_id udev helper currently applies slot renumbering rules to
every channel (JBOD) in the system. This is too inflexible for systems
with non-homogeneous storage topologies. The "slot" keyword now takes
an optional third parameter which names a channel to which the mapping
will apply. If the third parameter is omitted then the rule applies to
all channels. The first-specified rule that can match a slot takes
precedence. Therefore a channel-specific rule for a given slot should
generally appear before a generic rule for the same slot number. In
this way a custom slot mapping can be applied to a particular channel
and a default mapping applied to the rest.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2056
31fc19399e incorrectly removed $(LIBBLKID)
from cmd/zpool/Makefile.am. This meant that the toolchain was not given
-lblkid, which resulted in the following build failure on Ubuntu 13.10:
/usr/bin/ld: zpool_vdev.o: undefined reference to symbol
'blkid_put_cache@@BLKID_1.0'
/lib/x86_64-linux-gnu/libblkid.so.1: error adding symbols: DSO missing
from command line
collect2: error: ld returned 1 exit status
That commit reworked various Makefile.am to follow best practices, so we
reintroduce $(LIBBLKID) in a manner consistent with that, rather than
explicitly reverting the change.
Reproduction of this issue was done on a Gentoo Linux system by
executing the following commands:
zfs create -o mountpoint=/mnt/ubuntu-13.10 rpool/ROOT/ubuntu-13.10
debootstrap --variant=buildd --arch amd64 saucy /mnt/ubuntu-13.10 http://archive.ubuntu.com/ubuntu/
mount -o bind /dev /mnt/ubuntu-13.10/dev/
mount -o bind /proc/ /mnt/ubuntu-13.10/proc/
mount -o bind /sys/ /mnt/ubuntu-13.10/sys/
cp /etc/resolv.conf /mnt/ubuntu-13.10/etc/
(cd /mnt/ubuntu-13.10/root/ && git clone git://github.com/zfsonlinux/zfs.git)
chroot /mnt/ubuntu-13.10/
apt-get install git autoconf libtool zlib1g-dev uuid-dev libblkid-dev
\#apt-get install alien fakeroot vim
cd /root/zfs
./autogen.sh
./configure --with-config=user --prefix=/usr
make
That will create a Ubuntu 13.10 chroot, fetch the sources and build
test. At this point, cmd/zpool/Makefile.am was modified and the
following commands were run to verify that the build issue was resolved:
git clean -xdf
./autogen.sh
./configure --with-config=user --prefix=/usr
make
Although it is not shown here, the absence of libblkid-dev enables ZFS
to build successfully without the patch. This could explain how this
escaped detection until recently. A test without libblkid-dev was done
to verify that the patch did not cause a regression in the absence of
libblkid:
apt-get remove libblkid-dev
git clean -xdf
./autogen.sh
./configure --with-config=user --prefix=/usr
make
Additionally, the commands themselves were tested against my live system
from within the chroot to ensure basic functionality. My live system had
corresponding kernel modules already installed and basic commands such
as `zpool list` and `zfs list` worked without incident. Lastly, this
patch was also build tested on Gentoo Linux, where it caused no
problems.
At time of writing, these steps can be used to reproduce these results
on any modern Linux system that has debootstrap installed. On Gentoo,
installing debootstrap can be done with `emerge dev-util/debootstrap`.
The current ZFSOnLinux HEAD revision as of writing is
fd23720ae1. Once this is fixed in HEAD,
either that revision or another before this fix and after
31fc19399e will be needed to reproduce
this issue.
Lastly, it remains to be seen why the toolchains on the systems
performing regression tests did not catch this. This is not a
ZFS-specific issue, but it is something that we will want to explore in
the future.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2038
Re-enable the /etc/mtab cache to prevent the zfs command from
having to repeatedly open and read from the /etc/mtab file.
Instead an AVL tree of the mounted filesystems is created and
used to vastly speed up lookups. This means that if non-zfs
filesystems are mounted concurrently the 'zfs mount' will not
immediately detect them. In practice that will rarely happen
and even if it does the absolute worst case would be a failed
mount. This was originally disabled out of an abundance of
paranoia.
NOTE: There may still be some parts of the code which do not
consult the mtab cache. They should be updated to check the
mtab cache as they as discovered to be a problem.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Issue #845
Move the libzfs_fini() after the zpool_log_history() call so the
ZPOOL_HIST_CMD entry can get written.
Fix the handling of saved_poolname in zfsdev_ioctl()
which was broken as part of the stack-reduction work in
a168788053.
Since ZoL destroys the TSD data in which the previously successful
ioctl()'s pool name is stored following every vop, the ZFS_IOC_LOG_HISTORY
ioctl has a very important restriction: it can only successfully write
a long entry following a successful ioctl() if no intervening vops have
been performed. Some of zfs subcommands do perform intervening vops and
to do the logging themselves. At the moment, the "create" and "clone"
subcommands have been modified appropriately.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1998
Four new dataset properties have been added to support SELinux. They
are 'context', 'fscontext', 'defcontext' and 'rootcontext' which map
directly to the context options described in mount(8). When one of
these properties is set to something other than 'none'. That string
will be passed verbatim as a mount option for the given context when
the filesystem is mounted.
For example, if you wanted the rootcontext for a filesystem to be set
to 'system_u:object_r:fs_t' you would set the property as follows:
$ zfs set rootcontext="system_u:object_r:fs_t" storage-pool/media
This will ensure the filesystem is automatically mounted with that
rootcontext. It is equivalent to manually specifying the rootcontext
with the -o option like this:
$ zfs mount -o rootcontext=system_u:object_r:fs_t storage-pool/media
By default all four contexts are set to 'none'. Further information
on SELinux contexts is detailed in mount(8) and selinux(8) man pages.
Signed-off-by: Matthew Thode <prometheanfire@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#1504
The vast majority of these changes are in Linux specific code.
They are the result of not having an automated style checker to
validate the code when it was originally written. Others were
caused when the common code was slightly adjusted for Linux.
This patch contains no functional changes. It only refreshes
the code to conform to style guide.
Everyone submitting patches for inclusion upstream should now
run 'make checkstyle' and resolve any warning prior to opening
a pull request. The automated builders have been updated to
fail a build if when 'make checkstyle' detects an issue.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1821
4208 Typo in zfs_main.c: "posxiuser"
Reviewed by: Sonu Pillai <sonu.pillai@delphix.com>
Reviewed by: Will Guyette <will.guyette@delphix.com>
Reviewed by: Eric Diven <eric.diven@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/4208illumos/illumos-gate@f38cb554a5
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1986
Add acl, noacl and posixacl to option_map, avoiding ENOENT error
case when mount from util-linux-2.24 execs mount.zfs with any of
those flags
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: renelson <bnelson@nelsonbe.com>
Issue #1968
A minor grammar error was corrected in in the parse_options()
error handling for the ENOENT case.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: renelson <bnelson@nelsonbe.com>
Issue #1968
The bug is caused by multipath output like this:
35000c50056bd77a7 dm-15 HP,MB3000FCWDH
size=2.7T features='0' hwhandler='0' wp=rw
|-+- policy='round-robin 0' prio=0 status=active
| `- 2:0:16:0 sdq 65:0 active undef running
`-+- policy='round-robin 0' prio=0 status=enabled
`- 4:0:52:0 sdfp 130:176 active undef running
Note that the pipe symbols mean that the field numbering is different
between the sdq and sdfp lines. The fix edits out the pipe symbols.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1692
Added:
Adata S396 (obtained from drive_id)
Apple MacBookAir3,1 SSD (obtained from drive_id)
Apple MacBookPro10,1 SSD (obtained from drive_id)
Intel 510 (obtained from drive_id)
Intel 710 (obtained from drive_id)
Intel DC S3500 (obtained from drive_id)
Netapp LUN (obtained from illumos user's sd.conf)
OCZ Agility 3 (obtained from drive_id)
OCZ Vertex (obtained from drive_id)
Samsung PM800 (obtained from drive_id)
Sandisk U100 (obtained from drive_id)
Sun Comstar (obtained from illumos user's sd.conf)
Notes:
1. The entries for the Intel DC S3500 were extrapolated from the 800GB
model's entry, which is "ATA INTEL SSDSC2BB80".
2. The entires for the Intel 710 were extrapolated from the 120GG
model's entry, which is "ATA INTEL SSDSA2BZ12".
3. The entires for the Intel 510 were extrapolated from the 250GB
model's entry, which is "ATA INTEL SSDSC2MH25".
4. The entires for the Apple MacBookPro10,1 SSD were extrapolated from
the 512GB model's entry, which is "ATA APPLE SSD SM512E". Google
searches suggest that this is a rebadged Samsung 830.
5. The entires for the Apple MacBookAir3,1 SSD were extrapolated from
the 128GB model's entry, which is "ATA APPLE SSD TS128C". Google
searches suggest that this is a rebadged Kingston SSDNow V+ 100 (based
on Toshiba).
6. Sun Comstar is an iSCSI Target, so we cannot tell what the correct
sector size is through this method. We list it only for reference
purposes, but it is commented out. Similarly, it is not clear what the
right thing to do for Netapp is, so we comment it out.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1907
On some platforms symbols provided by libzfs_core and used by
libzfs were not available to the linker. To avoid this issue
libzfs_core has been added to the list of required libraries
when building utilities which depend on libzfs. This should
have been handled properly by libtool and it's still not
entirely clear why it wasn't on all platforms.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1841
Future proofing for compatibility with newer versions of Python.
Signed-off-by: Matthew Thode <prometheanfire@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1838
Update the code to follow the pep8 style guide.
References:
http://www.python.org/dev/peps/pep-0008/
Signed-off-by: Matthew Thode <prometheanfire@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1838
In the current snapshot automount implementation, it is possible for
multiple mounts to attempted concurrently. Only one of the mounts will
succeed and the other will fail. The failed mounts will cause an EREMOTE
to be propagated back to the application.
This commit works around the problem by adding a new exit status,
MOUNT_BUSY to the mount.zfs program which is used when the underlying
mount(2) call returns EBUSY. The zfs code detects this condition and
treats it as if the mount had succeeded.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1819
3956 ::vdev -r should work with pipelines
3957 ztest should update the cachefile before killing itself
3958 multiple scans can lead to partial resilvering
3959 ddt entries are not always resilvered
3960 dsl_scan can skip over dedup-ed blocks if physical birth != logical birth
3961 freed gang blocks are not resilvered and can cause pool to suspend
3962 ztest should print out zfs debug buffer before exiting
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/3956https://www.illumos.org/issues/3957https://www.illumos.org/issues/3958https://www.illumos.org/issues/3959https://www.illumos.org/issues/3960https://www.illumos.org/issues/3961https://www.illumos.org/issues/3962illumos/illumos-gate@b4952e17e8
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Porting notes:
1. zfs_dbgmsg_print() is only used in userland. Since we do not have
mdb on Linux, it does not make sense to make it available in the
kernel. This means that a build failure will occur if any future
kernel patch depends on it. However, that is unlikely given that
this functionality was added to support zdb.
2. zfs_dbgmsg_print() is only invoked for -VVV or greater log levels.
This preserves the existing behavior of minimal noise when running
with -V, and -VV.
3. In vdev_config_generate() the call to nvlist_alloc() was not
changed to fnvlist_alloc() because we must pass KM_PUSHPAGE in
the txg_sync context.
3949 ztest fault injection should avoid resilvering devices
3950 ztest: deadman fires when we're doing a scan
3951 ztest hang when running dedup test
3952 ztest: ztest_reguid test and ztest_fault_inject don't place nice together
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/3949https://www.illumos.org/issues/3950https://www.illumos.org/issues/3951https://www.illumos.org/issues/3952illumos/illumos-gate@2c1e2b4414
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
Porting notes:
1. The deadman thread was removed from ztest during the original
port because it depended on Solaris thr_create() interface.
This functionality should be reintroduced using the more
portable pthreads.
3112 ztest does not honor ZFS_DEBUG
3113 ztest should use watchpoints to protect frozen arc bufs
3114 some leaked nvlists in zfsdev_ioctl
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Amdur <Matt.Amdur@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>
References:
https://www.illumos.org/issues/3112https://www.illumos.org/issues/3113https://www.illumos.org/issues/3114illumos/illumos-gate@cd1c8b85eb
The /proc/self/cmd watchpoint interface is specific to Solaris.
Therefore, the #3113 implementation was reworked to use the more
portable mprotect(2) system call. When the pages are watched they
are marked read-only for protection. Any write to the protected
address range immediately trigger a SIGSEGV. The pages are marked
writable again when they are unwatched.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1489
3236 zio nop-write
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
illumos/illumos-gate@80901aea8ehttps://www.illumos.org/issues/3236
Porting Notes
1. This patch is being merged dispite an increased instance of
https://www.illumos.org/issues/3113 being triggered by ztest.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1489
3894 zfs should not allow snapshot of inconsistent dataset
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References:
https://www.illumos.org/issues/3894illumos/illumos-gate@ca48f36f20
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
3740 Poor ZFS send / receive performance due to snapshot
hold / release processing
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>
References:
https://www.illumos.org/issues/3740illumos/illumos-gate@a7a845e4bf
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
Porting notes:
1. 13fe019870 introduced a merge conflict
in dsl_dataset_user_release_tmp where some variables were moved
outside of the preprocessor directive.
2. dea9dfefdd747534b3846845629d2200f0616dad made the previous merge
conflict worse by switching KM_SLEEP to KM_PUSHPAGE. This is notable
because this commit refactors the code, adding a new KM_SLEEP
allocation. It is not clear to me whether this should be converted
to KM_PUSHPAGE.
3. We had a merge conflict in libzfs_sendrecv.c because of copyright
notices.
4. Several small C99 compatibility fixed were made.
3745 zpool create should treat -O mountpoint and -m the same
3811 zpool create -o altroot=/xyz -O mountpoint=/mnt ignores
the mountpoint option
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>
References:
https://www.illumos.org/issues/3745https://www.illumos.org/issues/3811illumos/illumos-gate@8b71377531
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
3603 panic from bpobj_enqueue_subobj()
3604 zdb should print bpobjs more verbosely
3871 GCC 4.5.3 does not like issue 3604 patch
Reviewed by: Henrik Mattson <henrik.mattson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
https://www.illumos.org/issues/3603https://www.illumos.org/issues/3604https://www.illumos.org/issues/3871illumos/illumos-gate@d04756377d
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
Note that the patch from Illumos issue 3871 is not accepted into Illumos
at the time of this writing. It is something that I wrote when porting
this. Documentation is in the Illumos issue.
This works the same as the -p switch to "zfs get", displaying full
resolution values for appropriate attributes.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1813
The dbufstat.py command was added to provide a conveniant way to
easily determine what ZFS is caching. The script consumes the
raw /proc/spl/kstat/zfs/dbufs kstat data can consolidates it in
to a more human readable form. This was designed primarily as
a tool to aid developers but it may also be useful for advanced
users who want more visibility in to what the ARC is caching.
When run without options dbufstat.py will default to showing a
list of all objects with at least one buffer present in the
cache. The total cache space consumed by that object will be
printed on the right along with the object type. Similar to the
arcstats.py command the -x option may used to display additional
fields.
Two other modes of operation are also supported by dbufstat.py
and the expectation is additional display modes may be added as
needed. The -t option will summerize the total number of bytes
cached for each object type, and the -b option will show every
dbuf currently cached.
The script was designed to be consistent with arcstat.py and
includes most of the same options and funcationality.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When creating a new pool, or adding/replacing a disk in an existing
pool, partition tables will be automatically created on the devices.
Under normal circumstances it will take less than a second for udev
to create the expected device files under /dev/. However, it has
been observed that if the system is doing heavy IO concurrently udev
may take far longer. If you also throw in some cheap dodgy hardware
it may take even longer.
To prevent zpool commands from failing due to this the default wait
time for udev is being increased to 30 seconds. This will have no
impact on normal usage, the increase timeout should only be noticed
if your udev rules are incorrectly configured.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1646
Document the "-D" and "-T" options and the optional interval
and count or "zpool status".
Also for zpool's man page, use a consistent order for the
various "-T" options to match the program's help output.
Document the effect of additional "-D" options for zdb.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1786
Libraries that depend on other libraries should list them in ELF's
DT_NEEDED field so that programs linking to them do not need to specify
those libraries unless they depend on them as well. This is not the case
in the current code and the consequence is that anything that needs a
library must know its dependencies. This is fragile and caused GRUB2's
configure script to break when a dependency was added on libblkid in
libzfs.
This resolves that problem by using LIBADD/LDADD to specify libraries in
Makefile.am instead of LDFLAGS. This ensures that proper DT_NEEDED
entries are generated and prevents GRUB2's configure script from
breaking in the presence of a libblkid dependency. This also removes
unneeded dependencies from various files.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1751
Add Corsair Force GS drive (obtained from drive_id)
Add Kingston HyperX 3K (obtained from drive_id)
Add OCZ Vertex 4 drive (obtained from drive_id)
Add Samsung SM843T enterprise drive (obtained from drive_id)
Add entries for additional sizes of Intel 320/330/335/520 series
Add Cruical C400 (obtained from Illumos user's sd.conf)
Add Toshiba SSD (obtained from Illumos user's sd.conf)
Add Samsung's first SLC SSD (obtained from drive_id)
Add OCZ Core Series (obtained from drive_id)
Add Intel DC S3700 (obtained from drive_id)
Notes:
1. The drive identifer obtained for the Samsung SM843T was MZ7WD480. The
rest were extrapolated. The additional entries were checked with Google
to verify that such drives exist in the wild.
2. The additional entries for Intel drives were extrapolated from
existing entries. The additional entries were checked with Google to
verify that such drives exist in the wild.
3. The "ATA C400-MTFDDAC512M" and "ATA TOSHIBA THNSNH51" entries
are from the sd.conf of gcbirzan on freenode. Additional entries were
extrapolated from them and checked with Google.
4. I obtained the Samsung MCCOE64G entry from an actual drive. The
Samsung MCCOE32G entry was extrapolated from it and checked with
Google.
5. I obtained the SSDSC2BA10 from a 100GB Intel DC S3700 drive and
extrapolated the entries for the additional models.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1752
2882 implement libzfs_core
2883 changing "canmount" property to "on" should not always remount dataset
2900 "zfs snapshot" should be able to create multiple, arbitrary snapshots at once
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Chris Siden <christopher.siden@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Dan Kruchinin <dan.kruchinin@gmail.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>
References:
https://www.illumos.org/issues/2882https://www.illumos.org/issues/2883https://www.illumos.org/issues/2900illumos/illumos-gate@4445fffbbb
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1293
Porting notes:
WARNING: This patch changes the user/kernel ABI. That means that
the zfs/zpool utilities built from master are NOT compatible with
the 0.6.2 kernel modules. Ensure you load the matching kernel
modules from master after updating the utilities. Otherwise the
zfs/zpool commands will be unable to interact with your pool and
you will see errors similar to the following:
$ zpool list
failed to read pool configuration: bad address
no pools available
$ zfs list
no datasets available
Add zvol minor device creation to the new zfs_snapshot_nvl function.
Remove the logging of the "release" operation in
dsl_dataset_user_release_sync(). The logging caused a null dereference
because ds->ds_dir is zeroed in dsl_dataset_destroy_sync() and the
logging functions try to get the ds name via the dsl_dataset_name()
function. I've got no idea why this particular code would have worked
in Illumos. This code has subsequently been completely reworked in
Illumos commit 3b2aab1 (3464 zfs synctask code needs restructuring).
Squash some "may be used uninitialized" warning/erorrs.
Fix some printf format warnings for %lld and %llu.
Apply a few spa_writeable() changes that were made to Illumos in
illumos/illumos-gate.git@cd1c8b8 as part of the 3112, 3113, 3114 and
3115 fixes.
Add a missing call to fnvlist_free(nvl) in log_internal() that was added
in Illumos to fix issue 3085 but couldn't be ported to ZoL at the time
(zfsonlinux/zfs@9e11c73) because it depended on future work.
This implements vdev_bdev_database_check(). It alters the detected
sector size of any device listed in a database of drives known to lie
about their physical sector sizes.
This is based on "6931570 Add flash devices' VID/PID to disk table to
advertising 4K physical sector size" from Open Solaris and on
sg_simple4.c from sg3_utils. About two dozen lines are taken from
sg_simple4.c, which is GPLv2 licensed. However, sg_simple4.c is
analogous to a Hello World program and is safe for us to use. We
requested that Douglas Gilbert, the author of sg_simple4.c, confirm that
this is the case. A cutdown version of his response is as follows:
```
I would consider a SCSI INQUIRY example using the Linux sg
driver interface (also written by me) as the equivalent of an
"hello world" program in C.
```
The database was created with the help of the freenode and ZFSOnLinux
communities.
Some notes:
1. The following drives both were confirmed to lie via reports in IRC
and they contain capacity information in their identifiers:
INTEL SSDSA2M080
INTEL SSDSA2M160
M4-CT256M4SSD2
WDC WD15EARS-00S
WDC WD15EARS-00Z
WDC WD20EARS-00M
The identifiers for different capacity models were extrapolated and
added under the assumption that those models also lie. Google was used
to verify that the extrapolated drive identifiers existed prior to their
inclusion.
2. The OCZ-VERTEX2 3.5 identifer applies to two drives that differ
solely in page size (and slightly in capacity). One uses 4096-byte pages
and the other uses 8192-byte pages. Both are set to use 8192-byte pages.
We could detect the page size by checking the capacity, but that would
unnecessarily complicate the code.
3. It is possible for updated drive firmware to correctly report the
sector size. There were reports of a few advanced format drives doing
that. One report stated that the vendor changed the identification
string while another was unclear on this. Both reports involved WDC
models.
4. Google was used to determine the size of pages in the listed flash
devices. Reports of 8192-byte pages took precedence over reports of
4096-byte pages.
5. Devices behind USB adapters can have their identification strings
altered. Identification strings obtained across USB adapters are
omitted and no attempt is made to correct for alterations made by USB
adapters when doing comparisons against the database. Two entries in the
Open Solaris database that appear to have been altered by a USB
adapter were omitted.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1652
For the same reasons it's used in libzfs_init(), this was just
overlooked because zinject gets minimal use.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1498
3098 zfs userspace/groupspace fail without saying why when run as non-root
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/3098illumos/illumos-gate@70f56fa693
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1596
Due to an uninitialized variable it was possible for the command
'zpool list -H' to return a non-zero error when there are no pools.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1605
The FreeBSD implementation of zfs adds the 'zpool labelclear'
command. Since this functionality is helpful and straight
forward to add it is being included in ZoL.
References:
freebsd/freebsd@119a041dc9
Ported-by: Dmitry Khasanov <pik4ez@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1126
If the znode has SA xattrs, display them following the other
standard attributes. The format used is similar to that used
when listing the contents of a ZAP. It is as follows:
$ zdb -vvv <pool>/<dataset> <object>
...
SA xattrs: <size> bytes, <number> entries
<name1> = <value1>
<name2> = <value2>
...
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1581
Make sure that buffer is aligned to 512 bytes on linux so that
pread call combined with O_DIRECT does not return EINVAL.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1570
For "zpool events -f" flush stdout to ensure the last zevent
is always printed immediately.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1568
A mount failure was accidentally introduced by commit 0c1171d
which reworked the parse_dataset() function to read pool names
from devices. The error case where a label is read from the
device but the pool name/value pair doesn't exist was not
handled properly. In this case we should fall back to the
previous behavior.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1560
This silences a GCC 4.8.0 warning by fixing a programming error
caught by static analysis:
../../cmd/ztest/ztest.c: In function ‘ztest_vdev_aux_add_remove’:
../../cmd/ztest/ztest.c:2584:33: error: argument to ‘sizeof’
in ‘snprintf’ call is the same expression as the destination;
did you mean to provide an explicit length?
[-Werror=sizeof-pointer-memaccess]
(void) snprintf(path, sizeof (path), ztest_aux_template,
^
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1480
When using zdb with non-default SPA config file it is not convenient
to add -U <non-default-config-file-path> all the time. This commit
introduces support for setting/overriding SPA config location via
environment variable 'SPA_CONFIG_PATH'.
If -U flag is specified in the command line it will override any other
value as usual.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1545
To simplify integration with the xfstests test suite the
mount.zfs helper has been extended. When passed a block
device (/dev/sdX) to mount, instead of a pool/dataset,
the pool name will be read from any existing zfs label
and used. This allows you to mount the root dataset of
a zfs filesystem by specifing any of the member vdevs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
3552 condensing one space map burns 3 seconds of CPU in spa_sync() thread
3564 spa_sync() spends 5-10% of its time in metaslab_sync() (when not condensing)
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
illumos/illumos-gate@16a4a80742https://www.illumos.org/issues/3552https://www.illumos.org/issues/3564
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1513
3006 VERIFY[S,U,P] and ASSERT[S,U,P] frequently check if first
argument is zero
Reviewed by Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by George Wilson <george.wilson@delphix.com>
Approved by Eric Schrock <eric.schrock@delphix.com>
References:
illumos/illumos-gate@fb09f5aad4https://illumos.org/issues/3006
Requires:
zfsonlinux/spl@1c6d149feb
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1509
* Modified kstat_update() to read arcstats from proc.
* Fix shebang.
* Added Makefile.am entries for arcstat.py
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1506
When calling mount(), care must be taken to avoid passing in flags
that are used only by the user space utilities. Otherwise we may
stomp on flags that are reserved for other purposes in the kernel.
In particular, openSUSE 12.3 kernels have added a new MS_RICHACL
super-block flag whose value conflicts with our MS_COMMENT flag. This
causes incorrect behavior such as the umask being ignored. The
MS_COMMENT flag essentially serves as a placeholder in the option_map
data structure of zfs_mount.c, but its value is never used. Therefore
we can avoid the conflict by defining it to 0.
The MS_USERS, MS_OWNER, and MS_GROUP flags also conflict with reserved
flags in the kernel. While this is not known to have caused any
problems, it is nevertheless incorrect. For the purposes of the
mount.zfs helper, the "users", "owner", and "group" options just serve
as hints to set additional implied options. Therefore we now define
their associated mount flags in terms of the options that they imply
rather than giving them unique values.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1457
3306 zdb should be able to issue reads in parallel
3321 'zpool reopen' command should be documented in the man
page and help
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
illumos/illumos-gate@31d7e8fa33https://www.illumos.org/issues/3306https://www.illumos.org/issues/3321
The vdev_file.c implementation in this patch diverges significantly
from the upstream version. For consistenty with the vdev_disk.c
code the upstream version leverages the Illumos bio interfaces.
This makes sense for Illumos but not for ZoL for two reasons.
1) The vdev_disk.c code in ZoL has been rewritten to use the
Linux block device interfaces which differ significantly
from those in Illumos. Therefore, updating the vdev_file.c
to use the Illumos interfaces doesn't get you consistency
with vdev_disk.c.
2) Using the upstream patch as is would requiring implementing
compatibility code for those Solaris block device interfaces
in user and kernel space. That additional complexity could
lead to confusion and doesn't buy us anything.
For these reasons I've opted to simply move the existing vn_rdwr()
as is in to the taskq function. This has the advantage of being
low risk and easy to understand. Moving the vn_rdwr() function
in to its own taskq thread also neatly avoids the possibility of
a stack overflow.
Finally, because of the additional work which is being handled by
the free taskq the number of threads has been increased. The
thread count under Illumos defaults to 100 but was decreased to 2
in commit 08d08e due to contention. We increase it to 8 until
the contention can be address by porting Illumos #3581.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1354
The zfs_fd must be opened before calling print_all_handlers() or
the ioctl() cannot be used to the zfs control device. This brings
the zinject code back in sync with the Illumos implementation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
NOTES: This patch has been reworked from the original in the
following ways to accomidate Linux ZFS implementation
*) Usage of the cyclic interface was replaced by the delayed taskq
interface. This avoids the need to implement new compatibility
code and allows us to rely on the existing taskq implementation.
*) An extern for zfs_txg_synctime_ms was added to sys/dsl_pool.h
because declaring externs in source files as was done in the
original patch is just plain wrong.
*) Instead of panicing the system when the deadman triggers a
zevent describing the blocked vdev and the first pending I/O
is posted. If the panic behavior is desired Linux provides
other generic methods to panic the system when threads are
observed to hang.
*) For reference, to delay zios by 30 seconds for testing you can
use zinject as follows: 'zinject -d <vdev> -D30 <pool>'
References:
illumos/illumos-gate@283b84606bhttps://www.illumos.org/issues/3246
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1396
A few files still refer to @behlendorf's private fork on
github. Use the primary web site URL instead. Two typos
are also corrected.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The PaX team modified the kernel's modpost to report writeable function
pointers as section mismatches because they are potential exploit
targets. We could ignore the warnings, but their presence can obscure
actual issues. Proper const correctness can also catch programming
mistakes.
Building the kernel modules against a PaX/GrSecurity patched Linux 3.4.2
kernel reports 133 section mismatches prior to this patch. This patch
eliminates 130 of them. The quantity of writeable function pointers
eliminated by constifying each structure is as follows:
vdev_opts_t 52
zil_replay_func_t 24
zio_compress_info_t 24
zio_checksum_info_t 9
space_map_ops_t 7
arc_byteswap_func_t 5
The remaining 3 writeable function pointers cannot be addressed by this
patch. 2 of them are in zpl_fs_type. The kernel's sget function requires
that this be non-const. The final writeable function pointer is created
by SPL_SHRINKER_DECLARE. The kernel's set_shrinker() and
remove_shrinker() functions also require that this be non-const.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1300
The issue with hot spares in ZoL is because it opens all leaf
vdevs exclusively (O_EXCL). On Linux, exclusive opens cause
subsequent exclusive opens to fail with EBUSY.
This could be resolved by not opening any of the devices
exclusively, which is what Illumos does, but the additional
protection offered by exclusive opens is desirable. It cleanly
prevents you from accidentally adding an in-use non-ZFS device
to your pool.
To fix this we very slightly relaxed the usage of O_EXCL in
the following ways.
1) Functions which open the device but only read had the
O_EXCL flag removed and were updated to use O_RDONLY.
2) A common holder was added to the vdev disk code. This
allow the ZFS code to internally open the device multiple
times but non-ZFS callers may not.
3) An exception was added to make_disks() for hot spare when
creating partition tables. For hot spare devices which
are already opened exclusively we skip creating the partition
table because this must already have been done when the disk
was originally added as a hot spare.
Additional minor changes include fixing check_in_use() to use
a partition instead of a slice suffix. And is_spare() was moved
above make_disks() to avoid adding a forward reference.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#250
`zpool status -x` should only flag errors or where the pool is
unavailable. If it imported fine but isn't using the latest features
available in the code, that's not an error.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1319
This is a minor nit, but the second line of the 'action' message
when you need to upgrade your pool to support feature flags exceeds
the standard 80 character limit. Fix it by moving the word
'feature' on to the third line.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
In the interest of maintaining only one udev helper to give vdevs
user friendly names, the zpool_id and zpool_layout infrastructure
is being retired. They are superseded by vdev_id which incorporates
all the previous functionality.
Documentation for the new vdev_id(8) helper and its configuration
file, vdev_id.conf(5), can be found in their respective man pages.
Several useful example files are installed under /etc/zfs/.
/etc/zfs/vdev_id.conf.alias.example
/etc/zfs/vdev_id.conf.multipath.example
/etc/zfs/vdev_id.conf.sas_direct.example
/etc/zfs/vdev_id.conf.sas_switch.example
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#981
The vdev_id udev helper strictly requires configuration file keywords
to always be anchored at the beginning of the line and to be followed
by a space character. However, users may prefer to use indentation or
tab delimitation. Improve flexibility by simply requiring a keyword
to be the first field on the line.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1239
1337 `zpool status -D' should tell if there are no DDT entries
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Albert Lee <trisk@nexenta.com>
References:
illumos/illumos-gate@ce72e614c1
illumos changeset: 13432:d1ad8d106d64
https://www.illumos.org/issues/1337
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
A fsck helper to accomidate distributions that expect to be able
to execute a fsck on all filesystem types. Currently this script
does nothing but it could be extended to act as a compatibility
wrapper for 'zpool scrub'.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#964
Rather than just reporting the failure include the passed
mount point and error number.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1153
3349 zpool upgrade -V bumps the on disk version number, but leaves
the in core version
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
illumos/illumos-gate@25345e4666https://www.illumos.org/issues/3349
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2762 zpool command should have better support for feature flags
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>
References:
illumos/illumos-gate@57221772c3https://www.illumos.org/issues/2762
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
3090 vdev_reopen() during reguid causes vdev to be treated as corrupt
3102 vdev_uberblock_load() and vdev_validate() may read the wrong label
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>
References:
illumos/illumos-gate@dfbb943217
illumos changeset: 13777:b1e53580146d
https://www.illumos.org/issues/3090https://www.illumos.org/issues/3102
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#939
This reverts commit d135245791.
Since feature flags have now been merged we can apply the real
upstream fix from Illumos.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #997
2619 asynchronous destruction of ZFS file systems
2747 SPA versioning with zfs feature flags
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Dan Kruchinin <dan.kruchinin@gmail.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>
References:
illumos/illumos-gate@53089ab7c8illumos/illumos-gate@ad135b5d64
illumos changeset: 13700:2889e2596bd6
https://www.illumos.org/issues/2619https://www.illumos.org/issues/2747
NOTE: The grub specific changes were not ported. This change
must be made to the Linux grub packages.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
mountall in Debian depends on being able to pass the -f parameter to
mount, which specifies a fake mount and just updates the mtab. Currently
mount.zfs will fail such a request if it is not passed with -o zfsutil.
This patch allows a fake mount on a non-legacy filesystem to succeed in
the same manner as a -o remount does, thus enabling mountall to work
correctly.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1167
Add a vdev_id feature to map device names based on already defined
udev device links. To increase the odds that vdev_id will run after
the rules it depends on, increase the vdev.rules rule number from 60
to 69. With this change, vdev_id now provides functionality analogous
to zpool_id and zpool_layout, paving the way to retire those tools.
A defined alias takes precedence over a topology-derived name, but the
two naming methods can otherwise coexist. For example, one might name
drives in a JBOD with the sas_direct topology while naming an internal
L2ARC device with an alias.
For example, the following lines in vdev_id.conf will result in the
creation of links /dev/disk/by-vdev/{d1,d2}, each pointing to the same
target as the device link specified in the third field.
# by-vdev
# name fully qualified or base name of device link
alias d1 /dev/disk/by-id/wwn-0x5000c5002de3b9ca
alias d2 wwn-0x5000c5002def789e
Also perform some minor vdev_id cleanup, such as removal of the unused
-s command line option.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#981
Commit df83110856 missed update to
getopt() call, while delivering all the rest. This commit adds
"o" to getopt().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #566
While expanding positional parameters shell requires non-single
digits to be enclosed in braces. When the SAS topology is
non-trivial the number of positional parameters generated internally
by vdev_id script (using set -- ...) easily crosses single digit limit
and vdev_id fails to generate links.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1119
Full bash may not be available in all environments where udev helpers
run, such as in an initial ramdisk. To avoid breakage in this case,
remove use of bash-specific features such as variable arrays and the
`declare' keyword from the vdev_id script.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#870
Canonicalize the mount point passed to the mount.zfs helper.
This way a clean path is always added to mtab which ensures
the umount can properly locate and remove the entry.
Test case:
$ mkdir /mnt/foo
$ mount -t zfs zpool/foo /mnt/../mnt/foo////
$ umount /mnt/foo
$ cat /etc/mtab | grep zpool/foo
zpool/foo /mnt/../mnt/foo//// zfs rw 0 0
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#573
When adding devices to an existing pool "ashift" property is
auto-detected. However, if this property was overridden at
the pool creation time (i.e. zpool create -o ashift=12 tank ...)
this may not be what the user wants. This commit lets the user
specify the value of "ashift" property to be used with newly
added drives. For example,
zpool add -o ashift=12 tank disk1
zpool attach -o ashift=12 tank disk1 disk2
Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#566
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Richard Lowe <richlowe@richlowe.net>
Refererces to Illumos issue:
https://www.illumos.org/issues/2671
This patch has been slightly modified from the upstream Illumos
version. In the upstream implementation a warning message is
logged to the console. To prevent pointless console noise this
notification is now posted as a "ereport.fs.zfs.vdev.bad_ashift"
event.
The event indicates a non-optimial (but entirely safe) ashift
value was used to create the pool. Depending on your workload
this may impact pool performance. Unfortunately, the only way
to correct the issue is to recreate the pool with a new ashift.
NOTE: The unrelated fix to the comment in zpool_main.c appears
in the upstream commit and was preserved for consistnecy.
Ported-by: Cyril Plisko <cyril.plisko@mountall.com>
Reworked-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#955
The interface for the ddt_zap_count() function assumes it can
never fail. However, internally ddt_zap_count() is implemented
with zap_count() which can potentially fail. Now because there
was no way to return the error to the caller a VERIFY was used
to ensure this case never happens.
Unfortunately, it has been observed that pools can be damaged in
such a way that zap_count() fails. The result is that the pool can
not be imported without hitting the VERIFY and crashing the system.
This patch reworks ddt_object_count() so the error can be safely
caught and returned to the caller. This allows a pool which has
be damaged in this way to be safely rewound for import.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#910
The 'zpool replace' command would fail when given a short name
because unlike on other platforms the short name cannot be
deterministically expanded to a single path. Multiple path
prefixes must be checked and in addition the partition suffix
for whole disks is determined by the prefix.
To handle this complexity a zfs_strcmp_pathname() function was
added which takes either a short or fully qualified device name.
Short names will be expanded using the prefixes in the default
import search path, or the ZPOOL_IMPORT_PATH environment variable
if it's defined. All posible expansions are then compared against
the comparison path. Care is taken to strip redundant slashes to
ensure legitimate matches are not missed.
In the context of this work the existing zfs_resolve_shortname()
function was extended to consider the ZPOOL_IMPORT_PATH when set.
The zfs_append_partition() interface was also simplified to take
only a single buffer.
The vast majority of these changes rework existing Linux specific
code which was originally written to accomidate udev. However,
there is some minimal cleanup which removes Illumos specific code.
This was done to improve readability but the basic flow and intent
of the upstream code was maintained.
These changes are the logical conclusion of the previos work to
adjust the 'zpool import' search behavior, see commit 44867b6a.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#544Closes#976
The ztest deadman timer has been causing false positives in the
testing VMs. To make it easier to spot possible regressions
I'm disabling this timer. The buildbot test infrastructure
will still mark ztest instances which take to long to complete
as failures.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1018
The realpath(3) function expects that when a buffer is passed
for the 'resolved_path' that it be at least PATH_MAX in length.
If it's not a buffer overflow may occur.
Therefore the passed buffer size is changed from MAXNAMELEN to
MAXPATHLEN. We also take this opertunity to dynamically allocate
the buffer to keep it off the stack.
warning: call to '__realpath_chk_warn' declared with attribute
warning: second argument of realpath must be either NULL or at
least PATH_MAX bytes long buffer [enabled by default]
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Under Linux the following functions are flagged with the
attribute warn_unused_result, this triggers a warning when
ever they are used without checking the return value.
To handle this case we check the result VERIFY(). It's
better to detect this immediately on failure rather than
segfault farther down in the function.
../../cmd/ztest/ztest.c:6033:2: warning:
ignoring return value of 'asprintf', declared with
attribute warn_unused_result [-Wunused-result]
../../cmd/ztest/ztest.c:739:3: warning:
ignoring return value of 'realpath', declared with
attribute warn_unused_result [-Wunused-result]
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The use of tempnam() is racy and it should be avoided in favor of
mkstemp(). According to the Linux tempnam(3) man page.
"Although tempnam() generates names that are difficult to guess,
it is nevertheless possible that between the time that tempnam()
returns a pathname, and the time that the program opens it, another
program might create that pathname using open(2), or create it as
a symbolic link. This can lead to security holes. To avoid such
possibilities, use the open(2) O_EXCL flag to open the pathname.
Or better yet, use mkstemp(3) or tmpfile(3)."
This issue was flagged by gcc.
ztest.o: In function `setup_data_fd': cmd/ztest/ztest.c:5822:
warning: the use of `tempnam' is dangerous, better use `mkstemp'
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
To ensure ztest behaves as similarly as possible to the kernel
implementation of ZFS we attempt to honor the kernel stack limits.
This includes keeping the individual stack frame sizes under 1K
in size. We currently use gcc to detect and enforce this limit.
Therefore to get this building cleanly with full debugging enabled
the stack usage in the following functions has been reduced by
moving the buffer to the heap.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Currently, ztest expects to get 3 and 4 as the file descriptors for
data and random files, respectively. This is quite fragile and breaks
easily if ztest is run with these file descriptors already opened
(e.g. in a complex shell script).
This patch fixes the issue by removing the assumptions on the file
descriptor numbers that open() returns.
For the random file (/dev/urandom), the new code doesn't rely on a
shared file descriptor; instead, it reopens the file in the child.
For the data file, the new code writes the file descriptor number into
a "ZTEST_FD_DATA" environment variable so that it can be recovered
after the execv() call.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
illumos/illumos-gate@ad135b5d64
Illumos changeset: 13700:2889e2596bd6
Note that this is only a partial port of the aforementioned Illumos
changeset.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Dan Kruchinin <dan.kruchinin@gmail.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>
Ported to zfsonlinux by: Etienne Dechamps <etienne.dechamps@ovh.net>
Currently, ztest fails with the following error:
error: Pool 'ztest' has encountered an uncorrectable I/O failure
and the failure mode property for this pool is set to panic.
We know how to fix it (see issue #939), but it may take some time
before we get around to merging the fix, which has some heavy
dependencies.
In the mean time, it is not ideal to be unable to use ztest just
because of a small isolated issue, so this patch works around the
problem by disabling the reguid test. This is just a temporary hack to
keep ztest usable.
The reguid test will be enabled again when the proper fix is merged.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#997
Currently, in several instances (but not all), ztest generates vdev
file paths using a statement similar to this:
snprintf(path, sizeof (path), ztest_dev_template, ...);
This worked fine until 40b84e7aec, which
changed path to be a pointer to the heap instead of an array allocated
on the stack. Before this change, sizeof(path) would return the size of
the array; now, it returns the size of the pointer instead.
As a result, the aforementioned sprintf statement uses the wrong size
and truncates the vdev file path to the first 4 or 8 bytes (depending
on the architecture). Typically, with default settings, the file path
will become "/tmp/zt" instead of "/test/ztest.XXX".
This issue only exists in ztest_vdev_attach_detach() and
ztest_fault_inject(), which explains why ztest doesn't fail right away.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #989
Currently, thread_create(), when called in userspace, creates a
joinable (i.e. not detached thread). This is the pthread default.
Unfortunately, this does not reproduce kthreads behavior (kthreads
are always detached). In addition, this contradicts the original
Solaris code which creates userspace threads in detached mode.
These joinable threads are never joined, which leads to a leakage of
pthread thread objects ("zombie threads"). This in turn results in
excessive ressource consumption, and possible ressource exhaustion in
extreme cases (e.g. long ztest runs).
This patch fixes the issue by creating userspace threads in detached
mode. The only exception is ztest worker threads which are meant to be
joinable.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #989
The goal of this change is to make 'zpool import' prefer to use
the peristent /dev/mapper or /dev/disk/by-* paths. These are far
preferable to the devices in /dev/ whos names are not persistent
and are determined by the order in which a device is detected.
This patch improves things by changing the default search path from
just to the top level /dev/ directory to (in order):
/dev/disk/by-vdev - Custom rules, use first if they exist
/dev/disk/zpool - Custom rules, use first if they exist
/dev/mapper - Use multipath devices before components
/dev/disk/by-uuid - Single unique entry and persistent
/dev/disk/by-id - May be multiple entries and persistent
/dev/disk/by-path - Encodes physical location and persistent
/dev/disk/by-label - Custom persistent labels
/dev - UNSAFE device names will change
The default search path can be overriden by setting the
ZPOOL_IMPORT_PATH environment variable. This must be a colon
delimited list of paths which are searched for vdevs. If the
'zpool import -d' option is specified only those listed paths
will be searched.
Finally, when multiple paths to the same device are found. If one
of the paths is an exact match for the path used last time to import
the pool it will be used. When there are no exact matches the
prefered path will be determined by the provided search order.
This means you can still import a pool and force specific names by
providing the -d <path> option. And the prefered names will persist
as long as those paths exist on your system.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#965
Without this fix the zdb printouts of ZIL data blocks look full of FF
due to printf() handling its arguments as int by default.
Here is the output before the fix
TX_WRITE len 4136, txg 1093817, seq 149231
foid 4242, offset 0, length f68
G FFFFFF8EFFFFFF87FFFFFF91FFFFFFCC 1c
FFFFFFAFFFFFFFC9FFFFFFBAZ FFFFFFC3
And the same after the fix
TX_WRITE len 4136, txg 1093817, seq 149231
foid 4242, offset 0, length f68
G 8E8791CC 1cAFC9BAZ C3
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#962
ztest outputs a message when testing sync=always no matter what the
verbosity level is. There is no point outputting this message for low
verbosity levels.
With this patch the message is only displayed at verbosity level 5 or
above. The result is less output pollution.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#951
Commit e6f290535c added libzpool to
the mount_zfs dependencies. This brought in the nvpair symbols
which are used by libzpool. To resolve this include the libnvpair
library for mount_zfs even though mount_zfs doesn't directly
require any of these symbols.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#926
mount_zfs depends on libzpool for zfs_prop_written since
330d06f90d. Unfortunately, the Makefile
for mount_zfs has not been modified to reflect this. As a result,
libtool doesn't know about the dependency, which may result in the wrong
libzpool being used during the build (e.g. the libzpool from the system
instead of the libzpool from the build directory).
This patch adds the dependency to fix the issue.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes#909.
Remove all of the generated autotools products from the repository
and update the .gitignore files accordingly.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#718
1796 "ZFS HOLD" should not be used when doing "ZFS SEND" from a read-only pool
2871 support for __ZFS_POOL_RESTRICT used by ZFS test suite
2903 zfs destroy -d does not work
2957 zfs destroy -R/r sometimes fails when removing defer-destroyed snapshot
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>
References:
https://www.illumos.org/issues/1796https://www.illumos.org/issues/2871https://www.illumos.org/issues/2903https://www.illumos.org/issues/2957
Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <George.Wilson@delphix.com>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/2635
Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#717
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/1693
Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#678
Currently, zvols have a discard granularity set to 0, which suggests to
the upper layer that discard requests of arbirarily small size and
alignment can be made efficiently.
In practice however, ZFS does not handle unaligned discard requests
efficiently: indeed, it is unable to free a part of a block. It will
write zeros to the specified range instead, which is both useless and
inefficient (see dnode_free_range).
With this patch, zvol block devices expose volblocksize as their discard
granularity, so the upper layer is aware that it's not supposed to send
discard requests smaller than volblocksize.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#862
1644 add ZFS "clones" property
1645 add ZFS "written" and "written@..." properties
1646 "zfs send" should estimate size of stream
1647 "zfs destroy" should determine space reclaimed by
destroying multiple snapshots
1708 adjust size of zpool history data
References:
https://www.illumos.org/issues/1644https://www.illumos.org/issues/1645https://www.illumos.org/issues/1646https://www.illumos.org/issues/1647https://www.illumos.org/issues/1708
This commit modifies the user to kernel space ioctl ABI. Extra
care should be taken when updating to ensure both the kernel
modules and utilities are updated. This change has reordered
all of the new ioctl()s to the end of the list. This should
help minimize this issue in the future.
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Albert Lee <trisk@opensolaris.org>
Approved by: Garrett D'Amore <garret@nexenta.com>
Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#826Closes#664
This is done for compatibility with existing Linux infrastructure.
In particular, when using zfs as a root filesystem there are init
scripts which as part of shutdown remount root read-only. Also,
the new systemd infrastructure being used by Fedora expects to be
able to remount a file system read-write.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #847
The end_writeback() function was changed by moving the call to
inode_sync_wait() earlier in to evict(). This effecitvely changes
the ordering of the sync but it does not impact the details of
the zfs implementation.
However, as part of this change end_writeback() was renamed to
clear_inode() to reflect the new semantics. This change does
impact us and clear_inode() now maps to end_writeback() for
kernels prior to 3.5.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#784
The vmtruncate_range() support has been removed from the kernel in
favor of using the fallocate method in the file_operations table.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #784
The export_operations member ->encode_fh() has been updated to
take both the child and parent inodes. This interface used to
take the child dentry and a bool describing if the parent is needed.
NOTE: While updating this code I noticed that we do not currently
cleanly handle the case where we're passed a connectable parent.
This code should be audited to make sure we're doing the right thing.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #784
Currently, zpool online -e (dynamic vdev expansion) doesn't work on
whole disks because we're invoking ioctl(BLKRRPART) from userspace
while ZFS still has a partition open on the disk, which results in
EBUSY.
This patch moves the BLKRRPART invocation from the zpool utility to the
module. Specifically, this is done just before opening the device in
vdev_disk_open() which is called inside vdev_reopen(). This requires
jumping through some hoops to get to the disk device from the partition
device, and to make sure we can still open the partition after the
BLKRRPART call.
Note that this new code path is triggered on dynamic vdev expansion
only; other actions, like creating a new pool, are unchanged and still
call BLKRRPART from userspace.
This change also depends on API changes which are available in 2.6.37
and latter kernels. The build system has been updated to detect this,
but there is no compatibility mode for older kernels. This means that
online expansion will NOT be available in older kernels. However, it
will still be possible to expand the vdev offline.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#808
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Alexander Eremin <alexander.eremin@nexenta.com>
Reviewed by: Alexander Stetsenko <ams@nexenta.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/1748
This commit modifies the user to kernel space ioctl ABI. Extra
care should be taken when updating to ensure both the kernel
modules and utilities are updated. If only the user space
component is updated both the 'zpool events' command and the
'zpool reguid' command will not work until the kernel modules
are updated.
Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#665
FreeBSD #xxx: Dramatically optimize listing snapshots when user
requests only snapshot names and wants to sort them by name, ie.
when executes:
# zfs list -t snapshot -o name -s name
Because only name is needed we don't have to read all snapshot
properties.
Below you can find how long does it take to list 34509 snapshots
from a single disk pool before and after this change with cold and
warm cache:
before:
# time zfs list -t snapshot -o name -s name > /dev/null
cold cache: 525s
warm cache: 218s
after:
# time zfs list -t snapshot -o name -s name > /dev/null
cold cache: 1.7s
warm cache: 1.1s
NOTE: This patch only appears in FreeBSD. If/when Illumos picks up
the change we may want to drop this patch and adopt their version.
However, for now this addresses a real issue.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #450
This is not a proper fix. It is just a workaround for the stack
smashing detected by gcc in zvol_id. We simply disable the gcc
stack protector for now when building the zvol_id udev helper.
Once the root cause is resolved this patch should be reverted.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issues #569
When compiling ZFS with CFLAGS=-O0 it will trigger the following error.
Resolve the issue by properly including locale.h.
../../cmd/mount_zfs/mount_zfs.c: In function 'main':
../../cmd/mount_zfs/mount_zfs.c:318:2: warning: implicit declaration
of function 'setlocale' [-Wimplicit-function-declaration]
../../cmd/mount_zfs/mount_zfs.c:318:19: error: 'LC_ALL' undeclared
(first use in this function)
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#724
torvalds/linux@adc0e91ab1 introduced
introduced d_make_root() as a replacement for d_alloc_root(). Further
commits appear to have removed d_alloc_root() from the Linux source
tree. This causes the following failure:
error: implicit declaration of function 'd_alloc_root'
[-Werror=implicit-function-declaration]
To correct this we update the code to use the current d_make_root()
interface for readability. Then we introduce an autotools check
to determine if d_make_root() is available. If it isn't then we
define some compatibility logic which used the older d_alloc_root()
interface.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#776
vdev_id parses the file /etc/zfs/vdev_id.conf to map a physical path
in a storage topology to a channel name. The channel name is combined
with a disk enclosure slot number to create an alias that reflects the
physical location of the drive. This is particularly helpful when it
comes to tasks like replacing failed drives. Slot numbers may also be
re-mapped in case the default numbering is unsatisfactory. The drive
aliases will be created as symbolic links in /dev/disk/by-vdev.
The only currently supported topologies are sas_direct and sas_switch:
o sas_direct - a channel is uniquely identified by a PCI slot and a
HBA port
o sas_switch - a channel is uniquely identified by a SAS switch port
A multipath mode is supported in which dm-mpath devices are handled by
examining the first running component disk, as reported by 'multipath
-l'. In multipath mode the configuration file should contain a
channel definition with the same name for each path to a given
enclosure.
vdev_id can replace the existing zpool_id script on systems where the
storage topology conforms to sas_direct or sas_switch. The script
could be extended to support other topologies as well. The advantage
of vdev_id is that it is driven by a single static input file that can
be shared across multiple nodes having a common storage toplogy.
zpool_id, on the other hand, requires a unique /etc/zfs/zdev.conf per
node and a separate slot-mapping file. However, zpool_id provides the
flexibility of using any device names that show up in
/dev/disk/by-path, so it may still be needed on some systems.
vdev_id's functionality subsumes that of the sas_switch_id script, and
it is unlikely that anyone is using it, so sas_switch_id is removed.
Finally, /dev/disk/by-vdev is added to the list of directories that
'zpool import' will scan.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#713
The mode argument of iops->create()/mkdir()/mknod() was changed from
an 'int' to a 'umode_t'. To prevent a compiler warning an autoconf
check was added to detect the API change and then correctly set a
zpl_umode_t typedef. There is no functional change.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#701
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Joshua M. Clulow <josh@sysmgr.org>
Approved by: Richard Lowe <richlowe@richlowe.net>
Reference to Illumos issue:
https://www.illumos.org/issues/1946
Ported by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@nexenta.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>
Refererce to Illumos issue:
https://www.illumos.org/issues/952
Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#607
Add support for the `zfs list -t snap` alias which is available under
Oracle Solaris 11.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#640
For consistency, and because it's handy, add the 'zfs snap' alias which
was introduced by Oracle Solaris 11. This includes an update to the
man page to reflect all the available alias (snap, umount, and recv).
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#640
A cryptic error code is printed when mounting a legacy dataset to a
non-existent mountpoint. This patch changes this behavior to print
"mount point '%s' does not exist", which is similar to the error
message printed when mounting procfs.
The single quotes were added to be consistent with the existing EBUSY
error message, which is the only difference between this error message
and the one that is printed when the same condition occurs when mounting
procfs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#633
When stdout is detected to be a tty use the number of columns
specified by the terminal. If that fails fall back to a default
80 column width. In the non-tty case allow for 999 column lines.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Caught by lint, this permission change was accidentally introduced
by commit 42cb3819f1. Restore the
correct permissions and while I'm at it add a missing whack-bang
to config/ltmain.sh.
lint: executable-not-elf-or-script: zpool_main.c zfs_main.c
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#620
Allow rigorous (and expensive) tx validation to be enabled/disabled
indepentantly from the standard zfs debugging. When enabled these
checks ensure that all txs are constructed properly and that a dbuf
is never dirtied without taking the correct tx hold.
This checking is particularly helpful when adding new dmu consumers
like Lustre. However, for established consumers such as the zpl
with no known outstanding tx construction problems this is just
overhead.
--enable-debug-dmu-tx - Enable/disable validation of each tx as
--disable-debug-dmu-tx it is constructed. By default validation
is disabled due to performance concerns.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add support for the .zfs control directory. This was accomplished
by leveraging as much of the existing ZFS infrastructure as posible
and updating it for Linux as required. The bulk of the core
functionality is now all there with the following limitations.
*) The .zfs/snapshot directory automount support requires a 2.6.37
or newer kernel. The exception is RHEL6.2 which has backported
the d_automount patches.
*) Creating/destroying/renaming snapshots with mkdir/rmdir/mv
in the .zfs/snapshot directory works as expected. However,
this functionality is only available to root until zfs
delegations are finished.
* mkdir - create a snapshot
* rmdir - destroy a snapshot
* mv - rename a snapshot
The following issues are known defeciences, but we expect them to
be addressed by future commits.
*) Add automount support for kernels older the 2.6.37. This should
be possible using follow_link() which is what Linux did before.
*) Accessing the .zfs/snapshot directory via NFS is not yet possible.
The majority of the ground work for this is complete. However,
finishing this work will require resolving some lingering
integration issues with the Linux NFS kernel server.
*) The .zfs/shares directory exists but no futher smb functionality
has yet been implemented.
Contributions-by: Rohan Puri <rohan.puri15@gmail.com>
Contributiobs-by: Andrew Barnes <barnes333@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#173
The 'zfs list' and 'zpool list' commands output the message
'no datasets/pools available' to stdout. This should go to
stderr and only the available datasets/pools should go to
stdout. Returning nothing to stdout is expected behavior
when there is nothing to list.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#581
Allow a source rpm to be rebuilt with debugging enabled. This
avoids the need to have to manually modify the spec file. By
default debugging is still largely disabled. To enable specific
debugging features use the following options with rpmbuild.
'--with debug' - Enables ASSERTs
# For example:
$ rpmbuild --rebuild --with debug zfs-modules-0.6.0-rc6.src.rpm
Additionally, ZFS_CONFIG has been added to zfs_config.h for
packages which build against these headers. This is critical
to ensure both zfs and the dependant package are using the same
prototype and structure definitions.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When creating a new pool, make_root_vdev() calls check_in_use() to
ensure that none of the consituent disks are in use. If the disk
contains a valid vdev label it is read to retrieve the list of its
child vdevs and these are checked recursively. However, the
partitions stored in the vdev label my no longer exist, for example
if the partition table has since been altered. In any such case we
would want the pool creation to proceed, so this change removes the
check from check_slice() that returns an error if the device doesn't
exist. As an added assurance, the Solaris implementation also
returns sucess on ENOENT.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
DISCARD (REQ_DISCARD, BLKDISCARD) is useful for thin provisioning.
It allows ZVOL clients to discard (unmap, trim) block ranges from
a ZVOL, thus optimizing disk space usage by allowing a ZVOL to
shrink instead of just grow.
We can't use zfs_space() or zfs_freesp() here, since these functions
only work on regular files, not volumes. Fortunately we can use the
low-level function dmu_free_long_range() which does exactly what we
want.
Currently the discard operation is not added to the log. That's not
a big deal since losing discard requests cannot result in data
corruption. It would however result in disk space usage higher than
it should be. Thus adding log support to zvol_discard() is probably
a good idea for a future improvement.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Currently only the (FALLOC_FL_PUNCH_HOLE) flag combination is
supported, since it's the only one that matches the behavior of
zfs_space(). This makes it pretty much useless in its current
form, but it's a start.
To support other flag combinations we would need to modify
zfs_space() to make it more flexible, or emulate the desired
functionality in zpl_fallocate().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #334
The Linux block device queue subsystem exposes a number of configurable
settings described in Linux block/blk-settings.c. The defaults for these
settings are tuned for hard drives, and are not optimized for ZVOLs. Proper
configuration of these options would allow upper layers (I/O scheduler) to
take better decisions about write merging and ordering.
Detailed rationale:
- max_hw_sectors is set to unlimited (UINT_MAX). zvol_write() is able to
handle writes of any size, so there's no reason to impose a limit. Let the
upper layer decide.
- max_segments and max_segment_size are set to unlimited. zvol_write() will
copy the requests' contents into a dbuf anyway, so the number and size of
the segments are irrelevant. Let the upper layer decide.
- physical_block_size and io_opt are set to the ZVOL's block size. This
has the potential to somewhat alleviate issue #361 for ZVOLs, by warning
the upper layers that writes smaller than the volume's block size will be
slow.
- The NONROT flag is set to indicate this isn't a rotational device.
Although the backing zpool might be composed of rotational devices, the
resulting ZVOL often doesn't exhibit the same behavior due to the COW
mechanisms used by ZFS. Setting this flag will prevent upper layers from
making useless decisions (such as reordering writes) based on incorrect
assumptions about the behavior of the ZVOL.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
zvol_write() assumes that the write request must be written to stable storage
if rq_is_sync() is true. Unfortunately, this assumption is incorrect. Indeed,
"sync" does *not* mean what we think it means in the context of the Linux
block layer. This is well explained in linux/fs.h:
WRITE: A normal async write. Device will be plugged.
WRITE_SYNC: Synchronous write. Identical to WRITE, but passes down
the hint that someone will be waiting on this IO
shortly.
WRITE_FLUSH: Like WRITE_SYNC but with preceding cache flush.
WRITE_FUA: Like WRITE_SYNC but data is guaranteed to be on
non-volatile media on completion.
In other words, SYNC does not *mean* that the write must be on stable storage
on completion. It just means that someone is waiting on us to complete the
write request. Thus triggering a ZIL commit for each SYNC write request on a
ZVOL is unnecessary and harmful for performance. To make matters worse, ZVOL
users have no way to express that they actually want data to be written to
stable storage, which means the ZIL is broken for ZVOLs.
The request for stable storage is expressed by the FUA flag, so we must
commit the ZIL after the write if the FUA flag is set. In addition, we must
commit the ZIL before the write if the FLUSH flag is set.
Also, we must inform the block layer that we actually support FLUSH and FUA.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The second argument of sops->show_options() was changed from a
'struct vfsmount *' to a 'struct dentry *'. Add an autoconf check
to detect the API change and then conditionally define the expected
interface. In either case we are only interested in the zfs_sb_t.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#549
These libraries, which are an artifact of the ZoL development
process, conflict with packages that are already in distribution:
* libspl: SPL Programming Language
* libavl: AVL for Linux
* libefi: GRUB
And these libraries are potential conflicts:
* libshare: the Linux Mount Manager
* libunicode: Perl and Python
Recompose these five ZoL components into the four libraries that are
conventionally provided by Solaris and FreeBSD systems:
+ libnvpair
+ libuutil
+ libzpool
+ libzfs
This change resolves the name conflict, makes ZoL more compatible
with existing software that uses autotools to detect ZFS, and allows
pkg-zfs to better reflect the official Debian kFreeBSD packaging.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #430
Linux supports mounting over non-empty directories by default.
In Solaris this is not the case and -O option is required for
zfs mount to mount a zfs filesystem over a non-empty directory.
For compatibility, I've added support for -O option to mount
zfs filesystems over non-empty directories if the user wants
to, just like in Solaris.
I've defined MS_OVERLAY to record it in the flags variable if
the -O option is supplied. The flags variable passes through
a few functions and its checked before performing the empty
directory check in zfs_mount function. If -O is given, the
check is not performed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#473
Some implementations of `awk` incorrectly parse the \< and \> regex
symbols, so use a `while read` loop and regular globbing instead.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #259
The Linux 3.1 kernel has introduced the concept of per-filesystem
shrinkers which are directly assoicated with a super block. Prior
to this change there was one shared global shrinker.
The zfs code relied on being able to call the global shrinker when
the arc_meta_limit was exceeded. This would cause the VFS to drop
references on a fraction of the dentries in the dcache. The ARC
could then safely reclaim the memory used by these entries and
honor the arc_meta_limit. Unfortunately, when per-filesystem
shrinkers were added the old interfaces were made unavailable.
This change adds support to use the new per-filesystem shrinker
interface so we can continue to honor the arc_meta_limit. The
major benefit of the new interface is that we can now target
only the zfs filesystem for dentry and inode pruning. Thus we
can minimize any impact on the caching of other filesystems.
In the context of making this change several other important
issues related to managing the ARC were addressed, they include:
* The dnlc_reduce_cache() function which was called by the ARC
to drop dentries for the Posix layer was replaced with a generic
zfs_prune_t callback. The ZPL layer now registers a callback to
drop these dentries removing a layering violation which dates
back to the Solaris code. This callback can also be used by
other ARC consumers such as Lustre.
arc_add_prune_callback()
arc_remove_prune_callback()
* The arc_reduce_dnlc_percent module option has been changed to
arc_meta_prune for clarity. The dnlc functions are specific to
Solaris's VFS and have already been largely eliminated already.
The replacement tunable now represents the number of bytes the
prune callback will request when invoked.
* Less aggressively invoke the prune callback. We used to call
this whenever we exceeded the arc_meta_limit however that's not
strictly correct since it results in over zeleous reclaim of
dentries and inodes. It is now only called once the arc_meta_limit
is exceeded and every effort has been made to evict other data from
the ARC cache.
* More promptly manage exceeding the arc_meta_limit. When reading
meta data in to the cache if a buffer was unable to be recycled
notify the arc_reclaim thread to invoke the required prune.
* Added arcstat_prune kstat which is incremented when the ARC
is forced to request that a consumer prune its cache. Remember
this will only occur when the ARC has no other choice. If it
can evict buffers safely without invoking the prune callback
it will.
* This change is also expected to resolve the unexpect collapses
of the ARC cache. This would occur because when exceeded just the
arc_meta_limit reclaim presure would be excerted on the arc_c
value via arc_shrink(). This effectively shrunk the entire cache
when really we just needed to reclaim meta data.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#466Closes#292
Regenerating the autotools configuration on Debian and Ubuntu systems
causes compilation to fail with this error message:
cmd/mount_zfs/../../cmd/mount_zfs/mount_zfs.c:403:
undefined reference to `is_selinux_enabled'
In the automake template, set "mount_zfs_LDFLAGS = ... $(LIBSELINUX)"
so that the /sbin/mount.zfs utility is linked to libselinux.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Directly changing inode->i_nlink is deprecated in Linux 3.2 by commit
SHA: bfe8684869601dacfcb2cd69ef8cfd9045f62170
Use the new set_nlink() kernel function instead.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #462
Added the necessary build infrastructure for building packages
compatible with the Arch Linux distribution. As such, one can now run:
$ ./configure
$ make pkg # Alternatively, one can run 'make arch' as well
on the Arch Linux machine to create two binary packages compatible with
the pacman package manager, one for the zfs userland utilities and
another for the zfs kernel modules. The new packages can then be
installed by running:
# pacman -U $package.pkg.tar.xz
In addition, source-only packages suitable for an Arch Linux chroot
environment or remote builder can also be build using the 'sarch' make
rule.
NOTE: Since the source dist tarball is created on the fly from the head
of the build tree, it's MD5 hash signature will be continually influx.
As a result, the md5sum variable was intentionally omitted from the
PKGBUILD files, and the '--skipinteg' makepkg option is used. This may
or may not have any serious security implications, as the source tarball
is not being downloaded from an outside source.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#491
The zpool_id script is posixly correct and does not use bash
features, so change its whackbang from /bin/bash to /bin/sh.
Debian policy also stipulates that system scripts be dash compatible.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Direct invocation of GNU egrep is deprecated by its man page, and the
its argument in the zpool_id script is not an extended expression.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
For consistency and safety, quote all variables in the zpool_id
script. This accomodates a `-c CONFIG` parameter value with
whitespace in the path name.
Also fix a typo in the usage synopsis for `-h`.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #439
The /lib/udev/path_id helper became a builtin command in the udev 174
release, so test whether path_id is external in the zpool_id script.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #429
Update the code to use the bdi_setup_and_register() helper to
simplify the bdi integration code. The updated code now just
registers the bdi during mount and destroys it during unmount.
The only complication is that for 2.6.32 - 2.6.33 kernels the
helper wasn't available so in these cases the zfs code must
provide it. Luckily the bdi_setup_and_register() function
is trivial.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#367
ZFS contains error messages that point to the defunct www.sun.com
domain, which is currently offline. Change these error messages
to use the zfsonlinux.org mirror instead.
This commit depends on:
zfsonlinux/zfsonlinux.github.com@8e10ead3dc
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Export all symbols already marked extern in the zfs_vfsops.h
header. Several non-static symbols have also been added to
the header and exportewd. This allows external modules to
more easily create and manipulate properly created ZFS
filesystem type datasets.
Rename zfsvfs_teardown() to zfs_sb_teardown and export it.
This is done simply for consistency with the rest of the code
base. All other zfsvfs_* functions have already been renamed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When compiling under Debian Lenny with gcc version 4.3.2
(Debian 4.3.2-1.1) the following warning occurs. To quiet
the warning initialize 'error' to zero. Newer versions of
gcc correctly determine that this uninitialized varible is
impossible because ZFS_NUM_USERQUOTA_PROPS is known to be
greater than zero.
cmd/zfs/zfs_main.c:2377: warning: "error" may be
used uninitialized in this function
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This warning was accidentally introduced by commit
b7936d5c2337bc976ac831c1c38de563844c36b. The fix is to
simply initialize the variable to ZFS_DELEG_WHO_UNKNOWN.
cmd/zfs/zfs_main.c:4460:25: warning: 'who_type' may be
used uninitialized in this function
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
These warnings were accidentally introduced by commit
b7936d5c2337bc976ac831c1c38de563844c36b. The fix is to
simply add the missing format specifier.
cmd/zfs/zfs_main.c:4565: warning: format not a string
literal and no format arguments
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Completely disable the zfs binary from attempting to directly update
/etc/mtab. The Linux port relies entirely on the mount.zfs helper
to safely update /etc/mtab. If we left the /etc/mtab updates to
the zfs binary then they could race with concurrent non-zfs mounts.
Routing everything through the system mount command ensures the
/etc/mtab updates are locked properly.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #329
This change moves the default install location for the zfs udev
rules from /etc/udev/ to /lib/udev/. The correct convention is
for rules provided by a package to be installed in /lib/udev/.
The /etc/udev/ directory is reserved for custom rules or local
overrides.
Additionally, this patch cleans up some abuse of the bindir install
location by adding a udevdir and udevruledir install directories.
This allows us to revert to the default bin install location. The
udev install directories can be set with the following new options.
--with-udevdir=DIR install udev helpers [EPREFIX/lib/udev]
--with-udevruledir=DIR install udev rules [UDEVDIR/rules.d]
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#356
For a long time now the kernel has been moving away from using the
pdflush daemon to write 'old' dirty pages to disk. The primary reason
for this is because the pdflush daemon is single threaded and can be
a limiting factor for performance. Since pdflush sequentially walks
the dirty inode list for each super block any delay in processing can
slow down dirty page writeback for all filesystems.
The replacement for pdflush is called bdi (backing device info). The
bdi system involves creating a per-filesystem control structure each
with its own private sets of queues to manage writeback. The advantage
is greater parallelism which improves performance and prevents a single
filesystem from slowing writeback to the others.
For a long time both systems co-existed in the kernel so it wasn't
strictly required to implement the bdi scheme. However, as of
Linux 2.6.36 kernels the pdflush functionality has been retired.
Since ZFS already bypasses the page cache for most I/O this is only
an issue for mmap(2) writes which must go through the page cache.
Even then adding this missing support for newer kernels was overlooked
because there are other mechanisms which can trigger writeback.
However, there is one critical case where not implementing the bdi
functionality can cause problems. If an application handles a page
fault it can enter the balance_dirty_pages() callpath. This will
result in the application hanging until the number of dirty pages in
the system drops below the dirty ratio.
Without a registered backing_device_info for the filesystem the
dirty pages will not get written out. Thus the application will hang.
As mentioned above this was less of an issue with older kernels because
pdflush would eventually write out the dirty pages.
This change adds a backing_device_info structure to the zfs_sb_t
which is already allocated per-super block. It is then registered
when the filesystem mounted and unregistered on unmount. It will
not be registered for mounted snapshots which are read-only. This
change will result in flush-<pool> thread being dynamically created
and destroyed per-mounted filesystem for writeback.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#174
Moving the zil_free() cleanup to zil_close() prevents this
problem from occurring in the first place. There is a very
good description of the issue and fix in Illumus #883.
Reviewed by: Matt Ahrens <Matt.Ahrens@delphix.com>
Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com>
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Garrett D'Amore <garrett@nexenta.com>
Reivewed by: Dan McDonald <danmcd@nexenta.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References to Illumos issue and patch:
- https://www.illumos.org/issues/883
- https://github.com/illumos/illumos-gate/commit/c9ba2a43cb
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
Today zfs tries to allocate blocks evenly across all devices.
This means when devices are imbalanced zfs will use lots of
CPU searching for space on devices which tend to be pretty
full. It should instead fail quickly on the full LUNs and
move onto devices which have more availability.
Reviewed by: Eric Schrock <Eric.Schrock@delphix.com>
Reviewed by: Matt Ahrens <Matt.Ahrens@delphix.com>
Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com>
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Gordon Ross <gwr@nexenta.com>
Approved by: Garrett D'Amore <garrett@nexenta.com>
References to Illumos issue and patch:
- https://www.illumos.org/issues/510
- https://github.com/illumos/illumos-gate/commit/5ead3ed965
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
The 'zfs get' command should be able to deal with mountpoint
as an argument. It already works with 'zfs list' command:
# zfs list /export/home/estibi
NAME USED AVAIL REFER MOUNTPOINT
rpool/export/home/estibi 1.14G 3.86G 1.14G /export/home/estibi
but it fails with 'zfs get':
# zfs get all /export/home/estibi
cannot open '/export/home/estibi': invalid dataset name
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Deano <deano@rattie.demon.co.uk>
Reviewed by: Garrett D'Amore <garrett@nexenta.com>
Approved by: Garrett D'Amore <garrett@nexenta.com>
References to Illumos issue and patch:
- https://www.illumos.org/issues/510
- https://github.com/illumos/illumos-gate/commit/5ead3ed965
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
Unlike most other Linux distributions archlinux installs its
init scripts in /etc/rc.d insead of /etc/init.d. This commit
provides an archlinux rc.d script for zfs and extends the
build infrastructure to ensure it get's installed in the
correct place.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#322
Unfortunately, ztest is hard coded to export the zdb utility to
be installed in a certain location. When the packaging was updated
to install zdb in /sbin/ ztest was broken. To fix this I'm updating
ztest to check both common install paths.
The zpool sub-commands like iostat, list, and status should
display consistent message when a given pool is unavailable or
no pool is present. This change unifies the default behavior
as follows:
root@prasad:~# ./zpool list 1 2
no pools available
no pools available
root@prasad:~# ./zpool iostat 1 2
no pools available
no pools available
root@prasad:~# ./zpool status 1 2
no pools available
no pools available
root@prasad:~# ./zpool list tan 1 2
cannot open 'tan': no such pool
root@prasad:~# ./zpool iostat tan 1 2
cannot open 'tan': no such pool
root@prasad:~# ./zpool status tan 1 2
cannot open 'tan': no such pool
Reported-by: Rajshree Thorat <rthorat@stec-inc.com>
Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#306