It's possible for an unrelated process, like blkid, to have the
volume open when 'zfs destroy' is run. Switch the cleanup function
to the destroy_dataset() helper which handles this case by retrying
the destroy when the dataset is busy.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7750
The RT rwsem implementation was changed to allow multiple readers
as of the 4.9.20-rt16 patch set. This results in a build failure
because the existing implementation was forced to directly access
the rwsem structure which has changed.
While this could be accommodated by adding additional compatibility
code. This patch resolves the build issue by simply assuming the
rwsem can never be upgraded. This functionality is a performance
optimization and all callers must already handle this case.
Converting the last remaining use of __SPIN_LOCK_UNLOCKED to
spin_lock_init() was additionally required to get a clean build.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7589
Systemd binaries necessary for mounting an encrypted root dataset
weren't copied to initramfs generated by dracut. This patch fixes
this and copies these binaries unconditionally, that is
regardless of whether native ZFS encryption is used for the
root dataset.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Diamantopoulos <georgediam@gmail.com>
Closes#7607Closes#7719
Porting notes:
* As of grub-2.02 these checksums are not supported. However, as
pointed out in #6501 there are alternatives such as EFISTUB which
work and have no such restriction. A warning was added to the
checksum property section of the zfs.8 man page.
Authored by: Toomas Soome <tsoome@me.com>
Reviewed by: C Fraire <cfraire@me.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://illumos.org/issues/8906
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7dec52fCloses#6501Closes#7714
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Albert Lee <trisk@forkgnu.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: George Melikov <mail@gmelikov.ru>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Updates to indirect blocks of spacemaps can contribute significantly to
write inflation. Therefore we want to reduce the indirect block size of
spacemaps from 128K to 16K.
Porting notes:
* Refactored to allow the dmu_object_alloc(), dmu_object_alloc_ibs()
and dmu_object_alloc_dnsize() functions to use a common shared
dmu_object_alloc_impl() function.
OpenZFS-issue: https://www.illumos.org/issues/9442
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0c2e6408bCloses#7712
The reservation_008_pos test case has been observed to fail in
a non-dangerous way in approximately 5% of automated test runs.
Add the test case to the list of possible expected failures
until the test case can be made perfectly reliable.
Reviewed by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #7741Closes#7742
It is helpful to tune zfs_per_txg_dirty_frees_percent for commit
539d33c7(OpenZFS 6569 - large file delete can starve out write ops).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Feng Sun <loyou85@gmail.com>
Closes#7718
A memory leak occurs on lines 209 and 213 because the config is not
freed in the error case. The interface to add_config() seems less than
ideal - it would be better if it copied any data necessary from the
config and the caller freed it.
Porting notes:
* This issue had already been resolved on Linux by adding the missing
calls to nvlist_free(). But we'll adopt the upstream fix to keep
the behavior of the code consistent.
Authored by: Sara Hartse <sara.hartse@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://illumos.org/issues/9457
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/be86bb8aCloses#7713
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
While investigating a different problem, I noticed that moved dnodes
(those processed by dnode_move_impl() via kmem_move()) have an incorrect
dn_next_type. This could cause the on-disk dn_type to be changed to an
invalid value. The fix to copy the dn_next_type in dnode_move_impl().
Porting notes:
* For the moment this potential issue cannot occur on Linux since
the SPL does not provide the kmem_move() functionality.
OpenZFS-issue: https://illumos.org/issues/9338
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0717e6f13Closes#7715
The arc_hdr_realloc_crypt() function is responsible for converting
a "full" arc header to an extended "crypt" header and visa versa.
This code was originally written with a bcopy() so that any new
members added to arc headers would automatically be included
without requiring a code change. However, in practice this (along
with small differences in kmem_cache implementations between
various platforms) has caused a number of hard-to-find problems in
ports to other operating systems. This patch solves this problem
by making all member copies explicit and adding ASSERTs for fields
that cannot be set during the transfer. It also manually resets the
old header after the reallocation is finished so it can be properly
reallocated and reused.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7711
We were doing count_block() twice inside this function, once
unconditionally at the beginning (intended to catch the embedded block
case) and once near the end after processing the block.
The double-accounting caused the "zpool scrub" progress statistics in
"zpool status" to climb from 0% to 200% instead of 0% to 100%, and
showed double the I/O rate it was actually seeing.
This was apparently a regression introduced in commit 00c405b4b5,
which was an incorrect port of this OpenZFS commit:
https://github.com/openzfs/openzfs/commit/d8a447a7
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Steven Noonan <steven@uplinklabs.net>
Closes#7720Closes#7738
It's often necessary to understand why a change is made, before
understanding the exact changes that are made. Context provides
background, which by definition is necessary to understand prior to the
substance of the Pull Request.
Change the PR template to request "Motivation and Context" first, before
"Description".
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#7737
While the autoexpand property may seem like a small feature it
depends on a significant amount of system infrastructure. Enough
of that infrastructure is now in place that with a few modifications
for Linux it can be supported.
Auto-expand works as follows; when a block device is modified
(re-sized, closed after being open r/w, etc) a change uevent is
generated for udev. The ZED, which is monitoring udev events,
passes the change event along to zfs_deliver_dle() if the disk
or partition contains a zfs_member as identified by blkid.
From here the device is matched against all imported pool vdevs
using the vdev_guid which was read from the label by blkid. If
a match is found the ZED reopens the pool vdev. This re-opening
is important because it allows the vdev to be briefly closed so
the disk partition table can be re-read. Otherwise, it wouldn't
be possible to report the maximum possible expansion size.
Finally, if the property autoexpand=on a vdev expansion will be
attempted. After performing some sanity checks on the disk to
verify that it is safe to expand, the primary partition (-part1)
will be expanded and the partition table updated. The partition
is then re-opened (again) to detect the updated size which allows
the new capacity to be used.
In order to make all of the above possible the following changes
were required:
* Updated the zpool_expand_001_pos and zpool_expand_003_pos tests.
These tests now create a pool which is layered on a loopback,
scsi_debug, and file vdev. This allows for testing of non-
partitioned block device (loopback), a partition block device
(scsi_debug), and a file which does not receive udev change
events. This provided for better test coverage, and by removing
the layering on ZFS volumes there issues surrounding layering
one pool on another are avoided.
* zpool_find_vdev_by_physpath() updated to accept a vdev guid.
This allows for matching by guid rather than path which is a
more reliable way for the ZED to reference a vdev.
* Fixed zfs_zevent_wait() signal handling which could result
in the ZED spinning when a signal was not handled.
* Removed vdev_disk_rrpart() functionality which can be abandoned
in favor of kernel provided blkdev_reread_part() function.
* Added a rwlock which is held as a writer while a disk is being
reopened. This is important to prevent errors from occurring
for any configuration related IOs which bypass the SCL_ZIO lock.
The zpool_reopen_007_pos.ksh test case was added to verify IO
error are never observed when reopening. This is not expected
to impact IO performance.
Additional fixes which aren't critical but were discovered and
resolved in the course of developing this functionality.
* Added PHYS_PATH="/dev/zvol/dataset" to the vdev configuration for
ZFS volumes. This is as good as a unique physical path, while the
volumes are not used in the test cases anymore for other reasons
this improvement was included.
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#120Closes#2437Closes#5771Closes#7366Closes#7582Closes#7629
This project's goal is to make read-heavy channel programs and zfs(1m)
administrative commands faster by caching all the metadata that they will
need in the dbuf layer. This will prevent the data from being evicted, so
that any future call to i.e. zfs get all won't have to go to disk (very
much). There are two parts:
The dbuf_metadata_cache. We identify what to put into the cache based on
the object type of each dbuf. Caching objset properties os
{version,normalization,utf8only,casesensitivity} in the objset_t. The reason
these needed to be cached is that although they are queried frequently,
they aren't stored in a dbuf type which we can easily recognize and cache in
the dbuf layer; instead, we have to explicitly store them. There's already
existing infrastructure for maintaining cached properties in the objset
setup code, so I simply used that.
Performance Testing:
- Disabled kmem_flags
- Tuned dbuf_cache_max_bytes very low (128K)
- Tuned zfs_arc_max very low (64M)
Created test pool with 400 filesystems, and 100 snapshots per filesystem.
Later on in testing, added 600 more filesystems (with no snapshots) to make
sure scaling didn't look different between snapshots and filesystems.
Results:
| Test | Time (trunk / diff) | I/Os (trunk / diff) |
+------------------------+---------------------+---------------------+
| zpool import | 0:05 / 0:06 | 12.9k / 12.9k |
| zfs get all (uncached) | 1:36 / 0:53 | 16.7k / 5.7k |
| zfs get all (cached) | 1:36 / 0:51 | 16.0k / 6.0k |
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
OpenZFS-issue: https://illumos.org/issues/9337
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7dec52fCloses#7668
Commit 93b43af10 inadvertently introduced the following scenario which
can result in a deadlock. This issue was most easily reproduced by
LXD containers using a ZFS storage backend but should be reproducible
under any workload which is frequently mounting and unmounting.
-- THREAD A --
spa_sync()
spa_sync_upgrades()
rrw_enter(&dp->dp_config_rwlock, RW_WRITER, FTAG); <- Waiting on B
-- THREAD B --
mount_fs()
zpl_mount()
zpl_mount_impl()
dmu_objset_hold()
dmu_objset_hold_flags()
dsl_pool_hold()
dsl_pool_config_enter()
rrw_enter(&dp->dp_config_rwlock, RW_READER, tag);
sget()
sget_userns()
grab_super()
down_write(&s->s_umount); <- Waiting on C
-- THREAD C --
cleanup_mnt()
deactivate_super()
down_write(&s->s_umount);
deactivate_locked_super()
zpl_kill_sb()
kill_anon_super()
generic_shutdown_super()
sync_filesystem()
zpl_sync_fs()
zfs_sync()
zil_commit()
txg_wait_synced() <- Waiting on A
Reviewed by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7598Closes#7659Closes#7691Closes#7693
Update the SA_COPY_DATA macro to check if architecture supports
efficient unaligned memory accesses at compile time. Otherwise
fallback to using the sa_copy_data() function.
The kernel provided CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is
used to determine availability in kernel space. In user space
the x86_64, x86, powerpc, and sometimes arm architectures will
define the HAVE_EFFICIENT_UNALIGNED_ACCESS macro.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7642Closes#7684
Ztest failed with the following crash.
::status
debugging core file of ztest (64-bit) from clone-dc-slave-280-bc7947b1.dcenter
file: /usr/bin/amd64/ztest
initial argv: /usr/bin/amd64/ztest
threading model: raw lwps
status: process terminated by SIGABRT (Abort), pid=2150 uid=1025 code=-1
panic message: failure for thread 0xfffffd7fff112a40, thread-id 1: unprotected error in call to Lua API (Invalid
value type 'function' for key 'error')
::stack
libc.so.1`_lwp_kill+0xa()
libc.so.1`_assfail+0x182(fffffd7fffdfe8d0, 0, 0)
libc.so.1`assfail+0x19(fffffd7fffdfe8d0, 0, 0)
libzpool.so.1`vpanic+0x3d(fffffd7ffaa58c20, fffffd7fffdfeb00)
0xfffffd7ffaa28146()
0xfffffd7ffaa0a109()
libzpool.so.1`luaD_throw+0x86(3011a48, 2)
0xfffffd7ffa9350d3()
0xfffffd7ffa93e3f1()
libzpool.so.1`zcp_lua_to_nvlist+0x33(3011a48, 1, 2686470, fffffd7ffaa2e2c3)
libzpool.so.1`zcp_convert_return_values+0xa4(3011a48, 2686470, fffffd7ffaa2e2c3, fffffd7fffdfedd0)
libzpool.so.1`zcp_pool_error+0x59(fffffd7fffdfedd0, 1e0f450)
libzpool.so.1`zcp_eval+0x6f8(1e0f450, fffffd7ffaa483f8, 1, 0, 6400000, 1d33b30)
libzpool.so.1`dsl_destroy_snapshots_nvl+0x12c(2786b60, 0, 484750)
libzpool.so.1`dsl_destroy_snapshot+0x4f(fffffd7fffdfef70, 0)
ztest_dsl_dataset_cleanup+0xea(fffffd7fffdff4c0, 1)
ztest_dataset_destroy+0x53(1)
ztest_run+0x59f(fffffd7fff0e0498)
main+0x7ff(1, fffffd7fffdffa88)
_start+0x6c()
The problem is that zcp_convert_return_values() assumes that there's
exactly one value on the stack, but that isn't always true. It ends up
putting the wrong thing on the stack which is then consumed by
zcp_convert_return values, which either adds the wrong message to the
nvlist, or blows up.
The fix is to make sure that callers of zcp_convert_return_values()
clear the stack before pushing their error message, and
zcp_convert_return_values() should VERIFY that the stack is the expected
size.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
OpenZFS-issue: https://www.illumos.org/issues/9424
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/eb7e57429Closes#7696
When running zdb without additional arguments against a pool containing
a checkpoint the entire checkpoint spacemap should not be dumped. Make
this behavior conditional upon passing the -mmmm option as described in
the zdb(8) man page.
-mmmm Display every spacemap record.
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7702
When we do a scrub or resilver, ZFS counts the different types of blocks,
which can be printed by the ::zfs_blkstats mdb dcmd. However, it fails to
count embedded blocks.
Porting notes:
* Commit d4a72f23 moved count_blocks under a BP_IS_EMBEDDED conditional
as part of the sequential resilver functionality. Since phys_birth
would be zero that case should never happen as described above. This
is confirmed by the code coverage analysis. Remove the conditional
to realign that aspect of this function with OpenZFS.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
OpenZFS-issue: https://www.illumos.org/issues/9454
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d8a447a7Closes#7697
Problem
=======
Illumos bug 8373 was integrated, which now presents a code path where
"dmu_tx_assign" can fail. When "dmu_tx_assign" fails, it will not issue
the lwb that was passed in to "zil_lwb_write_issue". As a result, when
"zil_lwb_write_issue" returns, the lwb will still be in the "opened"
state, just as it was when "zil_lwb_write_issue" was originally called.
Solution
========
As a result of this new call path, the failed assertion needs to be
modified to be aware of this new possibility. Thus, we can only assert
that the lwb is no longer in the "opened" state if the returned lwb is
non-null, since we cannot differentiate between the case of
"dmu_tx_assign" failing or "zio_alloc_zil" failing within the call to
"zil_lwb_write_issue".
Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Matt Ahrens <mahrens@delphix.com>
OpenZFS-issue: https://www.illumos.org/issues/9456
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a8b09f4eCloses#7695
Datasets that are deeply nested (~100 levels) are impractical. We just
put a limit of 50 levels to newly created datasets. Existing datasets
should work without a problem.
The problem can be seen by attempting to create a dataset using the -p
option with many levels:
panic[cpu0]/thread=ffffff01cd282c20: BAD TRAP: type=8 (#df Double fault) rp=ffffffff
fffffffffbc3aa60 unix:die+100 ()
fffffffffbc3ab70 unix:trap+157d ()
ffffff00083d7020 unix:_patch_xrstorq_rbx+196 ()
ffffff00083d7050 zfs:dbuf_rele+2e ()
...
ffffff00083d7080 zfs:dsl_dir_close+32 ()
ffffff00083d70b0 zfs:dsl_dir_evict+30 ()
ffffff00083d70d0 zfs:dbuf_evict_user+4a ()
ffffff00083d7100 zfs:dbuf_rele_and_unlock+87 ()
ffffff00083d7130 zfs:dbuf_rele+2e ()
... The block above repeats once per directory in the ...
... create -p command, working towards the root ...
ffffff00083db9f0 zfs:dsl_dataset_drop_ref+19 ()
ffffff00083dba20 zfs:dsl_dataset_rele+42 ()
ffffff00083dba70 zfs:dmu_objset_prefetch+e4 ()
ffffff00083dbaa0 zfs:findfunc+23 ()
ffffff00083dbb80 zfs:dmu_objset_find_spa+38c ()
ffffff00083dbbc0 zfs:dmu_objset_find+40 ()
ffffff00083dbc20 zfs:zfs_ioc_snapshot_list_next+4b ()
ffffff00083dbcc0 zfs:zfsdev_ioctl+347 ()
ffffff00083dbd00 genunix:cdev_ioctl+45 ()
ffffff00083dbd40 specfs:spec_ioctl+5a ()
ffffff00083dbdc0 genunix:fop_ioctl+7b ()
ffffff00083dbec0 genunix:ioctl+18e ()
ffffff00083dbf10 unix:brand_sys_sysenter+1c9 ()
Porting notes:
* Added zfs_max_dataset_nesting module option with documentation.
* Updated zfs_rename_014_neg.ksh for Linux.
* Increase the zfs.sh stack warning to 15K. Enough time has passed
that 16K can be reasonably assumed to be the default value. It
was increased in the 3.15 kernel released in June of 2014.
Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
OpenZFS-issue: https://www.illumos.org/issues/9330
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/757a75aCloses#7681
Remove the dependency on partitionable devices for the clean_mirror
and scrub_mirror test cases. This allows for the setup and cleanup
of the test cases to be simplified by removing the need for complex
partitioning.
This change also resolves a issue where the clean_mirror devices
were not being properly damaged since the device name was not a
full path. The result being loopX files were being left in the
top level test_results directory.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7434Closes#7690
Add a default 4 KiB ashift for Amazon EC2 NVMe devices on instances with
NVMe ephemeral devices, such as the types c5d, f1, i3 and m5d.
As per the official documentation [1] a 4096 byte blocksize should be
used to match the underlying hardware.
The string was identified via:
$ sudo sginfo -M /dev/nvme0n1
INQUIRY response (cmd: 0x12)
----------------------------
Device Type 0
Vendor: NVMe
Product: Amazon EC2 NVMe
Revision level:
$ lsblk -io KNAME,TYPE,SIZE,MODEL
KNAME TYPE SIZE MODEL
nvme0n1 disk 442.4G Amazon EC2 NVMe Instance Storage
[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/
storage-optimized-instances.html
Retrived 2018-07-03
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Troels Nørgaard <tnn@tradeshift.com>
Closes#7676
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#7683
Motivation
==========
The current space map encoding has the following disadvantages:
[1] Assuming 512 sector size each entry can represent at most 16MB for a segment.
This makes the encoding very inefficient for large regions of space.
[2] As vdev-wide space maps have started to be used by new features (i.e.
device removal, zpool checkpoint) we've started imposing limits in the
vdevs that can be used with them based on the maximum addressable offset
(currently 64PB for a top-level vdev).
New encoding
============
The layout can be found at space_map.h and it remains backwards compatible with
the old one. The introduced two-word entry format, besides extending the limits
imposed by the single-entry layout, also includes a vdev field and some extra
padding after its prefix.
The extra padding after the prefix should is reserved for future usage (e.g.
new prefixes for future encodings or new fields for flags). The new vdev field
not only makes the space maps more self-descriptive, but also opens the doors
for pool-wide space maps (expected to be used in the log spacemap project).
One final important note is that the number of bits used for vdevs is reduced
to 24 bits for blkptrs. That was decided as we don't know of any setups that
use more than 16M vdevs for the time being and we wanted to fit the vdev field
in the space map. In addition that gives us some extra bits in dva_t.
Other references:
=================
The new encoding is also discussed towards the end of the Log Space Map
presentation from 2017's OpenZFS summit.
Link: https://www.youtube.com/watch?v=jj2IxRkl5bQ
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/90a56e6d
OpenZFS-issue: https://www.illumos.org/issues/9238Closes#7665
This change introduces a new performance test which does random reads
and writes, but instead of using `bssplit` to determine the block size,
it uses a fixed blocksize. Additionally, some new IO sizes are added to
other tests and timestamp data is recorded with the performance data.
Authored by: Ahmed Gahnem <ahmedg@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Ported-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: John Wren Kennedy <john.kennedy@delphix.com>
Requires-builders: perf
OpenZFS-issue: https://www.illumos.org/issues/9184
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/659
External-issue: DLPX-46724
Closes#7660
CID 176037: Uninitialized scalar variable
This patch fixes an uninitialized variable defect caught by
coverity and introduced in 69830602
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7667
The enospc_002_pos test case would frequently fail due a command
succeeding when it was expected to fail due to lack of space.
In order to make this far less likely, files are created across
multiple transaction groups in order to consume as many unused
blocks as possible.
The dependency that the tests run on a partitioned block device
has been removed. It's simpler to use sparse files.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7663
Currently, there is a bug where older send streams without the
DMU_BACKUP_FEATURE_LARGE_DNODE flag are not handled correctly.
The code in receive_object() fails to handle cases where
drro->drr_dn_slots is set to 0, which is always the case when the
sending code does not support this feature flag. This patch fixes
the issue by ensuring that that a value of 0 is treated as
DNODE_MIN_SLOTS.
Tested-by: DHE <git@dehacked.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7617Closes#7662
This patch fixes two problems with the encryption code. First, the
current code does not correctly prohibit the DMU from updating
dn_maxblkid during object truncation within a raw receive. This
usually only causes issues when the truncating DRR_FREE record is
aggregated with DRR_FREE records later in the receive, so it is
relatively hard to hit.
Second, this patch fixes a security issue where reading blocks
within an encrypted object did not guarantee that the dnode block
itself had ever been verified against its MAC. Usually the
verification happened anyway when the bonus buffer was read, but
some use cases (notably zvols) might never perform the check.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7632
The formatting of the features beginning with large_blocks was broken
when the zpool_checkpoint feature was added.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7658
Add checkpoint field in the default list of the zpool-list man page
Authored by: Eitan Adler <lists@eitanadler.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: kpande <github@tripleback.net>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://illumos.org/issues/9521
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c5a860f7bCloses#7658
Details about the motivation of this feature and its usage can
be found in this blogpost:
https://sdimitro.github.io/post/zpool-checkpoint/
A lightning talk of this feature can be found here:
https://www.youtube.com/watch?v=fPQA8K40jAM
Implementation details can be found in big block comment of
spa_checkpoint.c
Side-changes that are relevant to this commit but not explained
elsewhere:
* renames members of "struct metaslab trees to be shorter without
losing meaning
* space_map_{alloc,truncate}() accept a block size as a
parameter. The reason is that in the current state all space
maps that we allocate through the DMU use a global tunable
(space_map_blksz) which defauls to 4KB. This is ok for metaslab
space maps in terms of bandwirdth since they are scattered all
over the disk. But for other space maps this default is probably
not what we want. Examples are device removal's vdev_obsolete_sm
or vdev_chedkpoint_sm from this review. Both of these have a
1:1 relationship with each vdev and could benefit from a bigger
block size.
Porting notes:
* The part of dsl_scan_sync() which handles async destroys has
been moved into the new dsl_process_async_destroys() function.
* Remove "VERIFY(!(flags & FWRITE))" in "kernel.c" so zhack can write
to block device backed pools.
* ZTS:
* Fix get_txg() in zpool_sync_001_pos due to "checkpoint_txg".
* Don't use large dd block sizes on /dev/urandom under Linux in
checkpoint_capacity.
* Adopt Delphix-OS's setting of 4 (spa_asize_inflation =
SPA_DVAS_PER_BP + 1) for the checkpoint_capacity test to speed
its attempts to fill the pool
* Create the base and nested pools with sync=disabled to speed up
the "setup" phase.
* Clear labels in test pool between checkpoint tests to avoid
duplicate pool issues.
* The import_rewind_device_replaced test has been marked as "known
to fail" for the reasons listed in its DISCLAIMER.
* New module parameters:
zfs_spa_discard_memory_limit,
zfs_remove_max_bytes_pause (not documented - debugging only)
vdev_max_ms_count (formerly metaslabs_per_vdev)
vdev_min_ms_count
Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://illumos.org/issues/9166
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7159fdb8Closes#7570
Otherwise the output is consumed by the output redirection.
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#7570
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: bunder2015 <omfgbunder@gmail.com>
Signed-off-by: ajs124 <git@ajs124.de>
Closes#7649
ms_shift can be incorrectly changed changed in MOS config for
indirect vdevs that have been historically expanded
According to spa_config_update() we expect new vdevs to have
vdev_ms_array equal to 0 and then we go ahead and set their metaslab
size. The problem is that indirect vdevs also have vdev_ms_array == 0
because their metaslabs are destroyed once their removal is done.
As a result, if a vdev was expanded and then removed may have its
ms_shift changed if another vdev was added after its removal.
Fortunately this behavior does not cause any type of crash or bad
behavior in the kernel but it can confuse zdb and anyone doing any kind
of analysis of the history of the pools.
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/651
OpenZFS-issue: https://illumos.org/issues/9591a
External-issue: DLPX-58879
Closes#7644
For zio taskq's which have multiple instances (e.g. z_rd_int_0,
z_rd_int_1, etc), each one has a unique name (the _0, _1, _2 suffix).
This makes performance analysis more difficult, because by default,
`perf` includes the thread name (which is the same as the taskq name) in
the stack trace. This means that we get 8 different stacks, all of
which are doing the same thing, but are executed from different taskq's.
We should remove the suffix of the taskq name, so that all the
read-interrupt threads are named z_rd_int.
Note that we already support multiple taskq's with the same name. This
happens when there are multiple pools. In this case the taskq has a
different tq_instance, which shows up in /proc/spl/taskq-all.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#7646
Adopt and extend the OpenZFS ZTS results analysis script for use
with ZFS on Linux. This allows for automatic analysis of tests
which may be skipped for a variety or reasons or which are not
entirely reliable.
In addition to the list of 'known' failures, which have been updated
for ZFS on Linux, there in a new 'maybe' section. This mapping
include tests which might be correctly skipped depending on the
test environment. This may be because of a missing dependency or
lack of required kernel support. This list also includes tests
which normally pass but might on occasion fail for a harmless
reason.
The script was also extended include a reason for why a given test
might be skipped or may fail. The reason will be included after
the test in the "results other than PASS that are expected" section.
For failures it is preferable to set the reason to the GitHub issue
number and for skipped tests several generic reasons are available.
You may also specify a custom reason if needed.
All tests were added back in to the linux.run file even if they are
expected to failed. There is value in running tests which may not
pass, the expected results for these tests has been encoded in
the new analysis script.
All tests which were disabled because they ran more slowly on a
32-bit system have been re-enabled. Developers working on 32-bit
systems should assess what it reasonable for their environment.
The unnecessary dependency on physical block devices was removed for
the checksum, grow_pool, and grow_replicas test groups so they are
no longer skipped. Updated the filetest_001_pos test case to run
properly now that it is enabled and moved the grow tests in to a
single directory.
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7638
The blk_queue_stackable() function was replaced in the 4.14 kernel
by queue_is_rq_based(), commit torvalds/linux@5fdee212. This change
resulted in the default elevator being used which can negatively
impact performance.
Rather than adding additional compatibility code to detect the
new interface unconditionally attempt to set the elevator. Since
we expect this to fail for block devices without an elevator the
error message has been moved in to zfs_dbgmsg().
Finally, it was observed that the elevator_change() was removed
from the 4.12 kernel, commit torvalds/linux@c033269. Update the
comment to clearly specify which are expected to export the
elevator_change() symbol.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7645
Commit torvalds/linux@95582b0 changes the inode i_atime, i_mtime,
and i_ctime members form timespec's to timespec64's to make them
2038 safe. As part of this change the current_time() function was
also updated to return the timespec64 type.
Resolve this issue by introducing a new inode_timespec_t type which
is defined to match the timespec type used by the inode. It should
be used when working with inode timestamps to ensure matching types.
The timestruc_t type under Illumos was used in a similar fashion but
was specified to always be a timespec_t. Rather than incorrectly
define this type all timespec_t types have been replaced by the new
inode_timespec_t type.
Finally, the kernel and user space 'sys/time.h' headers were aligned
with each other. They define as appropriate for the context several
constants as macros and include static inline implementation of
gethrestime(), gethrestime_sec(), and gethrtime().
Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7643
This patch simply adds an ASSERT that confirms that the last
decrypting reference on a dataset waits until the dataset is
no longer dirty. This should help to debug issues where the
ZIO layer cannot find encryption keys after a dataset has been
disowned.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7637
The recent SPL merge caused a regression in kernels with ZFS integrated
into the sources where our modules would be initialized in alphabetical
order, despite icp requiring the spl module be loaded first. This caused
kernels with ZFS builtin to fail to boot.
We resolve this by adding a special case for the spl that lists it
first. It is somewhat ugly, but it works.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Thode <prometheanfire@gentoo.org>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#7595Closes#7606
This patch adds tunables for modifying the maximum memory limit and
maximum instruction limit that can be specified when running a channel
program.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov
Reviewed-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
External-issue: LX-1085
Closes#7618
Added support for the bops->check_events() interface which was
added in the 2.6.38 kernel to replace bops->media_changed().
Fully implementing this functionality allows the volume resize
code to rely on revalidate_disk(), which is the preferred
mechanism, and removes the need to use check_disk_size_change().
In order for bops->check_events() to lookup the zvol_state_t
stored in the disk->private_data the zvol_state_lock needs to
be held. Since the check events interface may poll the mutex
has been converted to a rwlock for better concurrently. The
rwlock need only be taken as a writer in the zvol_free() path
when disk->private_data is set to NULL.
The configure checks for the block_device_operations structure
were consolidated in a single kernel-block-device-operations.m4
file.
The ZFS_AC_KERNEL_BDEV_BLOCK_DEVICE_OPERATIONS configure checks
and assoicated dead code was removed. This interface was added
to the 2.6.28 kernel which predates the oldest supported 2.6.32
kernel and will therefore always be available.
Updated maximum Linux version in META file. The 4.17 kernel
was released on 2018-06-03 and ZoL is compatible with the
finalized kernel.
Reviewed-by: Boris Protopopov <boris.protopopov@actifio.com>
Reviewed-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7611
Reserve bit 25 for the ZSTD compression feature from FreeBSD.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Closes#7626