Commit 98bb45e resolved a deadlock which could occur when
handling a page fault in zfs_write(). This change added
the uio_fault_disable field to the uio structure but failed
to initialize it to B_FALSE. This uninitialized field would
cause uiomove_iov() to call __copy_from_user_inatomic()
instead of copy_from_user() resulting in unexpected EFAULTs.
Resolve the issue by fully initializing the uio, and clearing
the uio_fault_disable flags after it's used in zfs_write().
Additionally, reorder the uio_t field assignments to match
the order the fields are declared in the structure.
Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#8640Closes#8719
Exported and documented a new module parameter.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes#8706
When receiving a DRR_OBJECT record the receive_object() function
needs to determine how to handle a spill block associated with the
object. It may need to be removed or kept depending on how the
object was modified at the source.
This determination is currently accomplished using a heuristic which
takes in to account the DRR_OBJECT record and the existing object
properties. This is a problem because there isn't quite enough
information available to do the right thing under all circumstances.
For example, when only the block size changes the spill block is
removed when it should be kept.
What's needed to resolve this is an additional flag in the DRR_OBJECT
which indicates if the object being received references a spill block.
The DRR_OBJECT_SPILL flag was added for this purpose. When set then
the object references a spill block and it must be kept. Either
it is update to date, or it will be replaced by a subsequent DRR_SPILL
record. Conversely, if the object being received doesn't reference
a spill block then any existing spill block should always be removed.
Since previous versions of ZFS do not understand this new flag
additional DRR_SPILL records will be inserted in to the stream.
This has the advantage of being fully backward compatible. Existing
ZFS systems receiving this stream will recreate the spill block if
it was incorrectly removed. Updated ZFS versions will correctly
ignore the additional spill blocks which can be identified by
checking for the DRR_SPILL_UNMODIFIED flag.
The small downside to this approach is that is may increase the size
of the stream and of the received snapshot on previous versions of
ZFS. Additionally, when receiving streams generated by previous
unpatched versions of ZFS spill blocks may still be lost.
OpenZFS-issue: https://www.illumos.org/issues/9952
FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=233277
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#8668
`zfs set atime|relatime=off|on` doesn't disable or enable the property
on read for datasets whose property was inherited from parent, until
a dataset is once unmounted and mounted again.
(The properties start to work properly if a dataset is once unmounted
and mounted again. The difference comes from regular mount process,
e.g. via zpool import, uses mount options based on properties read
from ondisk layout for each dataset, whereas
`zfs set atime|relatime=off|on` just remounts a specified dataset.)
--
# zpool create p1 <device>
# zfs create p1/f1
# zfs set atime=off p1
# echo test > /p1/f1/test
# sync
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
p1 176K 18.9G 25.5K /p1
p1/f1 26K 18.9G 26K /p1/f1
# zfs get atime
NAME PROPERTY VALUE SOURCE
p1 atime off local
p1/f1 atime off inherited from p1
# stat /p1/f1/test | grep Access | tail -1
Access: 2019-04-26 23:32:33.741205192 +0900
# cat /p1/f1/test
test
# stat /p1/f1/test | grep Access | tail -1
Access: 2019-04-26 23:32:50.173231861 +0900
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ changed by read(2)
--
The problem is that zfsvfs::z_atime which was probably intended to keep
incore atime state just gets updated by a callback function of "atime"
property change, atime_changed_cb(), and never used for anything else.
Since now that all file read and atime update use a common function
zpl_iter_read_common() -> file_accessed(), and whether to update atime
via ->dirty_inode() is determined by atime_needs_update(),
atime_needs_update() needs to return false once atime is turned off.
It currently continues to return true on `zfs set atime=off`.
Fix atime_changed_cb() by setting or dropping SB_NOATIME in VFS super
block depending on a new atime value, so that atime_needs_update() works
as expected after property change.
The same problem applies to "relatime" except that a self contained
relatime test is needed. This is because relatime_need_update() is based
on a mount option flag MNT_RELATIME, which doesn't exist in datasets
with inherited "relatime" property via `zfs set relatime=...`, hence it
needs its own relatime test zfs_relatime_need_update().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes#8674Closes#8675
Drop duplicated phrases in comments.
Also drop an obsolete comment "Perform a mount of the associated...",
as all it does now is get objid from DMU and lookup incore inode.
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes#8707
Linux kernel commit ca79b0c211af63fa3276f0e3fd7dd9ada2439839
"mm: convert totalram_pages and totalhigh_pages variables to atomic"
replaced `totalhigh_pages` with an inline function `totalhigh_pages()`.
This broke compilation on IA32, etc, as ZoL uses `totalhigh_pages`
on archs with highmem. Confirmed on Fedora 30 (5.0.9-301.fc30.i686).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes#8677Closes#8701
The kernel function which adds new zvols as disks to the system,
add_disk(), briefly opens and closes the zvol as part of its work.
Closing a zvol involves waiting for two txgs to sync. This, combined
with the fact that the taskq processing new zvols is single threaded,
makes this processing new zvols slow.
Waiting for these txgs to sync is only necessary if the zvol has been
written to, which is not the case during add_disk(). This change adds
tracking of whether a zvol has been written to so that we can skip the
txg_wait_synced() calls when they are unnecessary.
This change also fixes the flags passed to blkdev_get_by_path() by
vdev_disk_open() to be FMODE_READ | FMODE_WRITE | FMODE_EXCL instead of
just FMODE_EXCL. The flags were being incorrectly calculated because
we were using the wrong version of vdev_bdev_mode().
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes#8526Closes#8615
The comment in lz4_compress_zfs could be more clear and specific. It
also contains needlessly strong language.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes: #8702Closes: #8703
The 'zpool resilver' command requires that the resilver_defer
feature is active on the pool. Unfortunately, the check for
this was left out of the original patch. This commit simply
corrects this so that the command properly returns an error
in this case.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#8700
The size argument of snprintf(3) in glibc and snprintf() in Linux
kernel includes trailing \0, as snprintf(3) man page explains it as
"write at most size bytes (including the trailing null byte ('\0'))",
i.e. snprintf() can just take buffer size.
e.g. For snprintf() in module/zfs/zfs_ctldir.c, a buffer size is
MAXPATHLEN, and a caller is passing MAXPATHLEN to snprintf(), so size
should just be `path_len` to do what the caller is trying to do.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes#8692
Not all block devices, notably scsi_debug, set a root_blkg on the
request queue. Remove this assertion and allow the the existing
call to blkg_tryget() to gracefully handle the NULL (which it does).
Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#8678
Use NV_ENCODE_NATIVE for nvlist encoding variable instead of 0.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes#8653
Use either SEEK_* or 0,1,2..., but not both.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes#8656
This patch fixes 2 issues with the DMU free throttle implemented
in dmu_free_long_range(). The first issue is that get_next_chunk()
was calculating the number of L1 blocks the free would dirty
incorrectly. In some cases involving extremely large files, this
code would greatly overestimate the number of effected L1 blocks,
causing excessive calls to txg_wait_open(). This patch corrects
the calculation.
The second issue is that the free throttle uses the total number
of free'd blocks in all (open, quiescing, and syncing) txgs to
determine whether to throttle. This causes large frees (such as
those created by the first issue) to cause 4 txg syncs before
any further frees were allowed to proceed. This patch ensures
that the accounting is done entirely in a per-txg fashion, so
that frees from a given txg don't affect those that immediately
follow it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#8655
Unused since 5649246dd3("Remove znode move functionality"),
and ZNODE_STAT_ADD() will never be needed.
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes#8636
These aren't unused.
`flag` in zfs_create() also isn't to indicate large file.
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes#8635
1. Support QAT when ZFS is root file-system:
When ZFS module is loaded before QAT started, the QAT can
be started again in post-process, e.g.:
echo 0 > /sys/module/zfs/parameters/zfs_qat_compress_disable
echo 0 > /sys/module/zfs/parameters/zfs_qat_encrypt_disable
echo 0 > /sys/module/zfs/parameters/zfs_qat_checksum_disable
2. Verify alder checksum of the de-compress result
3. Allocate Digest, IV and AAD buffer in physical contiguous
memory by QAT_PHYS_CONTIG_ALLOC.
4. Update the documentation for zfs_qat_compress_disable,
zfs_qat_checksum_disable, zfs_qat_encrypt_disable.
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Weigang Li <weigang.li@intel.com>
Signed-off-by: Chengfeix Zhu <chengfeix.zhu@intel.com>
Closes#8323Closes#8610
This replaces empty for loops with while loops to make the code easier
to read.
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reported-by: github.com/dcb314
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes#6681Closes#6682Closes#6683Closes#8623
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes#8626
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes#8626
When receiving a raw send stream only reallocated objects
whose contents were not freed by the standard indicators
should call dmu_free_long_range().
Furthermore, if calling dmu_free_long_range() is required
then the objects current block size must be used and not
the new block size.
Two additional test cases were added to provided realistic
test coverage for processing reallocated objects which are
part of a raw receive.
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#8528Closes#8607
This patch simply up cleans up a nit and corrects an error message
issue that were introduced in the Multiple DVA scrub patch.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#8619
When receiving an object to a previously allocated interior slot
the new object should be "allocated" by setting DMU_NEW_OBJECT,
not "reallocated" with dnode_reallocate(). For resilience verify
the slot is free as required in case the stream is malformed.
Add a test case to generate more realistic incremental send streams
that force reallocation to occur during the receive.
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#8067Closes#8614
Fix style issue for 'tx->tx_txg&TXG_MASK'. There should be white
space around the '&' character. Split the dnode_reallocate() ASSERT
to make it more readable to clearly separate the checks.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#8606
The error path in zio_crypt_key_unwrap would call zio_crypt_key_destroy which
calls rw_destroy(&key->zk_salt_lock); which has not yet been initialized.
We move the rw_init() call to the start of zio_crypt_key_unwrap instead.
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes#8604Closes#8605
The bulk[] array index, count, must be reset per-iteration in order to
not overwrite the stack.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#8072Closes#8597Closes#8601
d12614521a("Fixes for procfs files backed by linked lists")
uses PDE_DATA(), but since PDE_DATA() (public interface which
replaced old public interface PDE()) first appeared in upstream
kernel 3.10, it lacks visible local definition for kernel < 3.10.
Move the local PDE_DATA() definition to a ZoL header, to unbreak
build on kernel < 3.10.
--
module/spl/spl-procfs-list.c: In function 'procfs_list_open':
module/spl/spl-procfs-list.c:166: error: implicit declaration of function 'PDE_DATA'
module/spl/spl-procfs-list.c:166: warning: assignment makes pointer from integer without a cast
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes#8599
This partially reverts commit 5dbf8b4ed. This change resolved
the issues observed with truncated files in raw sends. However,
the required changes to dnode_allocate() introduced a regression
for non-raw streams which needs to be understood.
The additional debugging improvements from the original patch
were not reverted.
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #7378
Issue #8528
Issue #8540
Issue #8565Close#8584
When a pool is initially created (by `zpool create`), predictive
prefetch is inadvertently disabled, until the pool is export/import-ed,
or the machine is rebooted.
When device removal was introduced, we added some code to disable
predictive prefetching until indirect vdevs have been loaded. This
resulted in the "default state" of prefetch being disabled, until we
proactively enable it after indirect vdevs are loaded. Unfortunately
this resulted in a few bugs where in some code paths we neglect to
enable predictive prefetch. The first of these was fixed by
20507534d4
This commit fixes another case where we also need to explicitly enable
predictive prefetch, when the pool is initially created.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#8577
The features.kernel layout should match features.pool.
Reviewed-by: Sara Hartse <sara.hartse@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes#8566
There are several places where we use zfs_dbgmsg and %p to
print pointers. In the Linux kernel, these values obfuscated
to prevent information leaks which means the pointers aren't
very useful for debugging crash dumps. We decided to restrict
the permissions of dbgmsg (and some other kstats while we were
at it) and print pointers with %px in zfs_dbgmsg as well as
spl_dumpstack
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Signed-off-by: sara hartse <sara.hartse@delphix.com>
Closes#8467Closes#8476
Callers of txg_wait_open() which set should_quiesce=B_TRUE should be
accounted for as iowait time. Otherwise, the caller is understood
to be idle and cv_wait_sig() is used to prevent incorrectly inflating
the system load average.
Similarly txg_wait_wait() has been updated to use cv_wait_io() to
be accounted against iowait.
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#8550Closes#8558
UNMAP/TRIM support is a frequently-requested feature to help
prevent performance from degrading on SSDs and on various other
SAN-like storage back-ends. By issuing UNMAP/TRIM commands for
sectors which are no longer allocated the underlying device can
often more efficiently manage itself.
This TRIM implementation is modeled on the `zpool initialize`
feature which writes a pattern to all unallocated space in the
pool. The new `zpool trim` command uses the same vdev_xlate()
code to calculate what sectors are unallocated, the same per-
vdev TRIM thread model and locking, and the same basic CLI for
a consistent user experience. The core difference is that
instead of writing a pattern it will issue UNMAP/TRIM commands
for those extents.
The zio pipeline was updated to accommodate this by adding a new
ZIO_TYPE_TRIM type and associated spa taskq. This new type makes
is straight forward to add the platform specific TRIM/UNMAP calls
to vdev_disk.c and vdev_file.c. These new ZIO_TYPE_TRIM zios are
handled largely the same way as ZIO_TYPE_READs or ZIO_TYPE_WRITEs.
This makes it possible to largely avoid changing the pipieline,
one exception is that TRIM zio's may exceed the 16M block size
limit since they contain no data.
In addition to the manual `zpool trim` command, a background
automatic TRIM was added and is controlled by the 'autotrim'
property. It relies on the exact same infrastructure as the
manual TRIM. However, instead of relying on the extents in a
metaslab's ms_allocatable range tree, a ms_trim tree is kept
per metaslab. When 'autotrim=on', ranges added back to the
ms_allocatable tree are also added to the ms_free tree. The
ms_free tree is then periodically consumed by an autotrim
thread which systematically walks a top level vdev's metaslabs.
Since the automatic TRIM will skip ranges it considers too small
there is value in occasionally running a full `zpool trim`. This
may occur when the freed blocks are small and not enough time
was allowed to aggregate them. An automatic TRIM and a manual
`zpool trim` may be run concurrently, in which case the automatic
TRIM will yield to the manual TRIM.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Contributions-by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Contributions-by: Tim Chase <tim@chase2k.com>
Contributions-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#8419Closes#598
This patch fixes a few issues with raw receives involving
truncated files:
* dnode_reallocate() now calls dnode_set_blksz() instead of
dnode_setdblksz(). This ensures that any remaining dbufs with
blkid 0 are resized along with their containing dnode upon
reallocation.
* One of the calls to dmu_free_long_range() in receive_object()
needs to check that the object it is about to free some contents
or hasn't been completely removed already by a previous call to
dmu_free_long_object() in the same function.
* The same call to dmu_free_long_range() in the previous point
needs to ensure it uses the object's current block size and
not the new block size. This ensures the blocks of the object
that are supposed to be freed are completely removed and not
simply partially zeroed out.
This patch also adds handling for DRR_OBJECT_RANGE records to
dprintf_drr() for debugging purposes.
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7378Closes#8528
Make a local copy of the vd_path and preserve the removal error
for use in spa_history_log_internal(). This is required because
after spa_vdev_exit() there is nothing preventing the vdev state
from changing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes#8522
Added missing remove of detachable VDEV from txg's DTL list
to avoid use-after-free for the split VDEV
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Roman Strashkin <roman.strashkin@nexenta.com>
Closes#5565Closes#7856
ZFS supports O_RSYNC for read operations and when specified will ensure
the same level of data integrity that O_DSYNC and O_SYNC provides for
writes. O_RSYNC by itself has no effect so it must be combined with
either O_DSYNC or O_SYNC. However, many platforms don't support O_RSYNC
and have mapped O_SYNC to mean O_RSYNC within ZFS. This is incorrect
and causes unnecessary calls to zil_commit. Only platforms which
support O_RSYNC should implement the zil_commit functionality in the
read code path.
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes#8523
When Multihost is enabled, and a pool is imported, uberblock writes
include ub_mmp_delay to allow an importing node to calculate the
duration of an activity test. This value, is not enough information.
If zfs_multihost_fail_intervals > 0 on the node with the pool imported,
the safe minimum duration of the activity test is well defined, but does
not depend on ub_mmp_delay:
zfs_multihost_fail_intervals * zfs_multihost_interval
and if zfs_multihost_fail_intervals == 0 on that node, there is no such
well defined safe duration, but the importing host cannot tell whether
mmp_delay is high due to I/O delays, or due to a very large
zfs_multihost_interval setting on the host which last imported the pool.
As a result, it may use a far longer period for the activity test than
is necessary.
This patch renames ub_mmp_sequence to ub_mmp_config and uses it to
record the zfs_multihost_interval and zfs_multihost_fail_intervals
values, as well as the mmp sequence. This allows a shorter activity
test duration to be calculated by the importing host in most situations.
These values are also added to the multihost_history kstat records.
It calculates the activity test duration differently depending on
whether the new fields are present or not; for importing pools with
only ub_mmp_delay, it uses
(zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals
Which results in an activity test duration less sensitive to the leaf
count.
In addition, it makes a few other improvements:
* It updates the "sequence" part of ub_mmp_config when MMP writes
in between syncs occur. This allows an importing host to detect MMP
on the remote host sooner, when the pool is idle, as it is not limited
to the granularity of ub_timestamp (1 second).
* It issues writes immediately when zfs_multihost_interval is changed
so remote hosts see the updated value as soon as possible.
* It fixes a bug where setting zfs_multihost_fail_intervals = 1 results
in immediate pool suspension.
* Update tests to verify activity check duration is based on recorded
tunable values, not tunable values on importing host.
* Update tests to verify the expected number of uberblocks have valid
MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number),
that sequence number is incrementing, and that uberblock values match
tunable settings.
Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes#7842
In addition to dsl_dataset_evict_async() releasing a hold, there is
an error case in dsl_dataset_hold_obj() which had missed 4 additional
release calls. This was introduced in a1d477c24.
openzfsonosx-commit: https://github.com/openzfsonosx/zfs/commit/63ff7f1c
Authored by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#8517
If the buffer 'digest_buffer' is allocated in the qat_checksum()
stack, it can't ensure that the address is physically contiguous,
and the DMA result of the buffer may be handled incorrectly.
Using QAT_PHYS_CONTIG_ALLOC() ensures a physically
contiguous allocation.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Chengfei, Zhu <chengfeix.zhu@intel.com>
Closes#8323Closes#8521
Update the dirty check in dmu_offset_next() such that dnode's
are only considered dirty for the purpose or reporting holes
when there are pending data blocks or frees to be synced. This
ensures that when there are only metadata updates to be synced
(atime) that holes are reported.
Reviewed-by: Debabrata Banerjee <dbanerje@akamai.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#6958Closes#8505
As it turns out, on the Windows platform when rw_init() is called
(rather its bedrock call ExInitializeResourceLite) it is placed on
an active-list of locks, and is removed at rw_destroy() time.
dnode_move() has logic to copy over the old-dnode to new-dnode,
including calling dmu_zfetch_init(new-dnode). But due to the missing
dmu_zfetch_fini(old-dnode), kmem will call dnode_dest() to release the
memory (and in debug builds fill pattern 0xdeadbeef) over the Windows
active-lock's prev/next list pointers, making Windows sad.
But on other platforms, the contents of dmu_zfetch_fini() is one
call to list_destroy() and one to rw_destroy(), which is effectively
a no-op call and is not required. This commit is mostly for
"correctness" and can be skipped there.
Porting Notes:
* This leak exists on Linux but currently can never happen because
the dnode_move() functionality is not supported.
openzfsonosx-commit: openzfsonosx/zfs@d95fe517
Authored by: Julian Heuking <JulianH@beckhoff.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes#8519
When destroying an arc_buf_hdr_t its identity cannot be discarded
until it is entirely undiscoverable. This not only includes being
unhashed, but also being removed from the l2arc header list.
Discarding the header's identify prematurely renders the hash
lock useless because it will always hash to bucket zero.
This change resolves a race with l2arc_evict() by discarding the
identity after it has been removed from the l2arc header list.
This ensures either the header is not on the list or contains
the correct identify.
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7688Closes#8144
Currently, there is an issue in the sequential scrub code which
prevents self healing from working in some cases. The scrub code
will split up all DVA copies of a bp and issue each of them
separately. The problem is that, since each of the DVAs is no
longer associated with the others, the self healing code doesn't
have the opportunity to repair problems that show up in one of the
DVAs with the data from the others.
This patch fixes this issue by ensuring that all IOs issued by the
sequential scrub code include all DVAs. Initially, only the first
DVA of each is attempted. If an issue arises, the IO is retried
with all available copies, giving the self healing code a chance
to correct the issue.
To test this change, this patch also adds the ability for zinject
to specify individual DVAs to inject read errors into. We then
add a new test case that utilizes this functionality to ensure
scrubs and self-healing reads can handle and transparently fix
issues with individual copies of blocks.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#8453
The number of IO and checksum events should match the number of errors
seen in zpool status. Previously there was a mismatch between the
two counts because zpool status would only count unrecovered errors,
while zpool events would get an event for *all* errors (recovered or
not). This lead to situations where disks could be faulted for
"too many errors", while at the same time showing zero errors in zpool
status.
This fixes the zpool status error counters to increment at the same
times we post the error events.
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#4851Closes#7817
This patch simply fixes some small memory leaks that can happen
during error handling in zfsvfs_create_impl(). If the function
fails, it frees all the memory / references it created.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#8490
This patch attempts to address some user concerns that have arisen
since errata 4 was introduced.
* The errata warning has been made less scary for users without
any encrypted datasets.
* The errata warning now clears itself without a pool reimport if
the bookmark_v2 feature is enabled and no encrypted datasets
exist.
* It is no longer possible to create new encrypted datasets without
enabling the bookmark_v2 feature, thus helping to ensure that the
errata is resolved.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue ##8308
Closes#8504
Before sequential scrub patches ZFS never aggregated I/Os above 128KB.
Sequential scrub bumped that to 1MB, supposedly to reduce number of
head seeks for spinning disks. But for SSDs it makes little to no
sense, especially on FreeBSD, where due to MAXPHYS limitation device
will likely still see bunch of 128KB I/Os instead of one large.
Having more strict aggregation limit for SSDs allows to avoid
allocation of large memory buffer and copy to/from it, that is a
serious problem when throughput reaches gigabytes per second.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes#8494