filetest_001_pos verifies that various checksum algorithms detect
corruption by overwriting the underlying vdev on which a file resides.
It is possible for the overwrite to miss the blocks of a file, causing a
spurious failure. This change introduces a function to corrupt the
individual blocks of a file as determined by zdb.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: John Kennedy <john.kennedy@delphix.com>
Closes#9288
Get rid of the `get_used_prop` function. `get_prop used` works fine.
Fix the comment describing the function parameters. The type does not
have a default, and mntp is also used for ext2.
Rename the variable for the number of copies from `copy` to `copies`.
Use a `case` statement to match the type parameter, order the cases
alphabetically, and add a little sanity checking for good measure.
Use eval to make sure the output of commands is silenced rather than
the log messages when redirecting output to /dev/null.
Simplify cases where zfs requires special behavior.
Don't allow the test to loop forever in the event space usage does not
change. Bail out of the loop and fail after an arbitrary number of
iterations.
Add more information to the log message when the test fails, to help
debugging.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9286
Currently, the noop receive code fails to work with raw send streams
and resuming send streams. This happens because zfs_receive_impl()
reads the DRR_BEGIN payload without reading the payload itself.
Normally, the kernel expects to read this itself, but in this case
the recv_skip() code runs instead and it is not prepared to handle
the stream being left at any place other than the beginning of a
record.
This patch resolves this issue by manually reading the DRR_BEGIN
payload in the dry-run case. This patch also includes a number of
small fixups in this code path.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#9221Closes#9173
Remove a lot of unnecessary setting and incrementing of `i`.
Remove unused variable `j`.
Instead of calling out to Python in a loop to generate the same string
repeatedly, generate the string once using shell constructs before
entering the loop.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9284
md5sum in particular but also sha256sum to a lesser extent is used
in several areas of the test suite for computing checksums. The vast
majority of invocations are followed by `| awk '{ print $1 }'`.
Introduce functions to wrap up `md5sum $file | awk '{ print $1 }'` and
likewise for sha256sum. These also serve as a convenient interface for
alternative implementations on other platforms.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9280
TRUE and FALSE happen to be defined, but we should use B_TRUE and
B_FALSE for the sake of consistency.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9264
Account for ZFS_MAX_DATASET_NAME_LEN in kstat data size. This value
is ignored in the Linux kstat code but resolves the issue for other
platforms.
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes#9254Closes#9151
Create a larger file to extend the time required to perform the
removal. Occasional failures were observed due to the removal
completing before the cancel could be requested.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes#9259
When running on larger memory systems, we can overflow the value of
maxinflight. This can result in maxinflight having a value of 0 causing
the system to hang.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes#9272
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9251
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9250
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9249
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9247
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9246
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9244
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9243
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9242
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9240
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9237
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9248
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9241
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9239
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9238
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9236
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9235
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9234
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9233
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9232
Must use 'zfs' instead of '$ZFS' which is undefined.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes#9257
If a pool enables the SPACEMAP_HISTOGRAM feature shortly before being
exported, we can enter a situation that causes a kernel panic. Any metaslabs
that are loaded during the final dirty txg and haven't already been condensed
will cause metaslab_sync to proceed after the final dirty txg so that the
condense can be performed, which there are assertions to prevent. Because of
the nature of this issue, there are a number of ways we can enter this
state. Rather than try to prevent each of them one by one, potentially missing
some edge cases, we instead cut it off at the point of intersection; by
preventing metaslab_sync from proceeding if it would only do so to perform a
condense and we're past the final dirty txg, we preserve the utility of the
existing asserts while preventing this particular issue.
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#9185Closes#9186Closes#9231Closes#9253
Eliminate unnecessary code duplication. We can use a for-loop instead
of a while-loop. There is no need to echo $DISKSARRAY in a subshell or
return 0. Declare all variables with typeset.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9224
BSD getopt() and getopt_long() want options before arguments.
Reorder arguments to zfs/zpool in tests to put all the options first.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9228
For interrupt coalescing, cv_timedwait_hires() uses a 100us slack/delta
for calls to schedule_hrtimeout_range(). This 100us slack can be costly
for small writes.
This change improves small write performance by passing resolution `res`
parameter to schedule_hrtimeout_range() to be used as delta/slack. A new
tunable `spl_schedule_hrtimeout_slack_us` is added to preserve old
behavior when desired.
Performance observations on 8K recordsize filesystem:
- 8K random writes at 1-64 threads, up to 60% improvement for one thread
and smaller gains as thread count increases. At >64 threads, 2-5%
decrease in performance was observed.
- 8K sequential writes, similar 60% improvement for one thread and
leveling out around 64 threads. At >64 threads, 5-10% decrease in
performance was observed.
- 128K sequential write sees 1-5 for the 128K. No observed regression at
high thread count.
Testing done on Ubuntu 18.04 with 4.15 kernel, 8vCPUs and SSD storage on
VMware ESX.
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Tony Nguyen <tony.nguyen@delphix.com>
Closes#9217
Defining a special constant to make an infinite loop is excessive,
especially when the name clashes with symbols commonly defined on
some platforms (ie FreeBSD).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: John Kennedy <john.kennedy@delphix.com
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9219
Other than this test, zpool list -p is not well tested by any of the
automated tests. Add a test for zpool list -p.
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#9134
Split the arguments for ${TEST_RUNNER} across multiple lines for
clarity. Also added quotes in the message to match the invoked command.
Unquoted variables in argument lists are subject to splitting. In this
particular case we can't quote the variable because it is an optional
argument. Use the method suggested in the description linked below,
instead.
The technique is to use an unquoted variable with an alternate value.
https://github.com/koalaman/shellcheck/wiki/SC2086
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9212
Commit a887d653 updated the dbufstats such that escalated privileges
are required. Since all tests under cli_user are run with normal
privileges move this test case to a location where it will be run
required privileges.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9118Closes#9196
Document the ZFS_DKMS_ENABLE_DEBUGINFO option in the userland
configuration file, as done with the other ZFS_DKMS_* options.
It has been introduced with commit e45c1734a6 ("dkms: Enable
debuginfo option to be set with zfs sysconfig file") but isn't
mentioned anywhere other than the 'dkms.conf' file (generated).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mauricio Faria de Oliveira <mfo@canonical.com>
Closes#9191
Reuse enum value ZFS_IOC_BASE for `('Z' << 8)`.
This is helpful on FreeBSD where ZFS_IOC_BASE has a different value and
`('Z' << 8)` is wrong.
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9188
When checking ZFS_IOC_* numbers, print which numbers are wrong rather
than silently failing.
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9187
The ancient version of blkid (v2.17.2) used in CentOS 6 will not
detect the newly created pool unless it has been written to.
Force a pool sync so `zpool import` will detect the newly created
pool.
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9199
Split long lines where adding license info to dist archive.
Remove extra colon from target line.
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes#9189
$fs used with the wrong sed command where should be $mntpnt instead
to match a variable exported by read_mtab()
The fix is mostly to reuse the sed command found in read_mtab()
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Alexey Smirnoff <fling@member.fsf.org>
Closes#9168
There are two different deadlock scenarios, but they share a common
link, which is
thread 1 holding sa_lock and trying to get zap->zap_rwlock:
zap_lockdir_impl+0x858/0x16c0 [zfs]
zap_lockdir+0xd2/0x100 [zfs]
zap_lookup_norm+0x7f/0x100 [zfs]
zap_lookup+0x12/0x20 [zfs]
sa_setup+0x902/0x1380 [zfs]
zfsvfs_init+0x3d6/0xb20 [zfs]
zfsvfs_create+0x5dd/0x900 [zfs]
zfs_domount+0xa3/0xe20 [zfs]
and thread 2 trying to get sa_lock, either in sa_setup:
sa_setup+0x742/0x1380 [zfs]
zfsvfs_init+0x3d6/0xb20 [zfs]
zfsvfs_create+0x5dd/0x900 [zfs]
zfs_domount+0xa3/0xe20 [zfs]
or in sa_build_index:
sa_build_index+0x13d/0x790 [zfs]
sa_handle_get_from_db+0x368/0x500 [zfs]
zfs_znode_sa_init.isra.0+0x24b/0x330 [zfs]
zfs_znode_alloc+0x3da/0x1a40 [zfs]
zfs_zget+0x39a/0x6e0 [zfs]
zfs_root+0x101/0x160 [zfs]
zfs_domount+0x91f/0xea0 [zfs]
From there, there are different locking paths back to something
holding zap->zap_rwlock.
The deadlock scenarios involve multiple different ZFS filesystems
being mounted. sa_lock is common to these scenarios, and the sa
struct involved is private to a mount. Therefore, these must be
referring to different sa_lock instances and these deadlocks can't
occur in practice.
The fix, from Brian Behlendorf, is to remove sa_lock from lockdep
coverage by initializing it with MUTEX_NOLOCKDEP.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jeff Dike <jdike@akamai.com>
Closes#9110
Existing zfs initramfs script logic will attempt to set the 'noop'
scheduler if it's available on the vdev block devices. Newer kernels
have the similar 'none' scheduler on multiqueue devices; this change
alters the initramfs script logic to also attempt to set this scheduler
if it's available.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Garrett Fields <ghfields@gmail.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Colm Buckley <colm@tuatha.org>
Closes#9042
It used to be possible for zfs receive (and other operations related
to clone swap) to bypass refquotas. This can cause a number of issues,
and there should be an automated test for it.
Added tests for rollback and receive not overriding refquota.
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#9139
* contrib/initramfs: include /etc/default/zfs and /etc/zfs/zfs-functions
At least debian needs /etc/default/zfs and /etc/zfs/zfs-functions for
its initramfs. Include both in build when initramfs is configured.
* contrib/initramfs: include 60-zvol.rules and zvol_id
Include 60-zvol.rules and zvol_id and set udev as predependency instead
of debians zdev. This makes debians additional zdev hook unneeded.
* Correct initconfdir substitution for some distros
Not every Linux distro is using @sysconfdir@/default but @initconfdir@
which is already determined by configure. Let's use it.
* systemd: prevent possible conflict between systemd and sysvinit
Systemd will not load a sysvinit service if a unit exists with the same
name. This prevents conflicts between sysvinit and systemd.
In ZFS there is one sysvinit service that does not have a systemd
service but a target counterpart, zfs-import.target.
Usually it does not make any sense to install both but it is possisble.
Let's prevent any conflict by masking zfs-import.service by default.
This does not harm even if init.d/zfs-import does not exist.
Reviewed-by: Chris Wedgwood <cw@f00f.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Tested-by: Alex Ingram <reimu@reimuhakurei.net>
Tested-by: Dreamcat4 <dreamcat4@gmail.com>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes#7904Closes#9089
Even though the bug's writeup (Github issue #9136) is very detailed,
we still don't know exactly how we got to that state, thus I wasn't
able to reproduce the bug. That said, we can make an educated guess
combining the information on filled issue with the code.
From the fact that `dp_dirty_total` was 0 (which is less than
`zfs_dirty_data_max`) we know that there was one thread that set it to
0 and then signaled one of the waiters of `dp_spaceavail_cv` [see
`dsl_pool_dirty_delta()` which is also the only place that
`dp_dirty_total` is changed]. Thus, the only logical explaination
then for the bug being hit is that the waiter that just got awaken
didn't go through `dsl_pool_dirty_data()`. Given that this function
is only called by `dsl_pool_dirty_space()` or `dsl_pool_undirty_space()`
I can only think of two possible ways of the above scenario happening:
[1] The waiter didn't call into any of the two functions - which I
find highly unlikely (i.e. why wait on `dp_spaceavail_cv` to begin
with?).
[2] The waiter did call in one of the above function but it passed 0 as
the space/delta to be dirtied (or undirtied) and then the callee
returned immediately (e.g both `dsl_pool_dirty_space()` and
`dsl_pool_undirty_space()` return immediately when space is 0).
In any case and no matter how we got there, the easy fix would be to
just broadcast to all waiters whenever `dp_dirty_total` hits 0. That
said and given that we've never hit this before, it would make sense
to think more on why the above situation occured.
Attempting to mimic what Prakash was doing in the issue filed, I
created a dataset with `sync=always` and started doing contiguous
writes in a file within that dataset. I observed with DTrace that even
though we update the pool's dirty data accounting when we would dirty
stuff, the accounting wouldn't be decremented incrementally as we were
done with the ZIOs of those writes (the reason being that
`dbuf_write_physdone()` isn't be called as we go through the override
code paths, and thus `dsl_pool_undirty_space()` is never called). As a
result we'd have to wait until we get to `dsl_pool_sync()` where we
zero out all dirty data accounting for the pool and the current TXG's
metadata.
In addition, as Matt noted and I later verified, the same issue would
arise when using dedup.
In both cases (sync & dedup) we shouldn't have to wait until
`dsl_pool_sync()` zeros out the accounting data. According to the
comment in that part of the code, the reasons why we do the zeroing,
have nothing to do with what we observe:
````
/*
* We have written all of the accounted dirty data, so our
* dp_space_towrite should now be zero. However, some seldom-used
* code paths do not adhere to this (e.g. dbuf_undirty(), also
* rounding error in dbuf_write_physdone).
* Shore up the accounting of any dirtied space now.
*/
dsl_pool_undirty_space(dp, dp->dp_dirty_pertxg[txg & TXG_MASK], txg);
````
Ideally what we want to do is to undirty in the accounting exactly what
we dirty (I use the word ideally as we can still have rounding errors).
This would make the behavior of the system more clear and predictable.
Another interesting issue that I observed with DTrace was that we
wouldn't update any of the pool's dirty data accounting whenever we
would dirty and/or undirty MOS data. In addition, every time we would
change the size of a dbuf through `dbuf_new_size()` we wouldn't update
the accounted space dirtied in the appropriate dirty record, so when
ZIOs are done we would undirty less that we dirtied from the pool's
accounting point of view.
For the first two issues observed (sync & dedup) this patch ensures
that we still update the pool's accounting when we undirty data,
regardless of the write being physical or not.
For changes in the MOS, we first ensure to zero out the pool's dirty
data accounting in `dsl_pool_sync()` after we synced the MOS. Then we
can go ahead and enable the update of the pool's dirty data accounting
wheneve we change MOS data.
Another fix is that we now update the accounting explicitly for
counting errors in `dbuf_write_done()`.
Finally, `dbuf_new_size()` updates the accounted space of the
appropriate dirty record correctly now.
The problem is that we still don't know how the bug came up in the
issue filled. That said the issues fixed seem to be very relevant, so
instead of going with the broadcasting solution right away,
I decided to leave this patch as is.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
External-issue: DLPX-47285
Closes#9137