mirror_zfs/module/zfs
Serapheim Dimitropoulos dfa4d3d986 dmu_tx_wait() hang likely due to cv_signal() in dsl_pool_dirty_delta()
Even though the bug's writeup (GitHub issue #9136) is very detailed,
we still don't know exactly how we got to that state, thus I wasn't
able to reproduce the bug. That said, we can make an educated guess
by combining the information in the filed issue with the code.

From the fact that `dp_dirty_total` was 0 (which is less than
`zfs_dirty_data_max`) we know that there was one thread that set it to
0 and then signaled one of the waiters of `dp_spaceavail_cv` [see
`dsl_pool_dirty_delta()` which is also the only place that
`dp_dirty_total` is changed].  Thus, the only logical explanation
for the bug being hit is that the waiter that was just awakened
didn't go through `dsl_pool_dirty_delta()`. Given that this function
is only called by `dsl_pool_dirty_space()` or `dsl_pool_undirty_space()`,
I can only think of two possible ways of the above scenario happening:

[1] The waiter didn't call into either of the two functions - which I
    find highly unlikely (i.e. why wait on `dp_spaceavail_cv` to begin
    with?).
[2] The waiter did call into one of the above functions but it passed 0
    as the space/delta to be dirtied (or undirtied) and then the callee
    returned immediately (e.g. both `dsl_pool_dirty_space()` and
    `dsl_pool_undirty_space()` return immediately when space is 0).

In any case and no matter how we got there, the easy fix would be to
just broadcast to all waiters whenever `dp_dirty_total` hits 0. That
said, and given that we've never hit this before, it would make sense
to think more about why the above situation occurred.
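
To make scenario [2] and the broadcast idea concrete, here is a minimal,
single-threaded toy model of the wake-up chain. It is only a sketch: every
`toy_*` name is hypothetical, no real condition variable is involved, and it
assumes that each call adjusting the total wakes exactly one sleeper when the
total is below the limit (or all sleepers when broadcasting at 0).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy model of the wake-up chain; this is not OpenZFS code. */
typedef struct toy_pool {
	int64_t dirty_total;	/* stands in for dp_dirty_total */
	int64_t dirty_max;	/* stands in for zfs_dirty_data_max */
	int sleeping;		/* waiters still blocked on the condition */
	int runnable;		/* waiters that were woken but have not run */
} toy_pool_t;

/* Adjust the total, then wake one sleeper (or all, if broadcasting at 0). */
static void
toy_dirty_delta(toy_pool_t *p, int64_t delta, int broadcast_on_zero)
{
	p->dirty_total += delta;
	assert(p->dirty_total >= 0);
	if (p->dirty_total >= p->dirty_max || p->sleeping == 0)
		return;

	int woken = (broadcast_on_zero && p->dirty_total == 0) ?
	    p->sleeping : 1;
	p->sleeping -= woken;
	p->runnable += woken;
}

/* A woken waiter dirties `space` bytes and later undirties them. */
static void
toy_waiter_run(toy_pool_t *p, int64_t space, int broadcast_on_zero)
{
	assert(p->runnable > 0);
	p->runnable--;
	if (space == 0)
		return;		/* early return: the chain stops here */
	toy_dirty_delta(p, space, broadcast_on_zero);
	toy_dirty_delta(p, -space, broadcast_on_zero);
}

/* Returns how many waiters are still asleep once nothing is runnable. */
static int
toy_stuck_waiters(int nwaiters, int64_t space, int broadcast_on_zero)
{
	toy_pool_t p = { .dirty_total = 100, .dirty_max = 100,
	    .sleeping = nwaiters, .runnable = 0 };

	/* The total drops to 0 and the first wake-up is issued. */
	toy_dirty_delta(&p, -100, broadcast_on_zero);
	while (p.runnable > 0)
		toy_waiter_run(&p, space, broadcast_on_zero);
	return (p.sleeping);
}
```

In this model `toy_stuck_waiters(3, 0, 0)` leaves two waiters asleep even
though the total is 0, while waking everyone at 0 or having each waiter dirty
a non-zero amount drains all of them.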

Attempting to mimic what Prakash was doing in the issue filed, I
created a dataset with `sync=always` and started doing contiguous
writes in a file within that dataset. I observed with DTrace that even
though we update the pool's dirty data accounting when we would dirty
stuff, the accounting wouldn't be decremented incrementally as we were
done with the ZIOs of those writes (the reason being that
`dbuf_write_physdone()` isn't being called as we go through the override
code paths, and thus `dsl_pool_undirty_space()` is never called). As a
result we'd have to wait until we get to `dsl_pool_sync()` where we
zero out all dirty data accounting for the pool and the current TXG's
metadata.

In addition, as Matt noted and I later verified, the same issue would
arise when using dedup.

In both cases (sync & dedup) we shouldn't have to wait until
`dsl_pool_sync()` zeros out the accounting data. According to the
comment in that part of the code, the reasons why we do the zeroing
have nothing to do with what we observe:
````
/*
 * We have written all of the accounted dirty data, so our
 * dp_space_towrite should now be zero.  However, some seldom-used
 * code paths do not adhere to this (e.g. dbuf_undirty(), also
 * rounding error in dbuf_write_physdone).
 * Shore up the accounting of any dirtied space now.
 */
dsl_pool_undirty_space(dp, dp->dp_dirty_pertxg[txg & TXG_MASK], txg);
````

Ideally what we want to do is to undirty in the accounting exactly what
we dirty (I use the word ideally as we can still have rounding errors).
This would make the behavior of the system more clear and predictable.
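
The difference between undirtying incrementally and relying on the final
zeroing can be sketched with a small toy model of the per-TXG accounting.
All `toy_*` names are hypothetical, and the clamping in
`toy_undirty_space()` is an assumption of the sketch rather than a statement
about the real function.

```c
#include <assert.h>
#include <stdint.h>

#define	TOY_TXG_SIZE	4
#define	TOY_TXG_MASK	(TOY_TXG_SIZE - 1)

/* Hypothetical stand-in for the pool's dirty data accounting. */
typedef struct toy_acct {
	uint64_t dirty_total;
	uint64_t dirty_pertxg[TOY_TXG_SIZE];
} toy_acct_t;

static void
toy_dirty_space(toy_acct_t *a, uint64_t space, uint64_t txg)
{
	if (space == 0)
		return;
	a->dirty_pertxg[txg & TOY_TXG_MASK] += space;
	a->dirty_total += space;
}

static void
toy_undirty_space(toy_acct_t *a, uint64_t space, uint64_t txg)
{
	uint64_t *slot = &a->dirty_pertxg[txg & TOY_TXG_MASK];

	if (space == 0)
		return;
	if (space > *slot)
		space = *slot;	/* never undirty more than was accounted */
	*slot -= space;
	assert(a->dirty_total >= space);
	a->dirty_total -= space;
}

/*
 * Issue `nwrites` writes of `space` bytes in one TXG.  When
 * `undirty_on_done` is 0 the completion path skips the undirty step,
 * as observed for sync=always and dedup.  When `do_sync` is set, the
 * remainder of the TXG is zeroed at the end, like the shore-up step.
 * Returns the total still accounted as dirty.
 */
static uint64_t
toy_run(int nwrites, uint64_t space, int undirty_on_done, int do_sync)
{
	toy_acct_t a = { 0, { 0 } };
	const uint64_t txg = 5;

	for (int i = 0; i < nwrites; i++) {
		toy_dirty_space(&a, space, txg);
		if (undirty_on_done)
			toy_undirty_space(&a, space, txg);
	}
	if (do_sync)
		toy_undirty_space(&a, a.dirty_pertxg[txg & TOY_TXG_MASK], txg);
	return (a.dirty_total);
}
```

With eight 4096-byte writes, `toy_run(8, 4096, 0, 0)` still reports 32768
bytes as dirty until the sync step runs, whereas `toy_run(8, 4096, 1, 0)`
is already back to 0 before sync.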

Another interesting issue that I observed with DTrace was that we
wouldn't update any of the pool's dirty data accounting whenever we
would dirty and/or undirty MOS data. In addition, every time we would
change the size of a dbuf through `dbuf_new_size()` we wouldn't update
the accounted space dirtied in the appropriate dirty record, so when
ZIOs are done we would undirty less than we dirtied from the pool's
accounting point of view.

For the first two issues observed (sync & dedup) this patch ensures
that we still update the pool's accounting when we undirty data,
regardless of the write being physical or not.

For changes in the MOS, we first make sure to zero out the pool's dirty
data accounting in `dsl_pool_sync()` after we have synced the MOS. Then we
can go ahead and enable the update of the pool's dirty data accounting
whenever we change MOS data.

Another fix is that we now update the accounting explicitly for
counting errors in `dbuf_write_done()`.

Finally, `dbuf_new_size()` updates the accounted space of the
appropriate dirty record correctly now.
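
As a rough illustration of the resize mismatch, the sketch below models a
dirty record whose accounted size is (or is not) refreshed when the buffer
grows. The struct, its field names and the assumption that the pool is
charged for the growth at resize time are all hypothetical and only meant to
show the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical dirty record: not the real dbuf dirty record layout. */
typedef struct toy_dirty_record {
	uint64_t size;		/* current buffer size */
	uint64_t accounted;	/* bytes to undirty when the write is done */
} toy_dirty_record_t;

/*
 * Dirty a buffer of `osize` bytes, grow it to `nsize`, then complete the
 * write.  Returns the bytes the pool still counts as dirty afterwards.
 */
static uint64_t
toy_leftover_after_resize(uint64_t osize, uint64_t nsize, int update_record)
{
	uint64_t pool_dirty = 0;
	toy_dirty_record_t dr = { .size = osize, .accounted = osize };

	assert(nsize >= osize);
	pool_dirty += osize;		/* initial dirtying */

	/* Resize: assume the pool is charged for the growth. */
	pool_dirty += nsize - osize;
	dr.size = nsize;
	if (update_record)
		dr.accounted = nsize;	/* keep the record in step */

	/* Write completion: undirty what the record says was accounted. */
	pool_dirty -= dr.accounted;
	return (pool_dirty);
}
```

Growing a 512-byte buffer to 4096 bytes without refreshing the record leaves
3584 bytes accounted as dirty in this model; refreshing it leaves none.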

The problem is that we still don't know how the bug came up in the
issue filed. That said, the issues fixed seem to be very relevant, so
instead of going with the broadcasting solution right away,
I decided to leave this patch as is.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
External-issue: DLPX-47285
Closes #9137
2020-01-22 13:48:57 -08:00
abd.c single-chunk scatter ABDs can be treated as linear 2020-01-22 13:48:56 -08:00
aggsum.c OpenZFS 9688 - aggsum_fini leaks memory 2018-10-19 12:08:03 -07:00
arc.c single-chunk scatter ABDs can be treated as linear 2020-01-22 13:48:56 -08:00
blkptr.c Undo c89 workarounds to match with upstream 2017-11-04 13:25:13 -07:00
bplist.c Change KM_PUSHPAGE -> KM_SLEEP 2015-01-16 14:41:26 -08:00
bpobj.c Stack overflow in recursive bpobj_iterate_impl 2019-03-06 09:50:55 -08:00
bptree.c Native Encryption for ZFS on Linux 2017-08-14 10:36:48 -07:00
bqueue.c Wait in 'S' state when send/recv pipe is blocking 2019-06-07 12:45:40 -07:00
cityhash.c OpenZFS 8484 - Implement aggregate sum and use for arc counters 2018-06-06 09:35:59 -07:00
dataset_kstats.c port async unlinked drain from illumos-nexenta 2019-02-12 10:41:15 -08:00
dbuf_stats.c Prefix all refcount functions with zfs_ 2018-10-01 10:42:05 -07:00
dbuf.c dmu_tx_wait() hang likely due to cv_signal() in dsl_pool_dirty_delta() 2020-01-22 13:48:57 -08:00
ddt_zap.c fat zap should prefetch when iterating 2019-09-25 11:27:47 -07:00
ddt.c ztest: scrub ddt repair 2019-01-17 15:25:00 -08:00
dmu_diff.c Fix issues found with zfs diff 2018-05-01 11:24:20 -07:00
dmu_object.c Fix send/recv lost spill block 2019-05-07 15:18:44 -07:00
dmu_objset.c dmu_tx_wait() hang likely due to cv_signal() in dsl_pool_dirty_delta() 2020-01-22 13:48:57 -08:00
dmu_recv.c Always refuse receving non-resume stream when resume state exists 2019-09-25 11:27:51 -07:00
dmu_send.c Fix send/recv lost spill block 2019-05-07 15:18:44 -07:00
dmu_traverse.c Fix traverse_impl() kmem leak 2018-08-15 09:53:44 -07:00
dmu_tx.c Improve performance by using dmu_tx_hold_*_by_dnode() 2019-09-25 11:27:50 -07:00
dmu_zfetch.c Linux 5.2 compat: rw_tryupgrade() 2019-05-23 13:46:33 -07:00
dmu.c dmu_tx_wait() hang likely due to cv_signal() in dsl_pool_dirty_delta() 2020-01-22 13:48:57 -08:00
dnode_sync.c Reinstate raw receive check when truncating 2019-06-07 12:45:40 -07:00
dnode.c Assert that a dnode's bonuslen never exceeds its recorded size 2020-01-22 13:48:57 -08:00
dsl_bookmark.c Detect and prevent mixed raw and non-raw sends 2019-03-13 11:00:43 -07:00
dsl_crypt.c Remove VERIFY from dsl_dataset_crypt_stats() 2019-09-25 11:27:49 -07:00
dsl_dataset.c Mutex leak in dsl_dataset_hold_obj() 2019-03-21 10:36:58 -07:00
dsl_deadlist.c OpenZFS 7614, 9064 - zfs device evacuation/removal 2018-04-14 12:16:17 -07:00
dsl_deleg.c Update build system and packaging 2018-05-29 16:00:33 -07:00
dsl_destroy.c Ensure dsl_destroy_head() decrypts objsets 2019-09-25 11:27:49 -07:00
dsl_dir.c Fix TXG_MASK cstyle 2019-04-12 11:30:59 -07:00
dsl_pool.c dmu_tx_wait() hang likely due to cv_signal() in dsl_pool_dirty_delta() 2020-01-22 13:48:57 -08:00
dsl_prop.c Update build system and packaging 2018-05-29 16:00:33 -07:00
dsl_scan.c Disabled resilver_defer feature leads to looping resilvers 2019-09-25 11:27:51 -07:00
dsl_synctask.c OpenZFS 9425 - channel programs can be interrupted 2020-01-22 13:48:56 -08:00
dsl_userhold.c zfs should optionally send holds 2019-02-15 12:41:38 -08:00
edonr_zfs.c DLPX-44812 integrate EP-220 large memory scalability 2016-11-29 14:34:27 -08:00
fm.c Don't wakeup unnecessarily in 'zpool events -f' 2020-01-22 13:48:57 -08:00
gzip.c Update build system and packaging 2018-05-29 16:00:33 -07:00
hkdf.c Encryption patch follow-up 2017-10-11 16:54:48 -04:00
lz4.c Reword comment in lz4_compress_zfs 2019-05-02 16:46:04 -07:00
lzjb.c Change KM_PUSHPAGE -> KM_SLEEP 2015-01-16 14:41:26 -08:00
Makefile.in Add TRIM support 2019-03-29 09:13:20 -07:00
metaslab.c Don't activate metaslabs with weight 0 2020-01-22 13:48:57 -08:00
mmp.c MMP interval and fail_intervals in uberblock 2019-03-21 12:47:57 -07:00
multilist.c Avoid extra taskq_dispatch() calls by DMU 2019-09-25 11:27:48 -07:00
pathname.c Disable unused pathname::pn_path* (unneeded in Linux) 2019-09-25 11:27:49 -07:00
policy.c Implement secpolicy_vnode_setid_retain() 2019-09-25 11:27:50 -07:00
qat_compress.c QAT related bug fixes 2019-09-25 11:27:51 -07:00
qat_crypt.c QAT related bug fixes 2019-09-25 11:27:51 -07:00
qat.c QAT related bug fixes 2019-09-25 11:27:51 -07:00
qat.h QAT related bug fixes 2019-09-25 11:27:51 -07:00
range_tree.c Restrict kstats and print real pointers 2019-04-04 18:57:06 -07:00
refcount.c Prevent race in blkptr_verify against device removal 2020-01-22 13:48:57 -08:00
rrwlock.c Prefix all refcount functions with zfs_ 2018-10-01 10:42:05 -07:00
sa.c Improve performance by using dmu_tx_hold_*_by_dnode() 2019-09-25 11:27:50 -07:00
sha256.c SHA256 QAT acceleration 2018-03-15 10:53:58 -07:00
skein_zfs.c DLPX-44812 integrate EP-220 large memory scalability 2016-11-29 14:34:27 -08:00
spa_boot.c Add linux kernel module support 2010-08-31 13:41:58 -07:00
spa_checkpoint.c Get rid of space_map_update() for ms_synced_length 2019-02-12 10:38:11 -08:00
spa_config.c Fix /etc/hostid on root pool deadlock 2019-09-25 11:27:51 -07:00
spa_errlog.c Update build system and packaging 2018-05-29 16:00:33 -07:00
spa_history.c Create /proc/sys/kernel/spl/gitrev with git hash 2018-10-08 21:57:02 -07:00
spa_misc.c Prevent race in blkptr_verify against device removal 2020-01-22 13:48:57 -08:00
spa_stats.c Restrict kstats and print real pointers 2019-04-04 18:57:06 -07:00
spa.c spa_load_verify() may consume too much memory 2020-01-22 13:48:57 -08:00
space_map.c Restrict kstats and print real pointers 2019-04-04 18:57:06 -07:00
space_reftree.c OpenZFS 7614, 9064 - zfs device evacuation/removal 2018-04-14 12:16:17 -07:00
THIRDPARTYLICENSE.cityhash OpenZFS 8484 - Implement aggregate sum and use for arc counters 2018-06-06 09:35:59 -07:00
THIRDPARTYLICENSE.cityhash.descrip OpenZFS 8484 - Implement aggregate sum and use for arc counters 2018-06-06 09:35:59 -07:00
trace.c OpenZFS 7614, 9064 - zfs device evacuation/removal 2018-04-14 12:16:17 -07:00
txg.c OpenZFS 9425 - channel programs can be interrupted 2020-01-22 13:48:56 -08:00
uberblock.c MMP interval and fail_intervals in uberblock 2019-03-21 12:47:57 -07:00
unique.c Performance optimization of AVL tree comparator functions 2016-08-31 14:35:34 -07:00
vdev_cache.c Update build system and packaging 2018-05-29 16:00:33 -07:00
vdev_disk.c Scrubbing root pools may deadlock on kernels without elevator_change() (#9321) 2019-09-25 11:27:51 -07:00
vdev_file.c Update vdev_ops_t from illumos 2019-09-25 11:27:48 -07:00
vdev_indirect_births.c Fixes: #8934 Large kmem_alloc 2019-09-25 11:27:49 -07:00
vdev_indirect_mapping.c Get rid of space_map_update() for ms_synced_length 2019-02-12 10:38:11 -08:00
vdev_indirect.c Update vdev_ops_t from illumos 2019-09-25 11:27:48 -07:00
vdev_initialize.c Add TRIM support 2019-03-29 09:13:20 -07:00
vdev_label.c panic in removal_remap test on 4K devices 2019-09-25 11:27:47 -07:00
vdev_mirror.c Update vdev_ops_t from illumos 2019-09-25 11:27:48 -07:00
vdev_missing.c Update vdev_ops_t from illumos 2019-09-25 11:27:48 -07:00
vdev_queue.c Move write aggregation memory copy out of vq_lock 2019-09-25 11:27:47 -07:00
vdev_raidz_math_aarch64_neon_common.h Linux 5.0 compat: ASM_BUG macro 2019-05-08 10:18:40 -07:00
vdev_raidz_math_aarch64_neon.c codebase style improvements for OpenZFS 6459 port 2017-01-22 13:25:40 -08:00
vdev_raidz_math_aarch64_neonx2.c ABD raidz NEON support 2016-11-29 14:34:33 -08:00
vdev_raidz_math_avx2.c Linux 5.0 compat: ASM_BUG macro 2019-05-08 10:18:40 -07:00
vdev_raidz_math_avx512bw.c Linux 5.0 compat: ASM_BUG macro 2019-05-08 10:18:40 -07:00
vdev_raidz_math_avx512f.c Use cstyle -cpP in make cstyle check 2016-12-12 10:46:26 -08:00
vdev_raidz_math_impl.h codebase style improvements for OpenZFS 6459 port 2017-01-22 13:25:40 -08:00
vdev_raidz_math_scalar.c Linux 5.3: Fix switch() fall though compiler errors 2019-09-25 11:27:50 -07:00
vdev_raidz_math_sse2.c ABD raidz avx512f support 2016-11-29 14:34:33 -08:00
vdev_raidz_math_ssse3.c Linux 5.0 compat: ASM_BUG macro 2019-05-08 10:18:40 -07:00
vdev_raidz_math.c Fix typo in vdev_raidz_math.c 2019-09-25 11:27:47 -07:00
vdev_raidz.c Update vdev_ops_t from illumos 2019-09-25 11:27:48 -07:00
vdev_removal.c panic in removal_remap test on 4K devices 2019-09-25 11:27:47 -07:00
vdev_root.c Update vdev_ops_t from illumos 2019-09-25 11:27:48 -07:00
vdev_trim.c Add TRIM support 2019-03-29 09:13:20 -07:00
vdev.c Allow metaslab to be unloaded even when not freed from 2019-09-25 11:27:47 -07:00
zap_leaf.c Off-by-one in zap_leaf_array_create() 2019-01-18 09:58:46 -08:00
zap_micro.c fat zap should prefetch when iterating 2019-09-25 11:27:47 -07:00
zap.c fat zap should prefetch when iterating 2019-09-25 11:27:47 -07:00
zcp_get.c Fix get_special_prop() build failure 2019-09-25 11:27:49 -07:00
zcp_global.c OpenZFS 8600 - ZFS channel programs - snapshot 2018-02-08 15:29:24 -08:00
zcp_iter.c OpenZFS 9337 - zfs get all is slow due to uncached metadata 2018-07-12 10:49:27 -07:00
zcp_synctask.c OpenZFS 9166 - zfs storage pool checkpoint 2018-06-26 10:07:42 -07:00
zcp.c OpenZFS 9425 - channel programs can be interrupted 2020-01-22 13:48:56 -08:00
zfeature.c Consistently captialize GUID for features 2019-04-16 10:01:51 -07:00
zfs_acl.c Update build system and packaging 2018-05-29 16:00:33 -07:00
zfs_byteswap.c Update build system and packaging 2018-05-29 16:00:33 -07:00
zfs_ctldir.c Change boolean-like uint8_t fields in znode_t to boolean_t 2020-01-22 13:48:57 -08:00
zfs_debug.c Restrict kstats and print real pointers 2019-04-04 18:57:06 -07:00
zfs_dir.c port async unlinked drain from illumos-nexenta 2019-02-12 10:41:15 -08:00
zfs_fm.c Add zpool status -s (slow I/Os) and -p (parseable) 2018-11-08 16:47:24 -08:00
zfs_fuid.c Update build system and packaging 2018-05-29 16:00:33 -07:00
zfs_ioctl.c zfs_ioc_snapshot: check user-prop permissions on snapshotted datasets 2019-09-25 11:27:50 -07:00
zfs_log.c Improve write performance by using dmu_read_by_dnode() 2020-01-22 13:48:57 -08:00
zfs_onexit.c Update build system and packaging 2018-05-29 16:00:33 -07:00
zfs_ratelimit.c Change checksum & IO delay ratelimit values 2018-03-04 17:34:51 -08:00
zfs_replay.c Fix zil replay panic when TX_REMOVE followed by TX_CREATE 2019-09-25 11:27:51 -07:00
zfs_rlock.c OpenZFS 9689 - zfs range lock code should not be zpl-specific 2018-10-11 10:19:33 -07:00
zfs_sa.c Project Quota on ZFS 2018-02-13 14:54:54 -08:00
zfs_sysfs.c Prevent pointer to an out-of-scope local variable 2019-09-25 11:27:48 -07:00
zfs_vfsops.c Make txg_wait_synced conditional in zfsvfs_teardown 2020-01-22 13:48:57 -08:00
zfs_vnops.c Change boolean-like uint8_t fields in znode_t to boolean_t 2020-01-22 13:48:57 -08:00
zfs_znode.c Change boolean-like uint8_t fields in znode_t to boolean_t 2020-01-22 13:48:57 -08:00
zil.c make zil max block size tunable 2020-01-22 13:48:56 -08:00
zio_checksum.c Undo c89 workarounds to match with upstream 2017-11-04 13:25:13 -07:00
zio_compress.c OpenZFS 9403 - assertion failed in arc_buf_destroy() 2018-08-29 11:33:33 -07:00
zio_crypt.c Always call rw_init in zio_crypt_key_unwrap 2019-04-10 15:39:40 -07:00
zio_inject.c Multiple DVA Scrubbing Fix 2019-03-15 14:14:31 -07:00
zio.c Prevent race in blkptr_verify against device removal 2020-01-22 13:48:57 -08:00
zle.c Fix zle_decompress out of bound access 2018-02-09 10:08:05 -08:00
zpl_ctldir.c RHEL 7.5 compat: FMODE_KABI_ITERATE 2018-05-02 15:01:24 -07:00
zpl_export.c Use cstyle -cpP in make cstyle check 2016-12-12 10:46:26 -08:00
zpl_file.c Fix errant EFAULT during writes (#8719) 2019-05-08 10:04:04 -07:00
zpl_inode.c Fix errant EFAULT during writes (#8719) 2019-05-08 10:04:04 -07:00
zpl_super.c Fix statfs(2) for 32-bit user space 2018-09-24 17:11:25 -07:00
zpl_xattr.c Drop redundant POSIX ACL check in zpl_init_acl() 2019-09-25 11:27:49 -07:00
zrlock.c Update build system and packaging 2018-05-29 16:00:33 -07:00
zthr.c Fix txg_wait_open() load average inflation 2019-04-04 09:44:46 -07:00
zvol.c make zil max block size tunable 2020-01-22 13:48:56 -08:00