mirror_zfs/module/zfs
Prakash Surya 2fe61a7ecc OpenZFS 8909 - 8585 can cause a use-after-free kernel panic
Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: John Kennedy <jwk404@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Prakash Surya <prakash.surya@delphix.com>

PROBLEM
=======

There's a race condition if `zil_free_lwb` races with
`zil_commit_waiter_timeout` and/or `zil_lwb_flush_vdevs_done`.

Here's an example panic due to this bug:

    > ::status
    debugging crash dump vmcore.0 (64-bit) from ip-10-110-205-40
    operating system: 5.11 dlpx-5.2.2.0_2017-12-04-17-28-32b6ba51fb (i86pc)
    image uuid: 4af0edfb-e58e-6ed8-cafc-d3e9167c7513
    panic message:
    BAD TRAP: type=e (#pf Page fault) rp=ffffff0010555970 addr=60 occurred in module "zfs" due to a NULL pointer dereference
    dump content: kernel pages only

    > $c
    zio_shrink+0x12()
    zil_lwb_write_issue+0x30d(ffffff03dcd15cc0, ffffff03e0730e20)
    zil_commit_waiter_timeout+0xa2(ffffff03dcd15cc0, ffffff03d97ffcf8)
    zil_commit_waiter+0xf3(ffffff03dcd15cc0, ffffff03d97ffcf8)
    zil_commit+0x80(ffffff03dcd15cc0, 9a9)
    zfs_write+0xc34(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
    fop_write+0x5b(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
    write+0x250(42, fffffd7ff4832000, 2000)
    sys_syscall+0x177()

If there's an outstanding lwb that's in `zil_commit_waiter_timeout`,
waiting to time out on its waiter's CV, we must be sure not to call
`zil_free_lwb`. If we end up calling `zil_free_lwb`, then that lwb may
be freed, resulting in a use-after-free when the thread waiting on the
waiter's CV uses the stale lwb pointer stored in its
`zil_commit_waiter_t` structure.

A similar situation can occur if an lwb is issued to disk, and thus in
the `LWB_STATE_ISSUED` state, and `zil_free_lwb` is called while the
disk is servicing that lwb. In this situation, the lwb will be freed by
`zil_free_lwb`, which will result in a use-after-free situation when the
lwb's zio completes, and `zil_lwb_flush_vdevs_done` is called.

This race condition is prevented in `zil_close` by calling `zil_commit`
before `zil_free_lwb` is called, which will ensure all outstanding lwb's
(i.e. all lwb's in the `LWB_STATE_OPEN` and/or `LWB_STATE_ISSUED`
states) reach the `LWB_STATE_DONE` state before the lwb's are freed
(`zil_commit` will not return until all the lwb's are
`LWB_STATE_DONE`).

Further, this race condition is prevented in `zil_sync` by only calling
`zil_free_lwb` for lwb's that do not have their `lwb_buf` pointer set.
All lwb's not in the `LWB_STATE_DONE` state will have a non-null value
for this pointer; the pointer is only cleared in
`zil_lwb_flush_vdevs_done`, at which point the lwb's state will be
changed to `LWB_STATE_DONE`.
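
The guard described above can be sketched as a small user-space model.
This is not the code in `zil.c`: the `model_*` helpers and the
simplified `lwb_t` (only `lwb_state`, `lwb_buf` and a hypothetical
`lwb_next` link) are assumptions made for illustration, as is the
choice to stop walking the list at the first lwb that still has a
buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-in for the real lwb_t; only the fields the text names. */
typedef enum { LWB_STATE_OPEN, LWB_STATE_ISSUED, LWB_STATE_DONE } lwb_state_t;

typedef struct lwb {
	lwb_state_t lwb_state;
	char *lwb_buf;		/* non-NULL until the flush-done step runs */
	struct lwb *lwb_next;	/* hypothetical list link for this model */
} lwb_t;

/* Hypothetical helper: allocate an lwb that still owns its buffer. */
static lwb_t *
model_lwb_alloc(lwb_state_t state)
{
	lwb_t *lwb = calloc(1, sizeof (*lwb));
	if (lwb == NULL)
		return (NULL);
	lwb->lwb_state = state;
	lwb->lwb_buf = malloc(16);
	return (lwb);
}

/* Models the completion step: clear lwb_buf, then mark the lwb DONE. */
static void
model_flush_done(lwb_t *lwb)
{
	free(lwb->lwb_buf);
	lwb->lwb_buf = NULL;
	lwb->lwb_state = LWB_STATE_DONE;
}

/*
 * Models the guard: free lwbs from the head of the list only while
 * lwb_buf is NULL. Returns the number of lwbs freed.
 */
static int
model_sync_free(lwb_t **head)
{
	int freed = 0;
	lwb_t *lwb;

	while ((lwb = *head) != NULL) {
		if (lwb->lwb_buf != NULL)
			break;	/* still open or in flight: must not free */
		assert(lwb->lwb_state == LWB_STATE_DONE);
		*head = lwb->lwb_next;
		free(lwb);
		freed++;
	}
	return (freed);
}
```

In this model an lwb in the `LWB_STATE_OPEN` or `LWB_STATE_ISSUED` state
is never freed, because `model_sync_free` only removes entries after
`model_flush_done` has cleared their `lwb_buf`.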

This race *is* present in `zil_suspend`, leading to this bug.

At first glance, it would appear as though this would not be true
because `zil_suspend` will call `zil_commit`, just like `zil_close`, but
the problem is that `zil_suspend` will set the zilog's `zl_suspend`
field prior to calling `zil_commit`. Further, in `zil_commit`, if
`zl_suspend` is set, `zil_commit` will take a special branch of logic
and use `txg_wait_synced` instead of performing the normal `zil_commit`
logic.

This call to `txg_wait_synced` might be good enough for the data to
reach disk safely before it returns, but it does not ensure that all
outstanding lwb's reach the `LWB_STATE_DONE` state before it returns.
This is because, if there's an lwb "stuck" in
`zil_commit_waiter_timeout`, waiting for its timeout to expire, it will
maintain a non-null value for its `lwb_buf` field and thus `zil_sync`
will not free that lwb. Thus, even though the lwb's data is already on
disk, the lwb will be left lingering, waiting on the CV, and will
eventually time out and be issued to disk even though the write is
unnecessary.

So, after `zil_commit` is called from `zil_suspend`, we incorrectly
assume that there are no outstanding lwb's, and proceed to free all
lwb's found on the zilog's lwb list. As a result, we free the lwb that
will later be used by `zil_commit_waiter_timeout`.

SOLUTION
========

The solution is to ensure all outstanding lwb's complete before
calling `zil_free_lwb` via `zil_destroy` in `zil_suspend`. This patch
accomplishes this goal by forcing the normal `zil_commit` logic when
called from `zil_suspend`.

Now, `zil_suspend` will call `zil_commit_impl` which will always use the
normal logic of waiting/issuing lwb's to disk before it returns. As a
result, any lwb's outstanding when `zil_commit_impl` is called will be
guaranteed to reach the `LWB_STATE_DONE` state by the time it returns.

Further, no new lwb's will be created via `zil_commit` since the zilog's
`zl_suspend` flag will be set. This will force all new callers of
`zil_commit` to use `txg_wait_synced` instead of creating and issuing
new lwb's.

Thus, all lwb's left on the zilog's lwb list when `zil_destroy` is
called will be in the `LWB_STATE_DONE` state, and we'll avoid this race
condition.
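
The control flow described in this section can be sketched as a toy
model. It is not the patch itself: the `model_*` functions, the
`outstanding_lwbs` and `txg_waits` counters, and the `use_fix` switch
are hypothetical names introduced only to contrast the behavior before
and after the change.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy zilog: only zl_suspend comes from the text; the rest is made up. */
typedef struct zilog {
	int zl_suspend;		/* > 0 once a suspend has started */
	int outstanding_lwbs;	/* hypothetical: lwbs not yet LWB_STATE_DONE */
	int txg_waits;		/* hypothetical: count of txg_wait_synced calls */
} zilog_t;

/* Stand-in for txg_wait_synced: data is synced, lwbs are left untouched. */
static void
model_txg_wait_synced(zilog_t *zilog)
{
	zilog->txg_waits++;
}

/* Normal commit logic: returns only once every outstanding lwb is DONE. */
static void
model_zil_commit_impl(zilog_t *zilog)
{
	assert(zilog->outstanding_lwbs >= 0);
	zilog->outstanding_lwbs = 0;
}

/* Public entry point: a suspended zilog takes the txg_wait_synced branch. */
static void
model_zil_commit(zilog_t *zilog)
{
	if (zilog->zl_suspend > 0) {
		model_txg_wait_synced(zilog);
		return;
	}
	model_zil_commit_impl(zilog);
}

/*
 * Models zil_suspend. Returns true if it is safe to free every lwb on
 * the list afterwards, i.e. none is still open or in flight.
 */
static bool
model_zil_suspend(zilog_t *zilog, bool use_fix)
{
	zilog->zl_suspend++;
	if (use_fix)
		model_zil_commit_impl(zilog);	/* after: forced normal path */
	else
		model_zil_commit(zilog);	/* before: special branch taken */
	return (zilog->outstanding_lwbs == 0);
}
```

With `use_fix` set to false the model leaves an lwb outstanding, which
corresponds to the lwb that is later freed while still in use. With
`use_fix` set to true nothing is outstanding when the function returns,
and later callers of `model_zil_commit` only reach
`model_txg_wait_synced`.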

OpenZFS-issue: https://www.illumos.org/issues/8909
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ece62b6f8d
Closes #6940
2017-12-28 10:18:04 -08:00
..
abd.c Update for cppcheck v1.80 2017-11-18 14:08:00 -08:00
arc.c Support re-prioritizing asynchronous prefetches 2017-12-21 09:13:06 -08:00
blkptr.c Undo c89 workarounds to match with upstream 2017-11-04 13:25:13 -07:00
bplist.c
bpobj.c Undo c89 workarounds to match with upstream 2017-11-04 13:25:13 -07:00
bptree.c Native Encryption for ZFS on Linux 2017-08-14 10:36:48 -07:00
bqueue.c Call cv_signal() with mutex held 2017-06-26 14:36:49 -07:00
dbuf_stats.c Improved dnode allocation and dmu_hold_impl() 2017-09-05 16:15:04 -07:00
dbuf.c Sequential scrub and resilvers 2017-11-15 17:27:01 -08:00
ddt_zap.c
ddt.c Sequential scrub and resilvers 2017-11-15 17:27:01 -08:00
dmu_diff.c
dmu_object.c Improved dnode allocation and dmu_hold_impl() 2017-09-05 16:15:04 -07:00
dmu_objset.c Long hold the dataset during upgrade 2017-11-10 13:37:10 -08:00
dmu_send.c Undo c89 workarounds to match with upstream 2017-11-04 13:25:13 -07:00
dmu_traverse.c Sequential scrub and resilvers 2017-11-15 17:27:01 -08:00
dmu_tx.c Call commit callbacks from the tail of the list 2017-12-22 10:19:51 -08:00
dmu_zfetch.c Undo c89 workarounds to match with upstream 2017-11-04 13:25:13 -07:00
dmu.c OpenZFS 8585 - improve batching done in zil_commit() 2017-12-05 09:39:16 -08:00
dnode_sync.c Undo c89 workarounds to match with upstream 2017-11-04 13:25:13 -07:00
dnode.c Undo c89 workarounds to match with upstream 2017-11-04 13:25:13 -07:00
dsl_bookmark.c Undo c89 workarounds to match with upstream 2017-11-04 13:25:13 -07:00
dsl_crypt.c Fix encryption root hierarchy issue 2017-11-08 15:25:30 -08:00
dsl_dataset.c Undo c89 workarounds to match with upstream 2017-11-04 13:25:13 -07:00
dsl_deadlist.c OpenZFS 5428 - provide fts(), reallocarray(), and strtonum() 2017-07-08 20:35:35 -07:00
dsl_deleg.c
dsl_destroy.c Undo c89 workarounds to match with upstream 2017-11-04 13:25:13 -07:00
dsl_dir.c Native Encryption for ZFS on Linux 2017-08-14 10:36:48 -07:00
dsl_pool.c Sequential scrub and resilvers 2017-11-15 17:27:01 -08:00
dsl_prop.c Native Encryption for ZFS on Linux 2017-08-14 10:36:48 -07:00
dsl_scan.c Support re-prioritizing asynchronous prefetches 2017-12-21 09:13:06 -08:00
dsl_synctask.c
dsl_userhold.c Undo c89 workarounds to match with upstream 2017-11-04 13:25:13 -07:00
edonr_zfs.c
fm.c Linux 4.14 compat: CONFIG_GCC_PLUGIN_RANDSTRUCT 2017-11-28 17:33:48 -06:00
gzip.c GZIP compression offloading with QAT accelerator 2017-03-22 17:58:47 -07:00
hkdf.c Encryption patch follow-up 2017-10-11 16:54:48 -04:00
lz4.c Fix LZ4_uncompress_unknownOutputSize caused panic 2017-05-19 13:45:46 -07:00
lzjb.c
Makefile.in Suppress incorrect objtool warnings 2017-12-07 10:28:50 -08:00
metaslab.c Sequential scrub and resilvers 2017-11-15 17:27:01 -08:00
mmp.c Reimplement vdev_random_leaf and rename it 2017-09-22 14:29:26 -07:00
multilist.c Undo c89 workarounds to match with upstream 2017-11-04 13:25:13 -07:00
pathname.c
policy.c codebase style improvements for OpenZFS 6459 port 2017-01-22 13:25:40 -08:00
qat_compress.c Bug fix in qat_compress.c when compressed size is < 4KB 2017-11-07 14:51:30 -08:00
qat_compress.h GZIP compression offloading with QAT accelerator 2017-03-22 17:58:47 -07:00
range_tree.c Sequential scrub and resilvers 2017-11-15 17:27:01 -08:00
refcount.c Linux 4.11 compat: avoid refcount_t name conflict 2017-02-28 16:10:18 -08:00
rrwlock.c
sa.c Fix coverity defects: CID 147474 2017-10-10 16:41:47 -07:00
sha256.c
skein_zfs.c
spa_boot.c
spa_config.c Undo c89 workarounds to match with upstream 2017-11-04 13:25:13 -07:00
spa_errlog.c Native Encryption for ZFS on Linux 2017-08-14 10:36:48 -07:00
spa_history.c Emit history events for 'zpool create' 2017-10-23 09:45:59 -07:00
spa_misc.c Unbreak the scan status ABI 2017-11-30 09:40:13 -08:00
spa_stats.c Update the default for zfs_txg_history 2017-09-29 15:58:52 -07:00
spa.c Fix multihost stale cache file import 2017-12-18 10:28:27 -08:00
space_map.c Undo c89 workarounds to match with upstream 2017-11-04 13:25:13 -07:00
space_reftree.c OpenZFS 6328 - Fix cstyle errors in zfs codebase 2017-01-12 09:42:11 -08:00
trace.c
txg.c OpenZFS 8585 - improve batching done in zil_commit() 2017-12-05 09:39:16 -08:00
uberblock.c Multi-modifier protection (MMP) 2017-07-13 13:54:00 -04:00
unique.c
vdev_cache.c Undo c89 workarounds to match with upstream 2017-11-04 13:25:13 -07:00
vdev_disk.c Fix printk() calls missing log level 2017-09-25 10:38:27 -07:00
vdev_file.c Skip spurious resilver IO on raidz vdev 2017-05-12 17:28:03 -07:00
vdev_label.c Undo c89 workarounds to match with upstream 2017-11-04 13:25:13 -07:00
vdev_mirror.c Linux 4.14 compat: CONFIG_GCC_PLUGIN_RANDSTRUCT 2017-11-28 17:33:48 -06:00
vdev_missing.c Skip spurious resilver IO on raidz vdev 2017-05-12 17:28:03 -07:00
vdev_queue.c Support re-prioritizing asynchronous prefetches 2017-12-21 09:13:06 -08:00
vdev_raidz_math_aarch64_neon_common.h
vdev_raidz_math_aarch64_neon.c codebase style improvements for OpenZFS 6459 port 2017-01-22 13:25:40 -08:00
vdev_raidz_math_aarch64_neonx2.c
vdev_raidz_math_avx2.c
vdev_raidz_math_avx512bw.c
vdev_raidz_math_avx512f.c
vdev_raidz_math_impl.h codebase style improvements for OpenZFS 6459 port 2017-01-22 13:25:40 -08:00
vdev_raidz_math_scalar.c
vdev_raidz_math_sse2.c
vdev_raidz_math_ssse3.c codebase style improvements for OpenZFS 6459 port 2017-01-22 13:25:40 -08:00
vdev_raidz_math.c codebase style improvements for OpenZFS 6459 port 2017-01-22 13:25:40 -08:00
vdev_raidz.c Linux 4.14 compat: CONFIG_GCC_PLUGIN_RANDSTRUCT 2017-11-28 17:33:48 -06:00
vdev_root.c Undo c89 workarounds to match with upstream 2017-11-04 13:25:13 -07:00
vdev.c Sequential scrub and resilvers 2017-11-15 17:27:01 -08:00
zap_leaf.c Use SET_ERROR for constant non-zero return codes 2017-08-02 21:16:12 -07:00
zap_micro.c Undo c89 workarounds to match with upstream 2017-11-04 13:25:13 -07:00
zap.c Sequential scrub and resilvers 2017-11-15 17:27:01 -08:00
zfeature.c Undo c89 workarounds to match with upstream 2017-11-04 13:25:13 -07:00
zfs_acl.c Linux 4.14 compat: CONFIG_GCC_PLUGIN_RANDSTRUCT 2017-11-28 17:33:48 -06:00
zfs_byteswap.c
zfs_ctldir.c Use SET_ERROR for constant non-zero return codes 2017-08-02 21:16:12 -07:00
zfs_debug.c Add line info and SET_ERROR() to ZFS debug log 2017-07-25 23:09:48 -07:00
zfs_dir.c Fix NFS sticky bit permission denied error 2017-12-04 11:55:57 -08:00
zfs_fm.c Undo c89 workarounds to match with upstream 2017-11-04 13:25:13 -07:00
zfs_fuid.c Rename zfs_sb_t -> zfsvfs_t 2017-03-10 09:51:33 -08:00
zfs_ioctl.c Long hold the dataset during upgrade 2017-11-10 13:37:10 -08:00
zfs_log.c OpenZFS 7578 - Fix/improve some aspects of ZIL writing 2017-06-09 09:15:37 -07:00
zfs_onexit.c
zfs_ratelimit.c Add libtpool (thread pools) 2017-08-09 15:31:08 -07:00
zfs_replay.c OpenZFS 8081 - Compiler warnings in zdb 2017-10-27 12:46:35 -07:00
zfs_rlock.c
zfs_sa.c Modifying XATTRs doesnt change the ctime 2017-09-13 12:20:07 -07:00
zfs_vfsops.c Fix 'zfs get {user|group}objused@' functionality 2017-11-29 11:59:22 -08:00
zfs_vnops.c OpenZFS 8585 - improve batching done in zil_commit() 2017-12-05 09:39:16 -08:00
zfs_znode.c misc: fix meaningless values 2017-09-19 12:19:08 -07:00
zil.c OpenZFS 8909 - 8585 can cause a use-after-free kernel panic 2017-12-28 10:18:04 -08:00
zio_checksum.c Undo c89 workarounds to match with upstream 2017-11-04 13:25:13 -07:00
zio_compress.c Undo c89 workarounds to match with upstream 2017-11-04 13:25:13 -07:00
zio_crypt.c Fix for #6714 2017-10-11 16:59:42 -04:00
zio_inject.c Undo c89 workarounds to match with upstream 2017-11-04 13:25:13 -07:00
zio.c OpenZFS 8909 - 8585 can cause a use-after-free kernel panic 2017-12-28 10:18:04 -08:00
zle.c
zpl_ctldir.c Linux 4.12 compat: CURRENT_TIME removed 2017-05-10 09:30:48 -07:00
zpl_export.c
zpl_file.c misc: fix meaningless values 2017-09-19 12:19:08 -07:00
zpl_inode.c Linux 4.12 compat: CURRENT_TIME removed 2017-05-10 09:30:48 -07:00
zpl_super.c Restructure mount option handling 2017-03-10 09:51:41 -08:00
zpl_xattr.c Update for cppcheck v1.80 2017-11-18 14:08:00 -08:00
zrlock.c Undo c89 workarounds to match with upstream 2017-11-04 13:25:13 -07:00
zvol.c OpenZFS 8585 - improve batching done in zil_commit() 2017-12-05 09:39:16 -08:00