OpenZFS 8909 - 8585 can cause a use-after-free kernel panic

Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: John Kennedy <jwk404@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Prakash Surya <prakash.surya@delphix.com>

PROBLEM
=======

There's a race condition that exists if `zil_free_lwb` races with either
`zil_commit_waiter_timeout` and/or `zil_lwb_flush_vdevs_done`.

Here's an example panic due to this bug:

    > ::status
    debugging crash dump vmcore.0 (64-bit) from ip-10-110-205-40
    operating system: 5.11 dlpx-5.2.2.0_2017-12-04-17-28-32b6ba51fb (i86pc)
    image uuid: 4af0edfb-e58e-6ed8-cafc-d3e9167c7513
    panic message:
    BAD TRAP: type=e (#pf Page fault) rp=ffffff0010555970 addr=60 occurred in module "zfs" due to a NULL pointer dereference
    dump content: kernel pages only

    > $c
    zio_shrink+0x12()
    zil_lwb_write_issue+0x30d(ffffff03dcd15cc0, ffffff03e0730e20)
    zil_commit_waiter_timeout+0xa2(ffffff03dcd15cc0, ffffff03d97ffcf8)
    zil_commit_waiter+0xf3(ffffff03dcd15cc0, ffffff03d97ffcf8)
    zil_commit+0x80(ffffff03dcd15cc0, 9a9)
    zfs_write+0xc34(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
    fop_write+0x5b(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
    write+0x250(42, fffffd7ff4832000, 2000)
    sys_syscall+0x177()

If there's an outstanding lwb that's in `zil_commit_waiter_timeout`
waiting to timeout, waiting on it's waiter's CV, we must be sure not to
call `zil_free_lwb`. If we end up calling `zil_free_lwb`, then that LWB
may be freed and can result in a use-after-free situation where the
stale lwb pointer stored in the `zil_commit_waiter_t` structure of the
thread waiting on the waiter's CV is used.

A similar situation can occur if an lwb is issued to disk, and thus in
the `LWB_STATE_ISSUED` state, and `zil_free_lwb` is called while the
disk is servicing that lwb. In this situation, the lwb will be freed by
`zil_free_lwb`, which will result in a use-after-free situation when the
lwb's zio completes, and `zil_lwb_flush_vdevs_done` is called.

This race condition is prevented in `zil_close` by calling `zil_commit`
before `zil_free_lwb` is called, which will ensure all outstanding (i.e.
all lwb's in the `LWB_STATE_OPEN` and/or `LWB_STATE_ISSUED` states)
reach the `LWB_STATE_DONE` state before the lwb's are freed
(`zil_commit` will not return untill all the lwb's are
`LWB_STATE_DONE`).

Further, this race condition is prevented in `zil_sync` by only calling
`zil_free_lwb` for lwb's that do not have their `lwb_buf` pointer set.
All lwb's not in the `LWB_STATE_DONE` state will have a non-null value
for this pointer; the pointer is only cleared in
`zil_lwb_flush_vdevs_done`, at which point the lwb's state will be
changed to `LWB_STATE_DONE`.

This race *is* present in `zil_suspend`, leading to this bug.

At first glance, it would appear as though this would not be true
because `zil_suspend` will call `zil_commit`, just like `zil_close`, but
the problem is that `zil_suspend` will set the zilog's `zl_suspend`
field prior to calling `zil_commit`. Further, in `zil_commit`, if
`zl_suspend` is set, `zil_commit` will take a special branch of logic
and use `txg_wait_synced` instead of performing the normal `zil_commit`
logic.

This call to `txg_wait_synced` might be good enough for the data to
reach disk safely before it returns, but it does not ensure that all
outstanding lwb's reach the `LWB_STATE_DONE` state before it returns.
This is because, if there's an lwb "stuck" in
`zil_commit_waiter_timeout`, waiting for it's lwb to timeout, it will
maintain a non-null value for it's `lwb_buf` field and thus `zil_sync`
will not free that lwb. Thus, even though the lwb's data is already on
disk, the lwb will be left lingering, waiting on the CV, and will
eventually timeout and be issued to disk even though the write is
unnecessary.

So, after `zil_commit` is called from `zil_suspend`, we incorrectly
assume that there are not outstanding lwb's, and proceed to free all
lwb's found on the zilog's lwb list. As a result, we free the lwb that
will later be used `zil_commit_waiter_timeout`.

SOLUTION
========

The solution to this, is to ensure all outstanding lwb's complete before
calling `zil_free_lwb` via `zil_destroy` in `zil_suspend`. This patch
accomplishes this goal by forcing the normal `zil_commit` logic when
called from `zil_sync`.

Now, `zil_suspend` will call `zil_commit_impl` which will always use the
normal logic of waiting/issuing lwb's to disk before it returns. As a
result, any lwb's outstanding when `zil_commit_impl` is called will be
guaranteed to reach the `LWB_STATE_DONE` state by the time it returns.

Further, no new lwb's will be created via `zil_commit` since the zilog's
`zl_suspend` flag will be set. This will force all new callers of
`zil_commit` to use `txg_wait_synced` instead of creating and issuing
new lwb's.

Thus, all lwb's left on the zilog's lwb list when `zil_destroy` is
called will be in the `LWB_STATE_DONE` state, and we'll avoid this race
condition.

OpenZFS-issue: https://www.illumos.org/issues/8909
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ece62b6f8d
Closes #6940
This commit is contained in:
Prakash Surya 2017-12-07 11:26:32 -08:00 committed by Brian Behlendorf
parent 823d48bfb1
commit 2fe61a7ecc
9 changed files with 254 additions and 110 deletions

View File

@ -494,6 +494,7 @@ extern void zil_itx_destroy(itx_t *itx);
extern void zil_itx_assign(zilog_t *zilog, itx_t *itx, dmu_tx_t *tx); extern void zil_itx_assign(zilog_t *zilog, itx_t *itx, dmu_tx_t *tx);
extern void zil_commit(zilog_t *zilog, uint64_t oid); extern void zil_commit(zilog_t *zilog, uint64_t oid);
extern void zil_commit_impl(zilog_t *zilog, uint64_t oid);
extern int zil_vdev_offline(const char *osname, void *txarg); extern int zil_vdev_offline(const char *osname, void *txarg);
extern int zil_claim(struct dsl_pool *dp, extern int zil_claim(struct dsl_pool *dp,

View File

@ -36,13 +36,38 @@ extern "C" {
#endif #endif
/* /*
* Possbile states for a given lwb structure. An lwb will start out in * Possible states for a given lwb structure.
* the "closed" state, and then transition to the "opened" state via a *
* call to zil_lwb_write_open(). After the lwb is "open", it can * An lwb will start out in the "closed" state, and then transition to
* transition into the "issued" state via zil_lwb_write_issue(). After * the "opened" state via a call to zil_lwb_write_open(). When
* the lwb's zio completes, and the vdev's are flushed, the lwb will * transitioning from "closed" to "opened" the zilog's "zl_issuer_lock"
* transition into the "done" state via zil_lwb_write_done(), and the * must be held.
* structure eventually freed. *
* After the lwb is "opened", it can transition into the "issued" state
* via zil_lwb_write_issue(). Again, the zilog's "zl_issuer_lock" must
* be held when making this transition.
*
* After the lwb's zio completes, and the vdev's are flushed, the lwb
* will transition into the "done" state via zil_lwb_write_done(). When
* transitioning from "issued" to "done", the zilog's "zl_lock" must be
* held, *not* the "zl_issuer_lock".
*
* The zilog's "zl_issuer_lock" can become heavily contended in certain
* workloads, so we specifically avoid acquiring that lock when
* transitioning an lwb from "issued" to "done". This allows us to avoid
* having to acquire the "zl_issuer_lock" for each lwb ZIO completion,
* which would have added more lock contention on an already heavily
* contended lock.
*
* Additionally, correctness when reading an lwb's state is often
* achieved by exploiting the fact that these state transitions occur in
* this specific order; i.e. "closed" to "opened" to "issued" to "done".
*
* Thus, if an lwb is in the "closed" or "opened" state, holding the
* "zl_issuer_lock" will prevent a concurrent thread from transitioning
* that lwb to the "issued" state. Likewise, if an lwb is already in the
* "issued" state, holding the "zl_lock" will prevent a concurrent
* thread from transitioning that lwb to the "done" state.
*/ */
typedef enum { typedef enum {
LWB_STATE_CLOSED, LWB_STATE_CLOSED,

View File

@ -598,7 +598,6 @@ extern enum zio_checksum zio_checksum_dedup_select(spa_t *spa,
extern enum zio_compress zio_compress_select(spa_t *spa, extern enum zio_compress zio_compress_select(spa_t *spa,
enum zio_compress child, enum zio_compress parent); enum zio_compress child, enum zio_compress parent);
extern void zio_cancel(zio_t *zio);
extern void zio_suspend(spa_t *spa, zio_t *zio); extern void zio_suspend(spa_t *spa, zio_t *zio);
extern int zio_resume(spa_t *spa); extern int zio_resume(spa_t *spa);
extern void zio_resume_wait(spa_t *spa); extern void zio_resume_wait(spa_t *spa);

View File

@ -775,6 +775,21 @@ Disable pool import at module load by ignoring the cache file (typically \fB/etc
Use \fB1\fR for yes (default) and \fB0\fR for no. Use \fB1\fR for yes (default) and \fB0\fR for no.
.RE .RE
.sp
.ne 2
.na
\fBzfs_commit_timeout_pct\fR (int)
.ad
.RS 12n
This controls the amount of time that a ZIL block (lwb) will remain "open"
when it isn't "full", and it has a thread waiting for it to be committed to
stable storage. The timeout is scaled based on a percentage of the last lwb
latency to avoid significantly impacting the latency of each individual
transaction record (itx).
.sp
Default value: \fB5\fR.
.RE
.sp .sp
.ne 2 .ne 2
.na .na
@ -2120,7 +2135,7 @@ Default value: \fB0\fR.
.ad .ad
.RS 12n .RS 12n
The maximum number of taskq entries that are allowed to be cached. When this The maximum number of taskq entries that are allowed to be cached. When this
limit is exceeded itx's will be cleaned synchronously. limit is exceeded transaction records (itxs) will be cleaned synchronously.
.sp .sp
Default value: \fB1048576\fR. Default value: \fB1048576\fR.
.RE .RE

View File

@ -57,7 +57,7 @@
* *
* In the event of a crash or power loss, the itxs contained by each * In the event of a crash or power loss, the itxs contained by each
* dataset's on-disk ZIL will be replayed when that dataset is first * dataset's on-disk ZIL will be replayed when that dataset is first
* instantianted (e.g. if the dataset is a normal fileystem, when it is * instantiated (e.g. if the dataset is a normal fileystem, when it is
* first mounted). * first mounted).
* *
* As hinted at above, there is one ZIL per dataset (both the in-memory * As hinted at above, there is one ZIL per dataset (both the in-memory
@ -539,8 +539,8 @@ zil_alloc_lwb(zilog_t *zilog, blkptr_t *bp, boolean_t slog, uint64_t txg,
ASSERT(!MUTEX_HELD(&lwb->lwb_vdev_lock)); ASSERT(!MUTEX_HELD(&lwb->lwb_vdev_lock));
ASSERT(avl_is_empty(&lwb->lwb_vdev_tree)); ASSERT(avl_is_empty(&lwb->lwb_vdev_tree));
ASSERT(list_is_empty(&lwb->lwb_waiters)); VERIFY(list_is_empty(&lwb->lwb_waiters));
ASSERT(list_is_empty(&lwb->lwb_itxs)); VERIFY(list_is_empty(&lwb->lwb_itxs));
return (lwb); return (lwb);
} }
@ -550,40 +550,14 @@ zil_free_lwb(zilog_t *zilog, lwb_t *lwb)
{ {
ASSERT(MUTEX_HELD(&zilog->zl_lock)); ASSERT(MUTEX_HELD(&zilog->zl_lock));
ASSERT(!MUTEX_HELD(&lwb->lwb_vdev_lock)); ASSERT(!MUTEX_HELD(&lwb->lwb_vdev_lock));
ASSERT(list_is_empty(&lwb->lwb_waiters)); VERIFY(list_is_empty(&lwb->lwb_waiters));
VERIFY(list_is_empty(&lwb->lwb_itxs));
if (lwb->lwb_state == LWB_STATE_OPENED) {
avl_tree_t *t = &lwb->lwb_vdev_tree;
void *cookie = NULL;
zil_vdev_node_t *zv;
itx_t *itx;
while ((zv = avl_destroy_nodes(t, &cookie)) != NULL)
kmem_free(zv, sizeof (*zv));
while ((itx = list_head(&lwb->lwb_itxs)) != NULL) {
if (itx->itx_callback != NULL)
itx->itx_callback(itx->itx_callback_data);
list_remove(&lwb->lwb_itxs, itx);
zil_itx_destroy(itx);
}
ASSERT3P(lwb->lwb_root_zio, !=, NULL);
ASSERT3P(lwb->lwb_write_zio, !=, NULL);
zio_cancel(lwb->lwb_root_zio);
zio_cancel(lwb->lwb_write_zio);
lwb->lwb_root_zio = NULL;
lwb->lwb_write_zio = NULL;
} else {
ASSERT3S(lwb->lwb_state, !=, LWB_STATE_ISSUED);
}
ASSERT(avl_is_empty(&lwb->lwb_vdev_tree)); ASSERT(avl_is_empty(&lwb->lwb_vdev_tree));
ASSERT(list_is_empty(&lwb->lwb_itxs));
ASSERT3P(lwb->lwb_write_zio, ==, NULL); ASSERT3P(lwb->lwb_write_zio, ==, NULL);
ASSERT3P(lwb->lwb_root_zio, ==, NULL); ASSERT3P(lwb->lwb_root_zio, ==, NULL);
ASSERT3U(lwb->lwb_max_txg, <=, spa_syncing_txg(zilog->zl_spa));
ASSERT(lwb->lwb_state == LWB_STATE_CLOSED ||
lwb->lwb_state == LWB_STATE_DONE);
/* /*
* Clear the zilog's field to indicate this lwb is no longer * Clear the zilog's field to indicate this lwb is no longer
@ -708,7 +682,7 @@ zil_create(zilog_t *zilog)
/* /*
* If we just allocated the first log block, commit our transaction * If we just allocated the first log block, commit our transaction
* and wait for zil_sync() to stuff the block poiner into zh_log. * and wait for zil_sync() to stuff the block pointer into zh_log.
* (zh is part of the MOS, so we cannot modify it in open context.) * (zh is part of the MOS, so we cannot modify it in open context.)
*/ */
if (tx != NULL) { if (tx != NULL) {
@ -940,6 +914,12 @@ zil_commit_waiter_skip(zil_commit_waiter_t *zcw)
static void static void
zil_commit_waiter_link_lwb(zil_commit_waiter_t *zcw, lwb_t *lwb) zil_commit_waiter_link_lwb(zil_commit_waiter_t *zcw, lwb_t *lwb)
{ {
/*
* The lwb_waiters field of the lwb is protected by the zilog's
* zl_lock, thus it must be held when calling this function.
*/
ASSERT(MUTEX_HELD(&lwb->lwb_zilog->zl_lock));
mutex_enter(&zcw->zcw_lock); mutex_enter(&zcw->zcw_lock);
ASSERT(!list_link_active(&zcw->zcw_node)); ASSERT(!list_link_active(&zcw->zcw_node));
ASSERT3P(zcw->zcw_lwb, ==, NULL); ASSERT3P(zcw->zcw_lwb, ==, NULL);
@ -1319,7 +1299,7 @@ zil_lwb_write_issue(zilog_t *zilog, lwb_t *lwb)
* close. * close.
* - next find the maximum from the new suggested size and an array of * - next find the maximum from the new suggested size and an array of
* previous sizes. This lessens a picket fence effect of wrongly * previous sizes. This lessens a picket fence effect of wrongly
* guesssing the size if we have a stream of say 2k, 64k, 2k, 64k * guessing the size if we have a stream of say 2k, 64k, 2k, 64k
* requests. * requests.
* *
* Note we only write what is used, but we can't just allocate * Note we only write what is used, but we can't just allocate
@ -1422,8 +1402,10 @@ zil_lwb_commit(zilog_t *zilog, itx_t *itx, lwb_t *lwb)
* For more details, see the comment above zil_commit(). * For more details, see the comment above zil_commit().
*/ */
if (lrc->lrc_txtype == TX_COMMIT) { if (lrc->lrc_txtype == TX_COMMIT) {
mutex_enter(&zilog->zl_lock);
zil_commit_waiter_link_lwb(itx->itx_private, lwb); zil_commit_waiter_link_lwb(itx->itx_private, lwb);
itx->itx_private = NULL; itx->itx_private = NULL;
mutex_exit(&zilog->zl_lock);
return (lwb); return (lwb);
} }
@ -2056,16 +2038,54 @@ zil_process_commit_list(zilog_t *zilog)
list_remove(&zilog->zl_itx_commit_list, itx); list_remove(&zilog->zl_itx_commit_list, itx);
/*
* This is inherently racy and may result in us writing
* out a log block for a txg that was just synced. This
* is ok since we'll end cleaning up that log block the
* next time we call zil_sync().
*/
boolean_t synced = txg <= spa_last_synced_txg(spa); boolean_t synced = txg <= spa_last_synced_txg(spa);
boolean_t frozen = txg > spa_freeze_txg(spa); boolean_t frozen = txg > spa_freeze_txg(spa);
if (!synced || frozen) { /*
* If the txg of this itx has already been synced out, then
* we don't need to commit this itx to an lwb. This is
* because the data of this itx will have already been
* written to the main pool. This is inherently racy, and
* it's still ok to commit an itx whose txg has already
* been synced; this will result in a write that's
* unnecessary, but will do no harm.
*
* With that said, we always want to commit TX_COMMIT itxs
* to an lwb, regardless of whether or not that itx's txg
* has been synced out. We do this to ensure any OPENED lwb
* will always have at least one zil_commit_waiter_t linked
* to the lwb.
*
* As a counter-example, if we skipped TX_COMMIT itx's
* whose txg had already been synced, the following
* situation could occur if we happened to be racing with
* spa_sync:
*
* 1. We commit a non-TX_COMMIT itx to an lwb, where the
* itx's txg is 10 and the last synced txg is 9.
* 2. spa_sync finishes syncing out txg 10.
* 3. We move to the next itx in the list, it's a TX_COMMIT
* whose txg is 10, so we skip it rather than committing
* it to the lwb used in (1).
*
* If the itx that is skipped in (3) is the last TX_COMMIT
* itx in the commit list, than it's possible for the lwb
* used in (1) to remain in the OPENED state indefinitely.
*
* To prevent the above scenario from occurring, ensuring
* that once an lwb is OPENED it will transition to ISSUED
* and eventually DONE, we always commit TX_COMMIT itx's to
* an lwb here, even if that itx's txg has already been
* synced.
*
* Finally, if the pool is frozen, we _always_ commit the
* itx. The point of freezing the pool is to prevent data
* from being written to the main pool via spa_sync, and
* instead rely solely on the ZIL to persistently store the
* data; i.e. when the pool is frozen, the last synced txg
* value can't be trusted.
*/
if (frozen || !synced || lrc->lrc_txtype == TX_COMMIT) {
if (lwb != NULL) { if (lwb != NULL) {
lwb = zil_lwb_commit(zilog, itx, lwb); lwb = zil_lwb_commit(zilog, itx, lwb);
@ -2082,20 +2102,7 @@ zil_process_commit_list(zilog_t *zilog)
list_insert_tail(&nolwb_itxs, itx); list_insert_tail(&nolwb_itxs, itx);
} }
} else { } else {
/* ASSERT3S(lrc->lrc_txtype, !=, TX_COMMIT);
* If this is a commit itx, then there will be a
* thread that is either: already waiting for
* it, or soon will be waiting.
*
* This itx has already been committed to disk
* via spa_sync() so we don't bother committing
* it to an lwb. As a result, we cannot use the
* lwb zio callback to signal the waiter and
* mark it as done, so we must do that here.
*/
if (lrc->lrc_txtype == TX_COMMIT)
zil_commit_waiter_skip(itx->itx_private);
zil_itx_destroy(itx); zil_itx_destroy(itx);
} }
} }
@ -2141,37 +2148,37 @@ zil_process_commit_list(zilog_t *zilog)
* variable is in one of the following states: "closed" * variable is in one of the following states: "closed"
* or "open". * or "open".
* *
* If its "closed", then no itxs have been committed to * If it's "closed", then no itxs have been committed to
* it, so there's no point in issueing its zio (i.e. * it, so there's no point in issuing its zio (i.e. it's
* it's "empty"). * "empty").
* *
* If its "open" state, then it contains one or more * If it's "open", then it contains one or more itxs that
* itxs that eventually need to be committed to stable * eventually need to be committed to stable storage. In
* storage. In this case we intentionally do not issue * this case we intentionally do not issue the lwb's zio
* the lwb's zio to disk yet, and instead rely on one of * to disk yet, and instead rely on one of the following
* the following two mechanisms for issuing the zio: * two mechanisms for issuing the zio:
* *
* 1. Ideally, there will be more ZIL activity occuring * 1. Ideally, there will be more ZIL activity occurring
* on the system, such that this function will be * on the system, such that this function will be
* immeidately called again (not necessarily by the same * immediately called again (not necessarily by the same
* thread) and this lwb's zio will be issued via * thread) and this lwb's zio will be issued via
* zil_lwb_commit(). This way, the lwb is guaranteed to * zil_lwb_commit(). This way, the lwb is guaranteed to
* be "full" when it is issued to disk, and we'll make * be "full" when it is issued to disk, and we'll make
* use of the lwb's size the best we can. * use of the lwb's size the best we can.
* *
* 2. If there isn't sufficient ZIL activity occuring on * 2. If there isn't sufficient ZIL activity occurring on
* the system, such that this lwb's zio isn't issued via * the system, such that this lwb's zio isn't issued via
* zil_lwb_commit(), zil_commit_waiter() will issue the * zil_lwb_commit(), zil_commit_waiter() will issue the
* lwb's zio. If this occurs, the lwb is not guaranteed * lwb's zio. If this occurs, the lwb is not guaranteed
* to be "full" by the time its zio is issued, and means * to be "full" by the time its zio is issued, and means
* the size of the lwb was "too large" given the amount * the size of the lwb was "too large" given the amount
* of ZIL activity occuring on the system at that time. * of ZIL activity occurring on the system at that time.
* *
* We do this for a couple of reasons: * We do this for a couple of reasons:
* *
* 1. To try and reduce the number of IOPs needed to * 1. To try and reduce the number of IOPs needed to
* write the same number of itxs. If an lwb has space * write the same number of itxs. If an lwb has space
* available in it's buffer for more itxs, and more itxs * available in its buffer for more itxs, and more itxs
* will be committed relatively soon (relative to the * will be committed relatively soon (relative to the
* latency of performing a write), then it's beneficial * latency of performing a write), then it's beneficial
* to wait for these "next" itxs. This way, more itxs * to wait for these "next" itxs. This way, more itxs
@ -2193,7 +2200,7 @@ zil_process_commit_list(zilog_t *zilog)
* itxs will be processed. The assumption is the passed in waiter's * itxs will be processed. The assumption is the passed in waiter's
* commit itx will found in the queue just like the other non-commit * commit itx will found in the queue just like the other non-commit
* itxs, such that when the entire queue is processed, the waiter will * itxs, such that when the entire queue is processed, the waiter will
* have been commited to an lwb. * have been committed to an lwb.
* *
* The lwb associated with the passed in waiter is not guaranteed to * The lwb associated with the passed in waiter is not guaranteed to
* have been issued by the time this function completes. If the lwb is * have been issued by the time this function completes. If the lwb is
@ -2205,7 +2212,6 @@ zil_commit_writer(zilog_t *zilog, zil_commit_waiter_t *zcw)
{ {
ASSERT(!MUTEX_HELD(&zilog->zl_lock)); ASSERT(!MUTEX_HELD(&zilog->zl_lock));
ASSERT(spa_writeable(zilog->zl_spa)); ASSERT(spa_writeable(zilog->zl_spa));
ASSERT0(zilog->zl_suspend);
mutex_enter(&zilog->zl_issuer_lock); mutex_enter(&zilog->zl_issuer_lock);
@ -2265,7 +2271,7 @@ zil_commit_waiter_timeout(zilog_t *zilog, zil_commit_waiter_t *zcw)
* In order to call zil_lwb_write_issue() we must hold the * In order to call zil_lwb_write_issue() we must hold the
* zilog's "zl_issuer_lock". We can't simply acquire that lock, * zilog's "zl_issuer_lock". We can't simply acquire that lock,
* since we're already holding the commit waiter's "zcw_lock", * since we're already holding the commit waiter's "zcw_lock",
* and those two locks are aquired in the opposite order * and those two locks are acquired in the opposite order
* elsewhere. * elsewhere.
*/ */
mutex_exit(&zcw->zcw_lock); mutex_exit(&zcw->zcw_lock);
@ -2286,10 +2292,24 @@ zil_commit_waiter_timeout(zilog_t *zilog, zil_commit_waiter_t *zcw)
ASSERT3P(lwb, ==, zcw->zcw_lwb); ASSERT3P(lwb, ==, zcw->zcw_lwb);
/* /*
* We've already checked this above, but since we hadn't * We've already checked this above, but since we hadn't acquired
* acquired the zilog's zl_issuer_lock, we have to perform this * the zilog's zl_issuer_lock, we have to perform this check a
* check a second time while holding the lock. We can't call * second time while holding the lock.
*
* We don't need to hold the zl_lock since the lwb cannot transition
* from OPENED to ISSUED while we hold the zl_issuer_lock. The lwb
* _can_ transition from ISSUED to DONE, but it's OK to race with
* that transition since we treat the lwb the same, whether it's in
* the ISSUED or DONE states.
*
* The important thing, is we treat the lwb differently depending on
* if it's ISSUED or OPENED, and block any other threads that might
* attempt to issue this lwb. For that reason we hold the
* zl_issuer_lock when checking the lwb_state; we must not call
* zil_lwb_write_issue() if the lwb had already been issued. * zil_lwb_write_issue() if the lwb had already been issued.
*
* See the comment above the lwb_state_t structure definition for
* more details on the lwb states, and locking requirements.
*/ */
if (lwb->lwb_state == LWB_STATE_ISSUED || if (lwb->lwb_state == LWB_STATE_ISSUED ||
lwb->lwb_state == LWB_STATE_DONE) lwb->lwb_state == LWB_STATE_DONE)
@ -2315,7 +2335,7 @@ zil_commit_waiter_timeout(zilog_t *zilog, zil_commit_waiter_t *zcw)
* By having to issue the lwb's zio here, it means the size of the * By having to issue the lwb's zio here, it means the size of the
* lwb was too large, given the incoming throughput of itxs. By * lwb was too large, given the incoming throughput of itxs. By
* setting "zl_cur_used" to zero, we communicate this fact to the * setting "zl_cur_used" to zero, we communicate this fact to the
* block size selection algorithm, so it can take this informaiton * block size selection algorithm, so it can take this information
* into account, and potentially select a smaller size for the * into account, and potentially select a smaller size for the
* next lwb block that is allocated. * next lwb block that is allocated.
*/ */
@ -2379,7 +2399,6 @@ zil_commit_waiter(zilog_t *zilog, zil_commit_waiter_t *zcw)
ASSERT(!MUTEX_HELD(&zilog->zl_lock)); ASSERT(!MUTEX_HELD(&zilog->zl_lock));
ASSERT(!MUTEX_HELD(&zilog->zl_issuer_lock)); ASSERT(!MUTEX_HELD(&zilog->zl_issuer_lock));
ASSERT(spa_writeable(zilog->zl_spa)); ASSERT(spa_writeable(zilog->zl_spa));
ASSERT0(zilog->zl_suspend);
mutex_enter(&zcw->zcw_lock); mutex_enter(&zcw->zcw_lock);
@ -2459,7 +2478,7 @@ zil_commit_waiter(zilog_t *zilog, zil_commit_waiter_t *zcw)
* complete. * complete.
* *
* Additionally, if the lwb is NULL, the waiter * Additionally, if the lwb is NULL, the waiter
* will soon be signalled and marked done via * will soon be signaled and marked done via
* zil_clean() and zil_itxg_clean(), so no timeout * zil_clean() and zil_itxg_clean(), so no timeout
* is required. * is required.
*/ */
@ -2560,7 +2579,7 @@ zil_commit_itx_assign(zilog_t *zilog, zil_commit_waiter_t *zcw)
* When a thread calls zil_commit() a special "commit itx" will be * When a thread calls zil_commit() a special "commit itx" will be
* generated, along with a corresponding "waiter" for this commit itx. * generated, along with a corresponding "waiter" for this commit itx.
* zil_commit() will wait on this waiter's CV, such that when the waiter * zil_commit() will wait on this waiter's CV, such that when the waiter
* is marked done, and signalled, zil_commit() will return. * is marked done, and signaled, zil_commit() will return.
* *
* This commit itx is inserted into the queue of uncommitted itxs. This * This commit itx is inserted into the queue of uncommitted itxs. This
* provides an easy mechanism for determining which itxs were in the * provides an easy mechanism for determining which itxs were in the
@ -2570,7 +2589,7 @@ zil_commit_itx_assign(zilog_t *zilog, zil_commit_waiter_t *zcw)
* The commit it is special; it doesn't have any on-disk representation. * The commit it is special; it doesn't have any on-disk representation.
* When a commit itx is "committed" to an lwb, the waiter associated * When a commit itx is "committed" to an lwb, the waiter associated
* with it is linked onto the lwb's list of waiters. Then, when that lwb * with it is linked onto the lwb's list of waiters. Then, when that lwb
* completes, each waiter on the lwb's list is marked done and signalled * completes, each waiter on the lwb's list is marked done and signaled
* -- allowing the thread waiting on the waiter to return from zil_commit(). * -- allowing the thread waiting on the waiter to return from zil_commit().
* *
* It's important to point out a few critical factors that allow us * It's important to point out a few critical factors that allow us
@ -2578,10 +2597,10 @@ zil_commit_itx_assign(zilog_t *zilog, zil_commit_waiter_t *zcw)
* commit waiters, and zio completion callbacks like we're doing: * commit waiters, and zio completion callbacks like we're doing:
* *
* 1. The list of waiters for each lwb is traversed, and each commit * 1. The list of waiters for each lwb is traversed, and each commit
* waiter is marked "done" and signalled, in the zio completion * waiter is marked "done" and signaled, in the zio completion
* callback of the lwb's zio[*]. * callback of the lwb's zio[*].
* *
* * Actually, the waiters are signalled in the zio completion * * Actually, the waiters are signaled in the zio completion
* callback of the root zio for the DKIOCFLUSHWRITECACHE commands * callback of the root zio for the DKIOCFLUSHWRITECACHE commands
* that are sent to the vdevs upon completion of the lwb zio. * that are sent to the vdevs upon completion of the lwb zio.
* *
@ -2629,7 +2648,7 @@ zil_commit_itx_assign(zilog_t *zilog, zil_commit_waiter_t *zcw)
* the zilog's "zl_issuer_lock", which ensures only a single * the zilog's "zl_issuer_lock", which ensures only a single
* thread is creating and/or issuing lwb's at a time * thread is creating and/or issuing lwb's at a time
* 2. The "previous" lwb is a child of the "current" lwb * 2. The "previous" lwb is a child of the "current" lwb
* (leveraging the zio parent-child depenency graph) * (leveraging the zio parent-child dependency graph)
* *
* By relying on this parent-child zio relationship, we can have * By relying on this parent-child zio relationship, we can have
* many lwb zio's concurrently issued to the underlying storage, * many lwb zio's concurrently issued to the underlying storage,
@ -2662,7 +2681,7 @@ zil_commit(zilog_t *zilog, uint64_t foid)
* If the SPA is not writable, there should never be any * If the SPA is not writable, there should never be any
* pending itxs waiting to be committed to disk. If that * pending itxs waiting to be committed to disk. If that
* weren't true, we'd skip writing those itxs out, and * weren't true, we'd skip writing those itxs out, and
* would break the sematics of zil_commit(); thus, we're * would break the semantics of zil_commit(); thus, we're
* verifying that truth before we return to the caller. * verifying that truth before we return to the caller.
*/ */
ASSERT(list_is_empty(&zilog->zl_lwb_list)); ASSERT(list_is_empty(&zilog->zl_lwb_list));
@ -2684,6 +2703,12 @@ zil_commit(zilog_t *zilog, uint64_t foid)
return; return;
} }
zil_commit_impl(zilog, foid);
}
void
zil_commit_impl(zilog_t *zilog, uint64_t foid)
{
ZIL_STAT_BUMP(zil_commit_count); ZIL_STAT_BUMP(zil_commit_count);
/* /*
@ -3149,7 +3174,22 @@ zil_suspend(const char *osname, void **cookiep)
zilog->zl_suspending = B_TRUE; zilog->zl_suspending = B_TRUE;
mutex_exit(&zilog->zl_lock); mutex_exit(&zilog->zl_lock);
zil_commit(zilog, 0); /*
* We need to use zil_commit_impl to ensure we wait for all
* LWB_STATE_OPENED and LWB_STATE_ISSUED lwbs to be committed
* to disk before proceeding. If we used zil_commit instead, it
* would just call txg_wait_synced(), because zl_suspend is set.
* txg_wait_synced() doesn't wait for these lwb's to be
* LWB_STATE_DONE before returning.
*/
zil_commit_impl(zilog, 0);
/*
* Now that we've ensured all lwb's are LWB_STATE_DONE, we use
* txg_wait_synced() to ensure the data from the zilog has
* migrated to the main pool before calling zil_destroy().
*/
txg_wait_synced(zilog->zl_dmu_pool, 0);
zil_destroy(zilog, B_FALSE); zil_destroy(zilog, B_FALSE);
@ -3400,6 +3440,9 @@ EXPORT_SYMBOL(zil_set_sync);
EXPORT_SYMBOL(zil_set_logbias); EXPORT_SYMBOL(zil_set_logbias);
/* BEGIN CSTYLED */ /* BEGIN CSTYLED */
module_param(zfs_commit_timeout_pct, int, 0644);
MODULE_PARM_DESC(zfs_commit_timeout_pct, "ZIL block open timeout percentage");
module_param(zil_replay_disable, int, 0644); module_param(zil_replay_disable, int, 0644);
MODULE_PARM_DESC(zil_replay_disable, "Disable intent logging replay"); MODULE_PARM_DESC(zil_replay_disable, "Disable intent logging replay");

View File

@ -1987,20 +1987,6 @@ zio_reexecute(zio_t *pio)
} }
} }
void
zio_cancel(zio_t *zio)
{
/*
* Disallow cancellation of a zio that's already been issued.
*/
VERIFY3P(zio->io_executor, ==, NULL);
zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
zio->io_done = NULL;
zio_nowait(zio);
}
void void
zio_suspend(spa_t *spa, zio_t *zio) zio_suspend(spa_t *spa, zio_t *zio)
{ {

View File

@ -656,7 +656,8 @@ tags = ['functional', 'scrub_mirror']
tests = ['slog_001_pos', 'slog_002_pos', 'slog_003_pos', 'slog_004_pos', tests = ['slog_001_pos', 'slog_002_pos', 'slog_003_pos', 'slog_004_pos',
'slog_005_pos', 'slog_006_pos', 'slog_007_pos', 'slog_008_neg', 'slog_005_pos', 'slog_006_pos', 'slog_007_pos', 'slog_008_neg',
'slog_009_neg', 'slog_010_neg', 'slog_011_neg', 'slog_012_neg', 'slog_009_neg', 'slog_010_neg', 'slog_011_neg', 'slog_012_neg',
'slog_013_pos', 'slog_014_pos', 'slog_replay_fs', 'slog_replay_volume'] 'slog_013_pos', 'slog_014_pos', 'slog_015_neg', 'slog_replay_fs',
'slog_replay_volume']
tags = ['functional', 'slog'] tags = ['functional', 'slog']
[tests/functional/snapshot] [tests/functional/snapshot]

View File

@ -18,5 +18,6 @@ dist_pkgdata_SCRIPTS = \
slog_012_neg.ksh \ slog_012_neg.ksh \
slog_013_pos.ksh \ slog_013_pos.ksh \
slog_014_pos.ksh \ slog_014_pos.ksh \
slog_015_neg.ksh \
slog_replay_fs.ksh \ slog_replay_fs.ksh \
slog_replay_volume.ksh slog_replay_volume.ksh

View File

@ -0,0 +1,73 @@
#!/bin/ksh -p
#
# This file and its contents are supplied under the terms of the
# Common Development and Distribution License ("CDDL"), version 1.0.
# You may only use this file in accordance with the terms of version
# 1.0 of the CDDL.
#
# A full copy of the text of the CDDL should have accompanied this
# source. A copy of the CDDL is also available via the Internet at
# http://www.illumos.org/license/CDDL.
#
#
# Copyright (c) 2017 by Delphix. All rights reserved.
#
. $STF_SUITE/tests/functional/slog/slog.kshlib
#
# DESCRIPTION:
# Concurrent sync writes with log offline/online works.
#
# STRATEGY:
# 1. Configure "zfs_commit_timeout_pct"
# 2. Create pool with a log device.
# 3. Concurrently do the following:
# 3.1. Perform 8K sync writes
# 3.2. Perform log offline/online commands
# 4. Loop to test with growing "zfs_commit_timout_pct" values.
#
verify_runnable "global"
function cleanup
{
#
# Wait for any of the writes and/or zpool commands that were
# kicked off in the background to complete. On failure, we may
# enter this function without previously waiting for them.
#
wait
set_tunable64 zfs_commit_timeout_pct $ORIG_TIMEOUT
poolexists $TESTPOOL && zpool destroy -f $TESTPOOL
}
ORIG_TIMEOUT=$(get_tunable zfs_commit_timeout_pct | tail -1 | awk '{print $NF}')
log_onexit cleanup
for PCT in 0 1 2 4 8 16 32 64 128 256 512 1024; do
log_must set_tunable64 zfs_commit_timeout_pct $PCT
log_must zpool create $TESTPOOL $VDEV log $SDEV
for i in {1..10}; do
log_must fio --rw write --sync 1 --directory "/$TESTPOOL" \
--bs 8K --size 8K --name slog-test
done &
for i in {1..10}; do
log_must zpool offline $TESTPOOL $SDEV
log_must verify_slog_device $TESTPOOL $SDEV 'OFFLINE'
log_must zpool online $TESTPOOL $SDEV
log_must verify_slog_device $TESTPOOL $SDEV 'ONLINE'
done &
wait
log_must zpool destroy -f $TESTPOOL
done
log_pass "Concurrent writes with slog offline/online works."