2008-11-20 23:01:55 +03:00
/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or http://www.opensolaris.org/os/licensing.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 *
 * CDDL HEADER END
 */
/*
2010-05-29 00:45:14 +04:00
 * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
2019-06-10 21:48:42 +03:00
 * Copyright (c) 2012, 2018 by Delphix. All rights reserved.
2008-11-20 23:01:55 +03:00
 */

2010-05-29 00:45:14 +04:00
/* Portions Copyright 2010 Robert Milkowski */

2008-11-20 23:01:55 +03:00
#ifndef	_SYS_ZIL_H
#define	_SYS_ZIL_H

#include <sys/types.h>
#include <sys/spa.h>
#include <sys/zio.h>
#include <sys/dmu.h>
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decrypt, and authenticate protected datasets.
Each object set maintains a Merkle tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
#include <sys/zio_crypt.h>
2008-11-20 23:01:55 +03:00

#ifdef	__cplusplus
extern "C" {
#endif

2015-05-06 19:07:55 +03:00
struct dsl_pool;
struct dsl_dataset;
OpenZFS 8585 - improve batching done in zil_commit()
Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Prakash Surya <prakash.surya@delphix.com>
Problem
=======
The current implementation of zil_commit() can introduce significant
latency, beyond what is inherent due to the latency of the underlying
storage. The additional latency comes from two main problems:
1. When there's outstanding ZIL blocks being written (i.e. there's
already a "writer thread" in progress), then any new calls to
zil_commit() will block waiting for the currently outstanding ZIL
blocks to complete. The blocks written for each "writer thread" is
coined a "batch", and there can only ever be a single "batch" being
written at a time. When a batch is being written, any new ZIL
transactions will have to wait for the next batch to be written,
which won't occur until the current batch finishes.
As a result, the underlying storage may not be used as efficiently
as possible. While "new" threads enter zil_commit() and are blocked
waiting for the next batch, it's possible that the underlying
storage isn't fully utilized by the current batch of ZIL blocks. In
that case, it'd be better to allow these new threads to generate
(and issue) a new ZIL block, such that it could be serviced by the
underlying storage concurrently with the other ZIL blocks that are
being serviced.
2. Any call to zil_commit() must wait for all ZIL blocks in its "batch"
to complete, prior to zil_commit() returning. The size of any given
batch is proportional to the number of ZIL transactions in the queue
at the time that the batch starts processing the queue; which
doesn't occur until the previous batch completes. Thus, if there's a
lot of transactions in the queue, the batch could be composed of
many ZIL blocks, and each call to zil_commit() will have to wait for
all of these writes to complete (even if the thread calling
zil_commit() only cared about one of the transactions in the batch).
To further complicate the situation, these two issues result in the
following side effect:
3. If a given batch takes longer to complete than normal, this results
in larger batch sizes, which then take longer to complete and
further drive up the latency of zil_commit(). This can occur for a
number of reasons, including (but not limited to): transient changes
in the workload, and storage latency irregularities.
Solution
========
The solution attempted by this change has the following goals:
1. no on-disk changes; maintain current on-disk format.
2. modify the "batch size" to be equal to the "ZIL block size".
3. allow new batches to be generated and issued to disk, while there's
already batches being serviced by the disk.
4. allow zil_commit() to wait for as few ZIL blocks as possible.
5. use as few ZIL blocks as possible, for the same amount of ZIL
transactions, without introducing significant latency to any
individual ZIL transaction. i.e. use fewer, but larger, ZIL blocks.
In theory, with these goals met, the new algorithm will allow the
following improvements:
1. new ZIL blocks can be generated and issued, while there's already
outstanding ZIL blocks being serviced by the storage.
2. the latency of zil_commit() should be proportional to the underlying
storage latency, rather than the incoming synchronous workload.
Porting Notes
=============
Due to the changes made in commit 119a394ab0, the lifetime of an itx
structure differs from that in OpenZFS. Specifically, the itx structure is
kept around until the data associated with the itx is considered to be
safe on disk; this is so that the itx's callback can be called after the
data is committed to stable storage. Since OpenZFS doesn't have this itx
callback mechanism, it's able to destroy the itx structure immediately
after the itx is committed to an lwb (before the lwb is written to
disk).
To support this difference, and to ensure the itx's callbacks can still
be called after the itx's data is on disk, a few changes had to be made:
* A list of itxs was added to the lwb structure. This list contains
all of the itxs that have been committed to the lwb, such that the
callbacks for these itxs can be called from zil_lwb_flush_vdevs_done(),
after the data for the itxs is committed to disk.
* A list of itxs was added on the stack of the zil_process_commit_list()
function; the "nolwb_itxs" list. In some circumstances, an itx may
not be committed to an lwb (e.g. if allocating the "next" ZIL block
on disk fails), so this list is used to keep track of which itxs
fall into this state, such that their callbacks can be called after
the ZIL's writer pipeline is "stalled".
* The logic to actually call the itx's callback was moved into the
zil_itx_destroy() function. Since all consumers of zil_itx_destroy()
were effectively performing the same logic (i.e. if callback is
non-null, call the callback), it seemed like useful code cleanup to
consolidate this logic into a single function.
Additionally, the existing Linux tracepoint infrastructure dealing with
the ZIL's probes and structures had to be updated to reflect these code
changes. Specifically:
* The "zil__cw1" and "zil__cw2" probes were removed, so they had to be
removed from "trace_zil.h" as well.
* Some of the zilog structure's fields were removed, which affected
the tracepoint definitions of the structure.
* New tracepoints had to be added for the following 3 new probes:
* zil__process__commit__itx
* zil__process__normal__itx
* zil__commit__io__error
OpenZFS-issue: https://www.illumos.org/issues/8585
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5d95a3a
Closes #6566
2017-12-05 20:39:16 +03:00
struct lwb;
2015-05-06 19:07:55 +03:00
2008-11-20 23:01:55 +03:00
/*
 * Intent log format:
 *
 * Each objset has its own intent log. The log header (zil_header_t)
 * for objset N's intent log is kept in the Nth object of the SPA's
 * intent_log objset. The log header points to a chain of log blocks,
 * each of which contains log records (i.e., transactions) followed by
 * a log block trailer (zil_trailer_t). The format of a log record
 * depends on the record (or transaction) type, but all records begin
 * with a common structure that defines the type, length, and txg.
 */

/*
 * Intent log header - this on disk structure holds fields to manage
 * the log. All fields are 64 bit to easily handle cross architectures.
 */
typedef struct zil_header {
	uint64_t zh_claim_txg;	/* txg in which log blocks were claimed */
	uint64_t zh_replay_seq;	/* highest replayed sequence number */
	blkptr_t zh_log;	/* log chain */
2010-05-29 00:45:14 +04:00
	uint64_t zh_claim_blk_seq; /* highest claimed block sequence number */
2009-07-03 02:44:48 +04:00
	uint64_t zh_flags;	/* header flags */
2010-05-29 00:45:14 +04:00
	uint64_t zh_claim_lr_seq; /* highest claimed lr sequence number */
	uint64_t zh_pad[3];
2008-11-20 23:01:55 +03:00
} zil_header_t;

2009-07-03 02:44:48 +04:00
/*
 * zh_flags bit settings
 */
2010-05-29 00:45:14 +04:00
#define	ZIL_REPLAY_NEEDED	0x1	/* replay needed - internal only */
#define	ZIL_CLAIM_LR_SEQ_VALID	0x2	/* zh_claim_lr_seq field is valid */
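The flag bits above are meant to be tested against zh_flags before the fields they guard are trusted. A minimal sketch of that check follows, assuming a cut-down stand-in for the header; `sketch_zil_header_t` and both helper functions are hypothetical names for illustration and are not part of the ZIL interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define	ZIL_REPLAY_NEEDED	0x1	/* replay needed - internal only */
#define	ZIL_CLAIM_LR_SEQ_VALID	0x2	/* zh_claim_lr_seq field is valid */

/* Hypothetical stand-in: only the header fields this sketch reads. */
typedef struct sketch_zil_header {
	uint64_t zh_flags;
	uint64_t zh_claim_lr_seq;
} sketch_zil_header_t;

/* True when the header says the log still has to be replayed. */
static bool
sketch_replay_needed(const sketch_zil_header_t *zh)
{
	assert(zh != NULL);
	return ((zh->zh_flags & ZIL_REPLAY_NEEDED) != 0);
}

/* Return zh_claim_lr_seq only when the flag marks it valid, else 0. */
static uint64_t
sketch_claimed_lr_seq(const sketch_zil_header_t *zh)
{
	assert(zh != NULL);
	if (zh->zh_flags & ZIL_CLAIM_LR_SEQ_VALID)
		return (zh->zh_claim_lr_seq);
	return (0);
}
```

Treating an unflagged zh_claim_lr_seq as 0 is a choice made for this sketch only; the real consumers may handle that case differently.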
2009-07-03 02:44:48 +04:00
2008-11-20 23:01:55 +03:00
/*
2010-05-29 00:45:14 +04:00
 * Log block chaining.
 *
 * Log blocks are chained together. Originally they were chained at the
 * end of the block. For performance reasons the chain was moved to the
 * beginning of the block which allows writes for only the data being used.
2019-08-30 19:53:15 +03:00
 * The older position is supported for backwards compatibility.
2008-11-20 23:01:55 +03:00
 *
2010-05-29 00:45:14 +04:00
 * The zio_eck_t contains a zec_cksum which for the intent log is
2008-11-20 23:01:55 +03:00
 * the sequence number of this log block. A seq of 0 is invalid.
2010-05-29 00:45:14 +04:00
 * The zec_cksum is checked by the SPA against the sequence
2008-11-20 23:01:55 +03:00
 * number passed in the blk_cksum field of the blkptr_t
 */
2010-05-29 00:45:14 +04:00
typedef struct zil_chain {
	uint64_t zc_pad;
	blkptr_t zc_next_blk;	/* next block in chain */
	uint64_t zc_nused;	/* bytes in log block used */
	zio_eck_t zc_eck;	/* block trailer */
} zil_chain_t;
2008-11-20 23:01:55 +03:00

#define	ZIL_MIN_BLKSZ	4096ULL

2017-04-24 19:34:36 +03:00
/*
 * ziltest is by and large an ugly hack, but very useful in
 * checking replay without tedious work.
 * When running ziltest we want to keep all itx's and so maintain
 * a single list in the zl_itxg[] that uses a high txg: ZILTEST_TXG
 * We subtract TXG_CONCURRENT_STATES to allow for common code.
 */
#define	ZILTEST_TXG (UINT64_MAX - TXG_CONCURRENT_STATES)

2008-11-20 23:01:55 +03:00
/*
 * The words of a log block checksum.
 */
#define	ZIL_ZC_GUID_0	0
#define	ZIL_ZC_GUID_1	1
#define	ZIL_ZC_OBJSET	2
#define	ZIL_ZC_SEQ	3

typedef enum zil_create {
	Z_FILE,
	Z_DIR,
	Z_XATTRDIR,
} zil_create_t;

/*
 * Size of xvattr log section.
 * It's composed of lr_attr_t + xvattr bitmap + 2 64 bit timestamps
 * for create time and a single 64 bit integer for all of the attributes,
 * and 4 64 bit integers (32 bytes) for the scanstamp.
 *
 */

#define	ZIL_XVAT_SIZE(mapsize) \
	sizeof (lr_attr_t) + (sizeof (uint32_t) * (mapsize - 1)) + \
	(sizeof (uint64_t) * 7)

/*
 * Size of ACL in log. The ACE data is padded out to properly align
 * on 8 byte boundary.
 */

#define	ZIL_ACE_LENGTH(x)	(roundup(x, sizeof (uint64_t)))
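The two size macros above reduce to simple arithmetic: the xvattr section is the lr_attr_t header plus any extra 32-bit bitmap words plus seven 64-bit values (2 timestamps + 1 attribute word + 4 scanstamp words), and ACE data is rounded up to a multiple of 8 bytes. The sketch below works that out. It assumes lr_attr_t consists of two uint32_t fields (its definition is not part of this excerpt), supplies a local stand-in for roundup(), and wraps the macros in hypothetical helper functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of lr_attr_t: a mask size plus the first bitmap word. */
typedef struct {
	uint32_t lr_attr_masksize;	/* number of elements in array */
	uint32_t lr_attr_bitmap;	/* first entry of array */
} lr_attr_t;

/* Local stand-in for the system roundup() helper. */
#define	roundup(x, y)	((((x) + ((y) - 1)) / (y)) * (y))

#define	ZIL_XVAT_SIZE(mapsize) \
	sizeof (lr_attr_t) + (sizeof (uint32_t) * (mapsize - 1)) + \
	(sizeof (uint64_t) * 7)

#define	ZIL_ACE_LENGTH(x)	(roundup(x, sizeof (uint64_t)))

/* Bytes needed for an xvattr section with "mapsize" bitmap words. */
static size_t
sketch_xvat_size(uint32_t mapsize)
{
	assert(mapsize >= 1);	/* the first word lives inside lr_attr_t */
	return (ZIL_XVAT_SIZE(mapsize));
}

/* Bytes an ACL of "acl_bytes" occupies in the log after padding. */
static size_t
sketch_ace_length(size_t acl_bytes)
{
	return (ZIL_ACE_LENGTH(acl_bytes));
}
```

Under these assumptions a one-word bitmap needs 8 + 0 + 56 = 64 bytes, a three-word bitmap needs 8 + 8 + 56 = 72 bytes, and a 13-byte ACL is padded to 16 bytes.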

/*
 * Intent log transaction types and record structures
 */
2017-12-05 20:39:16 +03:00
#define	TX_COMMIT		0	/* Commit marker (no on-disk state) */
2008-11-20 23:01:55 +03:00
#define	TX_CREATE		1	/* Create file */
#define	TX_MKDIR		2	/* Make directory */
#define	TX_MKXATTR		3	/* Make XATTR directory */
#define	TX_SYMLINK		4	/* Create symbolic link to a file */
#define	TX_REMOVE		5	/* Remove file */
#define	TX_RMDIR		6	/* Remove directory */
#define	TX_LINK			7	/* Create hard link to a file */
#define	TX_RENAME		8	/* Rename a file */
#define	TX_WRITE		9	/* File write */
#define	TX_TRUNCATE		10	/* Truncate a file */
#define	TX_SETATTR		11	/* Set file attributes */
#define	TX_ACL_V0		12	/* Set old formatted ACL */
#define	TX_ACL			13	/* Set ACL */
#define	TX_CREATE_ACL		14	/* create with ACL */
#define	TX_CREATE_ATTR		15	/* create + attrs */
#define	TX_CREATE_ACL_ATTR	16	/* create with ACL + attrs */
#define	TX_MKDIR_ACL		17	/* mkdir with ACL */
#define	TX_MKDIR_ATTR		18	/* mkdir with attr */
#define	TX_MKDIR_ACL_ATTR	19	/* mkdir with ACL + attrs */
2010-05-29 00:45:14 +04:00
#define	TX_WRITE2		20	/* dmu_sync EALREADY write */
#define	TX_MAX_TYPE		21	/* Max transaction type */
2008-11-20 23:01:55 +03:00

/*
 * The transactions for mkdir, symlink, remove, rmdir, link, and rename
 * may have the following bit set, indicating the original request
 * specified case-insensitive handling of names.
 */
#define	TX_CI	((uint64_t)0x1 << 63) /* case-insensitive behavior requested */

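Because TX_CI shares the lrc_txtype word with the transaction type, a reader of the log has to mask the bit off before comparing against the TX_* values. A minimal sketch of that split is shown here; the helper names are hypothetical, and only the constants are taken from this header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define	TX_RENAME	8	/* Rename a file */
#define	TX_MAX_TYPE	21	/* Max transaction type */
#define	TX_CI	((uint64_t)0x1 << 63) /* case-insensitive behavior requested */

/* Strip the case-insensitive bit, leaving the plain TX_* type. */
static uint64_t
sketch_txtype_base(uint64_t lrc_txtype)
{
	uint64_t base = lrc_txtype & ~TX_CI;

	assert(base < TX_MAX_TYPE);
	return (base);
}

/* True when the original request asked for case-insensitive names. */
static bool
sketch_txtype_is_ci(uint64_t lrc_txtype)
{
	return ((lrc_txtype & TX_CI) != 0);
}
```

Placing the flag in bit 63 keeps it clear of the small type values, so one 64-bit field can carry both without changing the record layout.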
2010-05-29 00:45:14 +04:00
/*
 * Transactions for write, truncate, setattr, acl_v0, and acl can be logged
 * out of order. For convenience in the code, all such records must have
 * lr_foid at the same offset.
 */
#define	TX_OOO(txtype)			\
	((txtype) == TX_WRITE ||	\
	(txtype) == TX_TRUNCATE ||	\
	(txtype) == TX_SETATTR ||	\
	(txtype) == TX_ACL_V0 ||	\
	(txtype) == TX_ACL ||		\
	(txtype) == TX_WRITE2)

Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill blocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interface names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by a large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
/*
 * The number of dnode slots consumed by the object is stored in the 8
 * unused upper bits of the object ID. We subtract 1 from the value
 * stored on disk for compatibility with implementations that don't
 * support large dnodes. The slot count for a single-slot dnode will
 * contain 0 for those bits to preserve the log record format for
 * "small" dnodes.
 */
#define	LR_FOID_GET_SLOTS(oid) (BF64_GET((oid), 56, 8) + 1)
#define	LR_FOID_SET_SLOTS(oid, x) BF64_SET((oid), 56, 8, (x) - 1)
#define	LR_FOID_GET_OBJ(oid) BF64_GET((oid), 0, DN_MAX_OBJECT_SHIFT)
#define	LR_FOID_SET_OBJ(oid, x) BF64_SET((oid), 0, DN_MAX_OBJECT_SHIFT, (x))

2008-11-20 23:01:55 +03:00
/*
 * Format of log records.
 * The fields are carefully defined to allow them to be aligned
 * and sized the same on sparc & intel architectures.
 * Each log record has a common structure at the beginning.
 *
2010-08-27 01:24:34 +04:00
 * The log record on disk (lrc_seq) holds the sequence number of all log
 * records which is used to ensure we don't replay the same record.
2008-11-20 23:01:55 +03:00
 */
typedef struct {			/* common log record header */
	uint64_t	lrc_txtype;	/* intent log transaction type */
	uint64_t	lrc_reclen;	/* transaction record length */
	uint64_t	lrc_txg;	/* dmu transaction group number */
	uint64_t	lrc_seq;	/* see comment above */
} lr_t;

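Because every record starts with the same header and carries its own length in `lrc_reclen`, a reader can step through a buffer of records without knowing each record's type. The following is a minimal sketch under the assumption that records are laid out back to back in memory; `sketch_lr_t` and `sketch_count_records()` are hypothetical names and this is not the actual ZIL parsing code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirrors the four-field common header shown above. */
typedef struct {
	uint64_t lrc_txtype;
	uint64_t lrc_reclen;
	uint64_t lrc_txg;
	uint64_t lrc_seq;
} sketch_lr_t;

/*
 * Count the records packed back to back in buf[0..len). Returns -1 if a
 * record length is impossible (shorter than the header or past the end).
 */
static int
sketch_count_records(const unsigned char *buf, size_t len)
{
	size_t off = 0;
	int n = 0;

	while (off < len) {
		sketch_lr_t hdr;

		if (len - off < sizeof (hdr))
			return (-1);
		/* Copy the header out to avoid unaligned reads. */
		memcpy(&hdr, buf + off, sizeof (hdr));
		if (hdr.lrc_reclen < sizeof (hdr) ||
		    hdr.lrc_reclen > len - off)
			return (-1);
		off += (size_t)hdr.lrc_reclen;
		n++;
	}
	return (n);
}
```

Validating `lrc_reclen` before advancing matters in a sketch like this: a zero length would otherwise loop forever and an oversized one would read past the buffer.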
2010-05-29 00:45:14 +04:00
/*
 * Common start of all out-of-order record types (TX_OOO() above).
 */
typedef struct {
	lr_t		lr_common;	/* common portion of log record */
	uint64_t	lr_foid;	/* object id */
} lr_ooo_t;

2008-11-20 23:01:55 +03:00
/*
 * Handle optional extended vattr attributes.
 *
 * Whenever new attributes are added the version number
 * will need to be updated as will code in
 * zfs_log.c and zfs_replay.c
 */
typedef struct {
	uint32_t	lr_attr_masksize; /* number of elements in array */
	uint32_t	lr_attr_bitmap; /* First entry of array */
	/* remainder of array and any additional fields */
} lr_attr_t;

/*
 * log record for creates without optional ACL.
 * This log record does support optional xvattr_t attributes.
 */
typedef struct {
	lr_t		lr_common;	/* common portion of log record */
	uint64_t	lr_doid;	/* object id of directory */
	uint64_t	lr_foid;	/* object id of created file object */
	uint64_t	lr_mode;	/* mode of object */
	uint64_t	lr_uid;		/* uid of object */
	uint64_t	lr_gid;		/* gid of object */
	uint64_t	lr_gen;		/* generation (txg of creation) */
	uint64_t	lr_crtime[2];	/* creation time */
	uint64_t	lr_rdev;	/* rdev of object to create */
	/* name of object to create follows this */
	/* for symlinks, link content follows name */
	/* for creates with xvattr data, the name follows the xvattr info */
} lr_create_t;

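Several of these records end with a "name follows this" comment: the fixed-size struct is only the front of the record, and variable-length data sits immediately after it in the same allocation. A minimal sketch of that layout follows, using a cut-down record with only two of the fields; `sketch_lr_mini_t` and its helpers are hypothetical and ignore the xvattr and symlink cases mentioned above.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
	uint64_t lrc_txtype;
	uint64_t lrc_reclen;
	uint64_t lrc_txg;
	uint64_t lrc_seq;
} sketch_lr_t;

/* Fixed part of a create-like record; the name is stored right after it. */
typedef struct {
	sketch_lr_t lr_common;
	uint64_t lr_doid;
	uint64_t lr_foid;
	/* name follows */
} sketch_lr_mini_t;

/* Allocate a record with the NUL-terminated name appended; caller frees. */
static sketch_lr_mini_t *
sketch_mini_create(uint64_t doid, uint64_t foid, const char *name)
{
	size_t namelen = strlen(name) + 1;
	sketch_lr_mini_t *lr = calloc(1, sizeof (*lr) + namelen);

	if (lr == NULL)
		return (NULL);
	lr->lr_common.lrc_reclen = sizeof (*lr) + namelen;
	lr->lr_doid = doid;
	lr->lr_foid = foid;
	/* lr + 1 points at the first byte past the fixed-size struct. */
	memcpy((char *)(lr + 1), name, namelen);
	return (lr);
}

/* The trailing name starts where the fixed part ends. */
static const char *
sketch_mini_name(const sketch_lr_mini_t *lr)
{
	return ((const char *)(lr + 1));
}
```

Keeping the struct a multiple of 8 bytes (all `uint64_t` fields) means the trailing data starts at a predictable offset on every architecture, which fits the alignment note in the log record format comment.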
/*
 * FUID ACL record will be an array of ACEs from the original ACL.
 * If this array includes ephemeral IDs, the record will also include
 * an array of log-specific FUIDs to replace the ephemeral IDs.
 * Only one copy of each unique domain will be present, so the log-specific
 * FUIDs will use an index into a compressed domain table. On replay this
 * information will be used to construct real FUIDs (and bypass idmap,
 * since it may not be available).
 */

/*
 * Log record for creates with optional ACL
 * This log record is also used for recording any FUID
 * information needed for replaying the create. If the
 * file doesn't have any actual ACEs then the lr_aclcnt
 * would be zero.
2013-06-11 21:12:34 +04:00
 *
 * After lr_acl_flags, there are a lr_acl_bytes number of variable sized ace's.
 * If create is also setting xvattr's, then acl data follows xvattr.
 * If ACE FUIDs are needed then they will follow the xvattr_t. Following
 * the FUIDs will be the domain table information. The FUIDs for the owner
 * and group will be in lr_create. Name follows ACL data.
2008-11-20 23:01:55 +03:00
 */
typedef struct {
	lr_create_t	lr_create;	/* common create portion */
	uint64_t	lr_aclcnt;	/* number of ACEs in ACL */
	uint64_t	lr_domcnt;	/* number of unique domains */
	uint64_t	lr_fuidcnt;	/* number of real fuids */
	uint64_t	lr_acl_bytes;	/* number of bytes in ACL */
	uint64_t	lr_acl_flags;	/* ACL flags */
} lr_acl_create_t;

typedef struct {
	lr_t		lr_common;	/* common portion of log record */
	uint64_t	lr_doid;	/* obj id of directory */
	/* name of object to remove follows this */
} lr_remove_t;

typedef struct {
	lr_t		lr_common;	/* common portion of log record */
	uint64_t	lr_doid;	/* obj id of directory */
	uint64_t	lr_link_obj;	/* obj id of link */
	/* name of object to link follows this */
} lr_link_t;

typedef struct {
	lr_t		lr_common;	/* common portion of log record */
	uint64_t	lr_sdoid;	/* obj id of source directory */
	uint64_t	lr_tdoid;	/* obj id of target directory */
	/* 2 strings: names of source and destination follow this */
} lr_rename_t;

typedef struct {
	lr_t		lr_common;	/* common portion of log record */
	uint64_t	lr_foid;	/* file object to write */
	uint64_t	lr_offset;	/* offset to write to */
	uint64_t	lr_length;	/* user data length to write */
2010-05-29 00:45:14 +04:00
	uint64_t	lr_blkoff;	/* no longer used */
2008-11-20 23:01:55 +03:00
	blkptr_t	lr_blkptr;	/* spa block pointer for replay */
	/* write data will follow for small writes */
} lr_write_t;

typedef struct {
	lr_t		lr_common;	/* common portion of log record */
	uint64_t	lr_foid;	/* object id of file to truncate */
	uint64_t	lr_offset;	/* offset to truncate from */
	uint64_t	lr_length;	/* length to truncate */
} lr_truncate_t;

typedef struct {
	lr_t		lr_common;	/* common portion of log record */
	uint64_t	lr_foid;	/* file object to change attributes */
	uint64_t	lr_mask;	/* mask of attributes to set */
	uint64_t	lr_mode;	/* mode to set */
	uint64_t	lr_uid;		/* uid to set */
	uint64_t	lr_gid;		/* gid to set */
	uint64_t	lr_size;	/* size to set */
	uint64_t	lr_atime[2];	/* access time */
	uint64_t	lr_mtime[2];	/* modification time */
	/* optional attribute lr_attr_t may be here */
} lr_setattr_t;

typedef struct {
	lr_t		lr_common;	/* common portion of log record */
	uint64_t	lr_foid;	/* obj id of file */
	uint64_t	lr_aclcnt;	/* number of acl entries */
	/* lr_aclcnt number of ace_t entries follow this */
} lr_acl_v0_t;

typedef struct {
	lr_t		lr_common;	/* common portion of log record */
	uint64_t	lr_foid;	/* obj id of file */
	uint64_t	lr_aclcnt;	/* number of ACEs in ACL */
	uint64_t	lr_domcnt;	/* number of unique domains */
	uint64_t	lr_fuidcnt;	/* number of real fuids */
	uint64_t	lr_acl_bytes;	/* number of bytes in ACL */
	uint64_t	lr_acl_flags;	/* ACL flags */
	/* lr_acl_bytes number of variable sized ace's follows */
} lr_acl_t;

/*
 * ZIL structure definitions, interface function prototype and globals.
 */

/*
2009-07-03 02:44:48 +04:00
 * Writes are handled in three different ways:
 *
 * WR_INDIRECT:
 *    In this mode, if we need to commit the write later, then the block
 *    is immediately written into the file system (using dmu_sync),
 *    and a pointer to the block is put into the log record.
 *    When the txg commits the block is linked in.
 *    This saves additionally writing the data into the log record.
 *    There are a few requirements for this to occur:
 *	- write is greater than zfs/zvol_immediate_write_sz
 *	- not using slogs (as slogs are assumed to always be faster
 *	  than writing into the main pool)
 *	- the write occupies only one block
 * WR_COPIED:
 *    If we know we'll immediately be committing the
2019-11-21 20:32:57 +03:00
 *    transaction (O_SYNC or O_DSYNC), then we allocate a larger
2009-07-03 02:44:48 +04:00
 *    log record here for the data and copy the data in.
 * WR_NEED_COPY:
 *    Otherwise we don't allocate a buffer, and *if* we need to
 *    flush the write later then a buffer is allocated and
 *    we retrieve the data using the dmu.
2008-11-20 23:01:55 +03:00
 */
typedef enum {
	WR_INDIRECT,	/* indirect - a large write (dmu_sync() data */
			/* and put blkptr in log, rather than actual data) */
	WR_COPIED,	/* immediate - data is copied into lr_write_t */
	WR_NEED_COPY,	/* immediate - data needs to be copied if pushed */
2010-05-29 00:45:14 +04:00
	WR_NUM_STATES	/* number of states */
2008-11-20 23:01:55 +03:00
} itx_wr_state_t;

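The three-way choice described in the comment above `itx_wr_state_t` can be read as a small decision function. The sketch below is an assumption-laden simplification: `sketch_pick_wr_state()` and all of its parameters are hypothetical, and "occupies only one block" is approximated as "no longer than one block", ignoring where the write starts.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum {
	SK_WR_INDIRECT,		/* block written in place, pointer logged */
	SK_WR_COPIED,		/* data copied into the log record now */
	SK_WR_NEED_COPY		/* data fetched later, only if needed */
} sketch_wr_state_t;

/*
 * Pick a write state from the three conditions listed for WR_INDIRECT
 * and the "will be committed immediately" condition for WR_COPIED.
 */
static sketch_wr_state_t
sketch_pick_wr_state(size_t len, size_t immediate_write_sz, size_t blocksize,
    bool has_slog, bool sync_write)
{
	if (len > immediate_write_sz && !has_slog && len <= blocksize)
		return (SK_WR_INDIRECT);
	if (sync_write)
		return (SK_WR_COPIED);
	return (SK_WR_NEED_COPY);
}
```

The ordering reflects the trade-off in the comment: an indirect write avoids putting the data in the log at all, so it is checked first; copying eagerly only pays off when the caller is known to commit right away.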
Only commit the ZIL once in zpl_writepages() (msync() case).
Currently, using msync() results in the following code path:
sys_msync -> zpl_fsync -> filemap_write_and_wait_range -> zpl_writepages -> write_cache_pages -> zpl_putpage
In such a code path, zil_commit() is called as part of zpl_putpage().
This means that for each page, the write is handed to the DMU, the ZIL
is committed, and only then do we move on to the next page. As one might
imagine, this results in atrocious performance where there is a large
number of pages to write: instead of committing a batch of N writes,
we do N commits containing one page each. In some extreme cases this
can result in msync() being ~700 times slower than it should be, as well
as very inefficient use of ZIL resources.
This patch fixes this issue by making sure that the requested writes
are batched and then committed only once. Unfortunately, the
implementation is somewhat non-trivial because there is no way to run
write_cache_pages in SYNC mode (so that we get all pages) without
making it wait on the writeback tag for each page.
The solution implemented here is composed of two parts:
- I added a new callback system to the ZIL, which allows the caller to
be notified when its ITX gets written to stable storage. One nice
thing is that the callback is called not only in zil_commit() but
in zil_sync() as well, which means that the caller doesn't have to
care whether the write ended up in the ZIL or the DMU: it will get
notified as soon as it's safe, period. This is an improvement over
dmu_tx_callback_register() that was used previously, which only
supports DMU writes. The rationale for this change is to allow
zpl_putpage() to be notified when a ZIL commit is completed without
having to block on zil_commit() itself.
- zpl_writepages() now calls write_cache_pages in non-SYNC mode, which
will prevent (1) write_cache_pages from blocking, and (2) zpl_putpage
from issuing ZIL commits. zpl_writepages() will issue the commit
itself instead of relying on zpl_putpage() to do it, thus nicely
batching the writes. Note, however, that we still have to call
write_cache_pages() again in SYNC mode because there is an edge case
documented in the implementation of write_cache_pages() whereby it
will not give us all dirty pages when running in non-SYNC mode. Thus
we need to run it at least once in SYNC mode to make sure we honor
persistency guarantees. This only happens when the pages are
modified at the same time msync() is running, which should be rare.
In most cases there won't be any additional pages and this second
call will do nothing.
Note that this change also fixes a bug related to #907 whereby calling
msync() on pages that were already handed over to the DMU in a previous
writepages() call would make msync() block until the next TXG sync
instead of returning as soon as the ZIL commit is complete. The new
callback system fixes that problem.
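The cost difference between the old and new code paths comes down to how many commits are issued for N pages. A toy model makes that concrete; everything here (`sketch_log_t`, the two `sketch_writepages_*` functions) is hypothetical and only counts calls, it does not model the second SYNC-mode pass or the callbacks.

```c
#include <stddef.h>

typedef struct {
	size_t pages_written;
	size_t commits;
} sketch_log_t;

static void
sketch_write_page(sketch_log_t *log)
{
	log->pages_written++;
}

static void
sketch_commit(sketch_log_t *log)
{
	log->commits++;
}

/* Old behaviour: every page write is followed by its own commit. */
static void
sketch_writepages_per_page(sketch_log_t *log, size_t npages)
{
	for (size_t i = 0; i < npages; i++) {
		sketch_write_page(log);
		sketch_commit(log);
	}
}

/* New behaviour: hand over all pages first, then commit once. */
static void
sketch_writepages_batched(sketch_log_t *log, size_t npages)
{
	for (size_t i = 0; i < npages; i++)
		sketch_write_page(log);
	if (npages > 0)
		sketch_commit(log);
}
```

For 700 pages the first variant issues 700 commits and the second issues one, which is the shape of the slowdown the message describes.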
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1849
Closes #907
2013-11-10 19:00:11 +04:00
typedef void (*zil_callback_t)(void *data);

2008-11-20 23:01:55 +03:00
typedef struct itx {
	list_node_t	itx_node;	/* linkage on zl_itx_list */
	void		*itx_private;	/* type-specific opaque data */
	itx_wr_state_t	itx_wr_state;	/* write state */
	uint8_t		itx_sync;	/* synchronous transaction */
2013-11-10 19:00:11 +04:00
	zil_callback_t	itx_callback;	/* Called when the itx is persistent */
	void		*itx_callback_data; /* User data for the callback */
2017-12-04 22:44:39 +03:00
	size_t		itx_size;	/* allocated itx structure size */
2010-08-27 01:24:34 +04:00
	uint64_t	itx_oid;	/* object id */
2008-11-20 23:01:55 +03:00
	lr_t		itx_lr;		/* common part of log record */
	/* followed by type-specific part of lr_xx_t and its immediate data */
} itx_t;

2012-06-15 18:22:14 +04:00
/*
 * Used for zil kstat.
 */
typedef struct zil_stats {
	/*
	 * Number of times a ZIL commit (e.g. fsync) has been requested.
	 */
	kstat_named_t zil_commit_count;

	/*
	 * Number of times the ZIL has been flushed to stable storage.
	 * This is less than zil_commit_count when commits are "merged"
	 * (see the documentation above zil_commit()).
	 */
	kstat_named_t zil_commit_writer_count;

	/*
	 * Number of transactions (reads, writes, renames, etc.)
2019-08-30 19:53:15 +03:00
	 * that have been committed.
2012-06-15 18:22:14 +04:00
	 */
	kstat_named_t zil_itx_count;

	/*
	 * See the documentation for itx_wr_state_t above.
	 * Note that "bytes" accumulates the length of the transactions
	 * (i.e. data), not the actual log record sizes.
	 */
	kstat_named_t zil_itx_indirect_count;
	kstat_named_t zil_itx_indirect_bytes;
	kstat_named_t zil_itx_copied_count;
	kstat_named_t zil_itx_copied_bytes;
	kstat_named_t zil_itx_needcopy_count;
	kstat_named_t zil_itx_needcopy_bytes;

	/*
	 * Transactions which have been allocated to the "normal"
	 * (i.e. not slog) storage pool. Note that "bytes" accumulate
	 * the actual log record sizes - which do not include the actual
	 * data in case of indirect writes.
	 */
	kstat_named_t zil_itx_metaslab_normal_count;
	kstat_named_t zil_itx_metaslab_normal_bytes;

	/*
	 * Transactions which have been allocated to the "slog" storage pool.
	 * If there are no separate log devices, this is the same as the
	 * "normal" pool.
	 */
	kstat_named_t zil_itx_metaslab_slog_count;
	kstat_named_t zil_itx_metaslab_slog_bytes;
} zil_stats_t;

extern zil_stats_t zil_stats;

2013-11-01 23:26:11 +04:00
#define	ZIL_STAT_INCR(stat, val) \
2012-06-15 18:22:14 +04:00
    atomic_add_64(&zil_stats.stat.value.ui64, (val));
2013-11-01 23:26:11 +04:00
#define	ZIL_STAT_BUMP(stat) \
2012-06-15 18:22:14 +04:00
    ZIL_STAT_INCR(stat, 1);

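The two macros above token-paste a field name into an atomic add on a global statistics struct. The same pattern can be sketched portably with C11 atomics; `sketch_stats_t`, `SKETCH_STAT_INCR` and `SKETCH_STAT_BUMP` are hypothetical names, and `atomic_fetch_add` stands in for the kernel's `atomic_add_64()`.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical counters, standing in for the kstat-backed fields. */
typedef struct {
	_Atomic uint64_t commit_count;
	_Atomic uint64_t itx_copied_bytes;
} sketch_stats_t;

/* Static storage, so every counter starts at zero. */
static sketch_stats_t sketch_stats;

/* Atomically add val to the named counter; safe from multiple threads. */
#define	SKETCH_STAT_INCR(stat, val) \
	atomic_fetch_add(&sketch_stats.stat, (uint64_t)(val))

/* Atomically add one to the named counter. */
#define	SKETCH_STAT_BUMP(stat) \
	SKETCH_STAT_INCR(stat, 1)
```

The sketch leaves the trailing semicolon out of the macro bodies so that each macro behaves like an ordinary expression at the call site, for example inside an unbraced `if`/`else`. That is a design choice of the sketch, not a statement about how the header should be written.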
2020-10-09 19:34:54 +03:00
typedef int zil_parse_blk_func_t(zilog_t *zilog, const blkptr_t *bp, void *arg,
2008-11-20 23:01:55 +03:00
    uint64_t txg);
2020-10-09 19:34:54 +03:00
typedef int zil_parse_lr_func_t(zilog_t *zilog, const lr_t *lr, void *arg,
2008-11-20 23:01:55 +03:00
    uint64_t txg);
2017-10-27 22:46:35 +03:00
typedef int zil_replay_func_t(void *arg1, void *arg2, boolean_t byteswap);
OpenZFS 8585 - improve batching done in zil_commit()
Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Prakash Surya <prakash.surya@delphix.com>
Problem
=======
The current implementation of zil_commit() can introduce significant
latency, beyond what is inherent due to the latency of the underlying
storage. The additional latency comes from two main problems:
1. When there's outstanding ZIL blocks being written (i.e. there's
already a "writer thread" in progress), then any new calls to
zil_commit() will block waiting for the currently outstanding ZIL
blocks to complete. The blocks written for each "writer thread" are
coined a "batch", and there can only ever be a single "batch" being
written at a time. When a batch is being written, any new ZIL
transactions will have to wait for the next batch to be written,
which won't occur until the current batch finishes.
As a result, the underlying storage may not be used as efficiently
as possible. While "new" threads enter zil_commit() and are blocked
waiting for the next batch, it's possible that the underlying
storage isn't fully utilized by the current batch of ZIL blocks. In
that case, it'd be better to allow these new threads to generate
(and issue) a new ZIL block, such that it could be serviced by the
underlying storage concurrently with the other ZIL blocks that are
being serviced.
2. Any call to zil_commit() must wait for all ZIL blocks in its "batch"
to complete, prior to zil_commit() returning. The size of any given
batch is proportional to the number of ZIL transactions in the queue
at the time that the batch starts processing the queue; which
doesn't occur until the previous batch completes. Thus, if there's a
lot of transactions in the queue, the batch could be composed of
many ZIL blocks, and each call to zil_commit() will have to wait for
all of these writes to complete (even if the thread calling
zil_commit() only cared about one of the transactions in the batch).
To further complicate the situation, these two issues result in the
following side effect:
3. If a given batch takes longer to complete than normal, this results
in larger batch sizes, which then take longer to complete and
further drive up the latency of zil_commit(). This can occur for a
number of reasons, including (but not limited to): transient changes
in the workload, and storage latency irregularities.
Solution
========
The solution attempted by this change has the following goals:
1. no on-disk changes; maintain current on-disk format.
2. modify the "batch size" to be equal to the "ZIL block size".
3. allow new batches to be generated and issued to disk, while there's
already batches being serviced by the disk.
4. allow zil_commit() to wait for as few ZIL blocks as possible.
5. use as few ZIL blocks as possible, for the same amount of ZIL
transactions, without introducing significant latency to any
individual ZIL transaction. i.e. use fewer, but larger, ZIL blocks.
In theory, with these goals met, the new algorithm will allow the
following improvements:
1. new ZIL blocks can be generated and issued, while there's already
outstanding ZIL blocks being serviced by the storage.
2. the latency of zil_commit() should be proportional to the underlying
storage latency, rather than the incoming synchronous workload.
Porting Notes
=============
Due to the changes made in commit 119a394ab0, the lifetime of an itx
structure differs from that in OpenZFS. Specifically, the itx structure is
kept around until the data associated with the itx is considered to be
safe on disk; this is so that the itx's callback can be called after the
data is committed to stable storage. Since OpenZFS doesn't have this itx
callback mechanism, it's able to destroy the itx structure immediately
after the itx is committed to an lwb (before the lwb is written to
disk).
To support this difference, and to ensure the itx's callbacks can still
be called after the itx's data is on disk, a few changes had to be made:
* A list of itxs was added to the lwb structure. This list contains
all of the itxs that have been committed to the lwb, such that the
callbacks for these itxs can be called from zil_lwb_flush_vdevs_done(),
after the data for the itxs is committed to disk.
* A list of itxs was added on the stack of the zil_process_commit_list()
function; the "nolwb_itxs" list. In some circumstances, an itx may
not be committed to an lwb (e.g. if allocating the "next" ZIL block
on disk fails), so this list is used to keep track of which itxs
fall into this state, such that their callbacks can be called after
the ZIL's writer pipeline is "stalled".
* The logic to actually call the itx's callback was moved into the
zil_itx_destroy() function. Since all consumers of zil_itx_destroy()
were effectively performing the same logic (i.e. if callback is
non-null, call the callback), it seemed like useful code cleanup to
consolidate this logic into a single function.
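The last bullet, moving the "if the callback is non-null, call it" logic into the destroy function, can be sketched as follows. The `sketch_itx_*` names are hypothetical and the struct keeps only the two callback fields; the real `itx_t` and `zil_itx_destroy()` carry more state.

```c
#include <stdlib.h>

typedef void (*sketch_callback_t)(void *data);

typedef struct {
	sketch_callback_t cb;	/* called when the itx is persistent */
	void *cb_data;		/* user data for the callback */
} sketch_itx_t;

/* Allocate an itx with an optional callback; caller must destroy it. */
static sketch_itx_t *
sketch_itx_create(sketch_callback_t cb, void *cb_data)
{
	sketch_itx_t *itx = calloc(1, sizeof (*itx));

	if (itx != NULL) {
		itx->cb = cb;
		itx->cb_data = cb_data;
	}
	return (itx);
}

/* Single teardown path: notify the owner (if requested), then free. */
static void
sketch_itx_destroy(sketch_itx_t *itx)
{
	if (itx->cb != NULL)
		itx->cb(itx->cb_data);
	free(itx);
}

/* Example callback: bump an int counter passed as user data. */
static void
sketch_count_cb(void *data)
{
	(*(int *)data)++;
}
```

With one teardown function, every path that disposes of an itx, whether it reached an lwb or ended up on a list like "nolwb_itxs", fires the callback exactly once without repeating the null check at each call site.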
Additionally, the existing Linux tracepoint infrastructure dealing with
the ZIL's probes and structures had to be updated to reflect these code
changes. Specifically:
* The "zil__cw1" and "zil__cw2" probes were removed, so they had to be
removed from "trace_zil.h" as well.
* Some of the zilog structure's fields were removed, which affected
the tracepoint definitions of the structure.
* New tracepoints had to be added for the following 3 new probes:
* zil__process__commit__itx
* zil__process__normal__itx
* zil__commit__io__error
OpenZFS-issue: https://www.illumos.org/issues/8585
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5d95a3a
Closes #6566
2017-12-05 20:39:16 +03:00
typedef int zil_get_data_t(void *arg, lr_write_t *lr, char *dbuf,
    struct lwb *lwb, zio_t *zio);
2008-11-20 23:01:55 +03:00

2010-05-29 00:45:14 +04:00
extern int zil_parse(zilog_t *zilog, zil_parse_blk_func_t *parse_blk_func,
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decrypt, and authenticate protected datasets.
Each object set maintains a Merkle tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
    zil_parse_lr_func_t *parse_lr_func, void *arg, uint64_t txg,
    boolean_t decrypt);
2008-11-20 23:01:55 +03:00

extern void	zil_init(void);
extern void	zil_fini(void);

extern zilog_t	*zil_alloc(objset_t *os, zil_header_t *zh_phys);
extern void	zil_free(zilog_t *zilog);

extern zilog_t	*zil_open(objset_t *os, zil_get_data_t *get_data);
extern void	zil_close(zilog_t *zilog);

2009-01-16 00:59:39 +03:00
extern void	zil_replay(objset_t *os, void *arg,
2017-10-27 22:46:35 +03:00
    zil_replay_func_t *replay_func[TX_MAX_TYPE]);
2010-05-29 00:45:14 +04:00
extern boolean_t zil_replaying(zilog_t *zilog, dmu_tx_t *tx);
2008-11-20 23:01:55 +03:00
extern void	zil_destroy(zilog_t *zilog, boolean_t keep_first);
2012-12-15 04:13:40 +04:00
extern void	zil_destroy_sync(zilog_t *zilog, dmu_tx_t *tx);
2008-11-20 23:01:55 +03:00

extern itx_t	*zil_itx_create(uint64_t txtype, size_t lrsize);
2010-05-29 00:45:14 +04:00
extern void	zil_itx_destroy(itx_t *itx);
2010-08-27 01:24:34 +04:00
extern void	zil_itx_assign(zilog_t *zilog, itx_t *itx, dmu_tx_t *tx);
2008-11-20 23:01:55 +03:00

2019-10-11 01:39:44 +03:00
extern void	zil_async_to_sync(zilog_t *zilog, uint64_t oid);
2010-08-27 01:24:34 +04:00
extern void	zil_commit(zilog_t *zilog, uint64_t oid);
OpenZFS 8909 - 8585 can cause a use-after-free kernel panic
Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: John Kennedy <jwk404@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Prakash Surya <prakash.surya@delphix.com>
PROBLEM
=======
There's a race condition that exists if `zil_free_lwb` races with either
`zil_commit_waiter_timeout` and/or `zil_lwb_flush_vdevs_done`.
Here's an example panic due to this bug:
> ::status
debugging crash dump vmcore.0 (64-bit) from ip-10-110-205-40
operating system: 5.11 dlpx-5.2.2.0_2017-12-04-17-28-32b6ba51fb (i86pc)
image uuid: 4af0edfb-e58e-6ed8-cafc-d3e9167c7513
panic message:
BAD TRAP: type=e (#pf Page fault) rp=ffffff0010555970 addr=60 occurred in module "zfs" due to a NULL pointer dereference
dump content: kernel pages only
> $c
zio_shrink+0x12()
zil_lwb_write_issue+0x30d(ffffff03dcd15cc0, ffffff03e0730e20)
zil_commit_waiter_timeout+0xa2(ffffff03dcd15cc0, ffffff03d97ffcf8)
zil_commit_waiter+0xf3(ffffff03dcd15cc0, ffffff03d97ffcf8)
zil_commit+0x80(ffffff03dcd15cc0, 9a9)
zfs_write+0xc34(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
fop_write+0x5b(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
write+0x250(42, fffffd7ff4832000, 2000)
sys_syscall+0x177()
If there's an outstanding lwb that's in `zil_commit_waiter_timeout`
waiting to timeout, waiting on its waiter's CV, we must be sure not to
call `zil_free_lwb`. If we end up calling `zil_free_lwb`, then that LWB
may be freed and can result in a use-after-free situation where the
stale lwb pointer stored in the `zil_commit_waiter_t` structure of the
thread waiting on the waiter's CV is used.
A similar situation can occur if an lwb is issued to disk, and thus in
the `LWB_STATE_ISSUED` state, and `zil_free_lwb` is called while the
disk is servicing that lwb. In this situation, the lwb will be freed by
`zil_free_lwb`, which will result in a use-after-free situation when the
lwb's zio completes, and `zil_lwb_flush_vdevs_done` is called.
This race condition is prevented in `zil_close` by calling `zil_commit`
before `zil_free_lwb` is called, which will ensure all outstanding lwb's
(i.e. all lwb's in the `LWB_STATE_OPEN` and/or `LWB_STATE_ISSUED` states)
reach the `LWB_STATE_DONE` state before the lwb's are freed
(`zil_commit` will not return until all the lwb's are
`LWB_STATE_DONE`).
Further, this race condition is prevented in `zil_sync` by only calling
`zil_free_lwb` for lwb's that do not have their `lwb_buf` pointer set.
All lwb's not in the `LWB_STATE_DONE` state will have a non-null value
for this pointer; the pointer is only cleared in
`zil_lwb_flush_vdevs_done`, at which point the lwb's state will be
changed to `LWB_STATE_DONE`.
This race *is* present in `zil_suspend`, leading to this bug.
At first glance, it would appear as though this would not be true
because `zil_suspend` will call `zil_commit`, just like `zil_close`, but
the problem is that `zil_suspend` will set the zilog's `zl_suspend`
field prior to calling `zil_commit`. Further, in `zil_commit`, if
`zl_suspend` is set, `zil_commit` will take a special branch of logic
and use `txg_wait_synced` instead of performing the normal `zil_commit`
logic.
This call to `txg_wait_synced` might be good enough for the data to
reach disk safely before it returns, but it does not ensure that all
outstanding lwb's reach the `LWB_STATE_DONE` state before it returns.
This is because, if there's an lwb "stuck" in
`zil_commit_waiter_timeout`, waiting for its lwb to time out, it will
maintain a non-null value for its `lwb_buf` field and thus `zil_sync`
will not free that lwb. Thus, even though the lwb's data is already on
disk, the lwb will be left lingering, waiting on the CV, and will
eventually time out and be issued to disk even though the write is
unnecessary.
So, after `zil_commit` is called from `zil_suspend`, we incorrectly
assume that there are no outstanding lwb's, and proceed to free all
lwb's found on the zilog's lwb list. As a result, we free the lwb that
will later be used by `zil_commit_waiter_timeout`.
SOLUTION
========
The solution to this is to ensure all outstanding lwb's complete before
calling `zil_free_lwb` via `zil_destroy` in `zil_suspend`. This patch
accomplishes this goal by forcing the normal `zil_commit` logic when
called from `zil_suspend`.
Now, `zil_suspend` will call `zil_commit_impl` which will always use the
normal logic of waiting/issuing lwb's to disk before it returns. As a
result, any lwb's outstanding when `zil_commit_impl` is called will be
guaranteed to reach the `LWB_STATE_DONE` state by the time it returns.
Further, no new lwb's will be created via `zil_commit` since the zilog's
`zl_suspend` flag will be set. This will force all new callers of
`zil_commit` to use `txg_wait_synced` instead of creating and issuing
new lwb's.
Thus, all lwb's left on the zilog's lwb list when `zil_destroy` is
called will be in the `LWB_STATE_DONE` state, and we'll avoid this race
condition.
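The control flow described above can be sketched roughly as follows. This is a
simplified, self-contained model and not the actual ZFS source: the
`mock_zilog_t` type, its `outstanding_lwbs` and `txg_waits` counters, and all
`mock_` functions are stand-ins invented for illustration. Only the
`zl_suspend` field name and the split between `zil_commit` and
`zil_commit_impl` come from the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for zilog_t; only zl_suspend mirrors a real field. */
typedef struct mock_zilog {
	int zl_suspend;		/* non-zero while the ZIL is suspended */
	int outstanding_lwbs;	/* lwb's not yet LWB_STATE_DONE (model only) */
	int txg_waits;		/* count of txg_wait_synced-style waits */
} mock_zilog_t;

/* Models txg_wait_synced(): data reaches disk, lwb's are left untouched. */
static void
mock_txg_wait_synced(mock_zilog_t *zilog)
{
	zilog->txg_waits++;
}

/* Models zil_commit_impl(): always drives every outstanding lwb to DONE. */
static void
mock_zil_commit_impl(mock_zilog_t *zilog, uint64_t oid)
{
	(void) oid;
	zilog->outstanding_lwbs = 0;
}

/* Models zil_commit(): suspended callers take the txg_wait_synced branch. */
static void
mock_zil_commit(mock_zilog_t *zilog, uint64_t oid)
{
	if (zilog->zl_suspend > 0) {
		mock_txg_wait_synced(zilog);
		return;
	}
	mock_zil_commit_impl(zilog, oid);
}

/*
 * Models the fixed zil_suspend(): set the flag first so that no new lwb's
 * are created, then call the impl directly to bypass the zl_suspend check.
 * Returns the number of lwb's still outstanding when destroy would run.
 */
static int
mock_zil_suspend(mock_zilog_t *zilog)
{
	zilog->zl_suspend++;
	mock_zil_commit_impl(zilog, 0);
	return (zilog->outstanding_lwbs);
}
```

In this model, calling `mock_zil_commit` on an already suspended zilog leaves
`outstanding_lwbs` unchanged, which corresponds to the buggy path, while
`mock_zil_suspend` always returns 0.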
OpenZFS-issue: https://www.illumos.org/issues/8909
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ece62b6f8d
Closes #6940
2017-12-07 22:26:32 +03:00
|
|
|
extern void zil_commit_impl(zilog_t *zilog, uint64_t oid);
|
2020-06-11 23:38:25 +03:00
|
|
|
extern void zil_remove_async(zilog_t *zilog, uint64_t oid);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
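As a rough illustration of the remapping step, the sketch below looks up an
offset on the removed vdev in a sorted in-memory table and translates it to
its new location. The `map_entry_t` layout and the `map_lookup` and
`map_remap` helpers are hypothetical names chosen for this example, and the
sketch assumes a request falls entirely within one entry. The real mapping
code is more involved.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping entry: one copied segment of the removed vdev. */
typedef struct map_entry {
	uint64_t src_offset;	/* start of the segment on the removed vdev */
	uint64_t size;		/* length of the segment in bytes */
	uint64_t dst_vdev;	/* id of the vdev now holding the data */
	uint64_t dst_offset;	/* start of the segment on dst_vdev */
} map_entry_t;

/*
 * Binary search a table sorted by src_offset for the entry covering
 * `offset`. Returns NULL if no entry covers it.
 */
static const map_entry_t *
map_lookup(const map_entry_t *table, size_t count, uint64_t offset)
{
	size_t lo = 0, hi = count;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;
		const map_entry_t *e = &table[mid];

		if (offset < e->src_offset)
			hi = mid;
		else if (offset - e->src_offset >= e->size)
			lo = mid + 1;
		else
			return (e);
	}
	return (NULL);
}

/* Translate an offset; returns 0 on success, -1 if the offset is unmapped. */
static int
map_remap(const map_entry_t *table, size_t count, uint64_t offset,
    uint64_t *dst_vdev, uint64_t *dst_offset)
{
	const map_entry_t *e = map_lookup(table, count, offset);

	if (e == NULL)
		return (-1);
	*dst_vdev = e->dst_vdev;
	*dst_offset = e->dst_offset + (offset - e->src_offset);
	return (0);
}
```

Keeping the table sorted by source offset makes each lookup logarithmic in the
number of entries, which is one plausible way to keep the per-I/O overhead
small.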
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refactoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
extern int zil_reset(const char *osname, void *txarg);
|
2015-05-06 19:07:55 +03:00
|
|
|
extern int zil_claim(struct dsl_pool *dp,
|
|
|
|
struct dsl_dataset *ds, void *txarg);
|
|
|
|
extern int zil_check_log_chain(struct dsl_pool *dp,
|
|
|
|
struct dsl_dataset *ds, void *tx);
|
2008-11-20 23:01:55 +03:00
|
|
|
extern void zil_sync(zilog_t *zilog, dmu_tx_t *tx);
|
2010-08-27 01:24:34 +04:00
|
|
|
extern void zil_clean(zilog_t *zilog, uint64_t synced_txg);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
extern int zil_suspend(const char *osname, void **cookiep);
|
|
|
|
extern void zil_resume(void *cookie);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
OpenZFS 8585 - improve batching done in zil_commit()
Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Prakash Surya <prakash.surya@delphix.com>
Problem
=======
The current implementation of zil_commit() can introduce significant
latency, beyond what is inherent due to the latency of the underlying
storage. The additional latency comes from two main problems:
1. When there are outstanding ZIL blocks being written (i.e. there's
already a "writer thread" in progress), then any new calls to
zil_commit() will block waiting for the currently outstanding ZIL
blocks to complete. The blocks written for each "writer thread" are
coined a "batch", and there can only ever be a single "batch" being
written at a time. When a batch is being written, any new ZIL
transactions will have to wait for the next batch to be written,
which won't occur until the current batch finishes.
As a result, the underlying storage may not be used as efficiently
as possible. While "new" threads enter zil_commit() and are blocked
waiting for the next batch, it's possible that the underlying
storage isn't fully utilized by the current batch of ZIL blocks. In
that case, it'd be better to allow these new threads to generate
(and issue) a new ZIL block, such that it could be serviced by the
underlying storage concurrently with the other ZIL blocks that are
being serviced.
2. Any call to zil_commit() must wait for all ZIL blocks in its "batch"
to complete, prior to zil_commit() returning. The size of any given
batch is proportional to the number of ZIL transactions in the queue
at the time that the batch starts processing the queue; which
doesn't occur until the previous batch completes. Thus, if there's a
lot of transactions in the queue, the batch could be composed of
many ZIL blocks, and each call to zil_commit() will have to wait for
all of these writes to complete (even if the thread calling
zil_commit() only cared about one of the transactions in the batch).
To further complicate the situation, these two issues result in the
following side effect:
3. If a given batch takes longer to complete than normal, this results
in larger batch sizes, which then take longer to complete and
further drive up the latency of zil_commit(). This can occur for a
number of reasons, including (but not limited to): transient changes
in the workload, and storage latency irregularities.
Solution
========
The solution attempted by this change has the following goals:
1. no on-disk changes; maintain current on-disk format.
2. modify the "batch size" to be equal to the "ZIL block size".
3. allow new batches to be generated and issued to disk, while there
are already batches being serviced by the disk.
4. allow zil_commit() to wait for as few ZIL blocks as possible.
5. use as few ZIL blocks as possible, for the same amount of ZIL
transactions, without introducing significant latency to any
individual ZIL transaction. i.e. use fewer, but larger, ZIL blocks.
In theory, with these goals met, the new algorithm will allow the
following improvements:
1. new ZIL blocks can be generated and issued, while there are already
outstanding ZIL blocks being serviced by the storage.
2. the latency of zil_commit() should be proportional to the underlying
storage latency, rather than the incoming synchronous workload.
Porting Notes
=============
Due to the changes made in commit 119a394ab0, the lifetime of an itx
structure differs from that in OpenZFS. Specifically, the itx structure is
kept around until the data associated with the itx is considered to be
safe on disk; this is so that the itx's callback can be called after the
data is committed to stable storage. Since OpenZFS doesn't have this itx
callback mechanism, it's able to destroy the itx structure immediately
after the itx is committed to an lwb (before the lwb is written to
disk).
To support this difference, and to ensure the itx's callbacks can still
be called after the itx's data is on disk, a few changes had to be made:
* A list of itxs was added to the lwb structure. This list contains
all of the itxs that have been committed to the lwb, such that the
callbacks for these itxs can be called from zil_lwb_flush_vdevs_done(),
after the data for the itxs is committed to disk.
* A list of itxs was added on the stack of the zil_process_commit_list()
function; the "nolwb_itxs" list. In some circumstances, an itx may
not be committed to an lwb (e.g. if allocating the "next" ZIL block
on disk fails), so this list is used to keep track of which itxs
fall into this state, such that their callbacks can be called after
the ZIL's writer pipeline is "stalled".
* The logic to actually call the itx's callback was moved into the
zil_itx_destroy() function. Since all consumers of zil_itx_destroy()
were effectively performing the same logic (i.e. if callback is
non-null, call the callback), it seemed like useful code cleanup to
consolidate this logic into a single function.
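The consolidation described in the last item can be pictured with the sketch
below. It is a pared-down model under stated assumptions: `mock_itx_t`,
`mock_itx_create`, `mock_itx_destroy`, and `count_cb` are hypothetical names,
and the real itx structure carries far more state than the two callback
fields shown here.

```c
#include <assert.h>
#include <stdlib.h>

typedef void (*itx_cb_t)(void *data);

/* Hypothetical, pared-down itx: only the callback fields are modeled. */
typedef struct mock_itx {
	itx_cb_t itx_callback;		/* optional completion callback */
	void *itx_callback_data;	/* argument passed to the callback */
} mock_itx_t;

/* Allocate an itx with an optional callback; returns NULL on failure. */
static mock_itx_t *
mock_itx_create(itx_cb_t cb, void *data)
{
	mock_itx_t *itx = calloc(1, sizeof (*itx));

	if (itx != NULL) {
		itx->itx_callback = cb;
		itx->itx_callback_data = data;
	}
	return (itx);
}

/* Single place that both fires the callback (if any) and frees the itx. */
static void
mock_itx_destroy(mock_itx_t *itx)
{
	if (itx->itx_callback != NULL)
		itx->itx_callback(itx->itx_callback_data);
	free(itx);
}

/* Example callback: counts how many itxs with callbacks were retired. */
static void
count_cb(void *data)
{
	(*(int *)data)++;
}
```

With the callback invocation folded into the destroy path, callers no longer
need to repeat the "if callback is non-null, call it" check before freeing an
itx.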
Additionally, the existing Linux tracepoint infrastructure dealing with
the ZIL's probes and structures had to be updated to reflect these code
changes. Specifically:
* The "zil__cw1" and "zil__cw2" probes were removed, so they had to be
removed from "trace_zil.h" as well.
* Some of the zilog structure's fields were removed, which affected
the tracepoint definitions of the structure.
* New tracepoints had to be added for the following 3 new probes:
* zil__process__commit__itx
* zil__process__normal__itx
* zil__commit__io__error
OpenZFS-issue: https://www.illumos.org/issues/8585
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5d95a3a
Closes #6566
2017-12-05 20:39:16 +03:00
|
|
|
extern void zil_lwb_add_block(struct lwb *lwb, const blkptr_t *bp);
|
|
|
|
extern void zil_lwb_add_txg(struct lwb *lwb, uint64_t txg);
|
2010-05-29 00:45:14 +04:00
|
|
|
extern int zil_bp_tree_add(zilog_t *zilog, const blkptr_t *bp);
|
|
|
|
|
|
|
|
extern void zil_set_sync(zilog_t *zilog, uint64_t syncval);
|
|
|
|
|
|
|
|
extern void zil_set_logbias(zilog_t *zilog, uint64_t slogval);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2019-06-10 21:48:42 +03:00
|
|
|
extern uint64_t zil_max_copied_data(zilog_t *zilog);
|
|
|
|
extern uint64_t zil_max_log_data(zilog_t *zilog);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
extern int zil_replay_disable;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
#ifdef __cplusplus
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
|
|
|
#endif /* _SYS_ZIL_H */
|