2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* CDDL HEADER START
|
|
|
|
*
|
|
|
|
* The contents of this file are subject to the terms of the
|
|
|
|
* Common Development and Distribution License (the "License").
|
|
|
|
* You may not use this file except in compliance with the License.
|
|
|
|
*
|
|
|
|
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
|
2022-07-12 00:16:13 +03:00
|
|
|
* or https://opensource.org/licenses/CDDL-1.0.
|
2008-11-20 23:01:55 +03:00
|
|
|
* See the License for the specific language governing permissions
|
|
|
|
* and limitations under the License.
|
|
|
|
*
|
|
|
|
* When distributing Covered Code, include this CDDL HEADER in each
|
|
|
|
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
|
|
|
|
* If applicable, add the following below this CDDL HEADER, with the
|
|
|
|
* fields enclosed by brackets "[]" replaced with your own identifying
|
|
|
|
* information: Portions Copyright [yyyy] [name of copyright owner]
|
|
|
|
*
|
|
|
|
* CDDL HEADER END
|
|
|
|
*/
|
2017-02-03 01:13:41 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
|
Improve zfs receive performance with lightweight write
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes. Much of the CPU is spent creating and
destroying dbuf's and arc buf's, one for each `WRITE` record in the send
stream.
This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this
"dirty leaf block" are not instantiated.
Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data). Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.
Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.
Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.
Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
The code is also restructured in a few ways:
Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines. The new field is needed by the lightweight-type dirty record.
To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that the `dnode_move()` doesn't relocate the
dnode_t. To do this we keep a hold on the dnode until it's zio's have
completed. This is already done by the user-accounting code
(`userquota_updates_task()`), this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).
`dn_dirty_txg` was previously zeroed when the dnode was synced. This
was not necessary, since its meaning can be "when was this dnode last
dirtied". This change simplifies the new `dnode_rele_task()` code.
Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
2020-12-11 21:26:02 +03:00
|
|
|
* Copyright (c) 2012, 2020 by Delphix. All rights reserved.
|
2013-08-02 00:02:10 +04:00
|
|
|
* Copyright (c) 2013 by Saso Kiselkov. All rights reserved.
|
2015-04-01 16:07:48 +03:00
|
|
|
* Copyright (c) 2013, Joyent, Inc. All rights reserved.
|
2015-04-02 06:44:32 +03:00
|
|
|
* Copyright (c) 2014 Spectra Logic Corporation, All rights reserved.
|
2015-05-06 19:07:55 +03:00
|
|
|
* Copyright (c) 2015, STRATO AG, Inc. All rights reserved.
|
2014-03-22 13:07:14 +04:00
|
|
|
* Copyright (c) 2016 Actifio, Inc. All rights reserved.
|
2017-02-03 01:13:41 +03:00
|
|
|
* Copyright 2017 Nexenta Systems, Inc.
|
2017-11-11 00:37:10 +03:00
|
|
|
* Copyright (c) 2017 Open-E, Inc. All Rights Reserved.
|
2019-02-09 02:44:15 +03:00
|
|
|
* Copyright (c) 2018, loli10K <ezomori.nozomu@gmail.com>. All rights reserved.
|
Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm Available compression.
Levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.
All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-18 20:10:17 +03:00
|
|
|
* Copyright (c) 2019, Klara Inc.
|
|
|
|
* Copyright (c) 2019, Allan Jude
|
2022-10-20 03:07:51 +03:00
|
|
|
* Copyright (c) 2022 Hewlett Packard Enterprise Development LP.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/* Portions Copyright 2010 Robert Milkowski */
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
#include <sys/cred.h>
|
|
|
|
#include <sys/zfs_context.h>
|
|
|
|
#include <sys/dmu_objset.h>
|
|
|
|
#include <sys/dsl_dir.h>
|
|
|
|
#include <sys/dsl_dataset.h>
|
|
|
|
#include <sys/dsl_prop.h>
|
|
|
|
#include <sys/dsl_pool.h>
|
|
|
|
#include <sys/dsl_synctask.h>
|
|
|
|
#include <sys/dsl_deleg.h>
|
|
|
|
#include <sys/dnode.h>
|
|
|
|
#include <sys/dbuf.h>
|
|
|
|
#include <sys/zvol.h>
|
|
|
|
#include <sys/dmu_tx.h>
|
|
|
|
#include <sys/zap.h>
|
|
|
|
#include <sys/zil.h>
|
|
|
|
#include <sys/dmu_impl.h>
|
|
|
|
#include <sys/zfs_ioctl.h>
|
2010-05-29 00:45:14 +04:00
|
|
|
#include <sys/sa.h>
|
2010-08-27 01:24:34 +04:00
|
|
|
#include <sys/zfs_onexit.h>
|
2013-09-04 16:00:57 +04:00
|
|
|
#include <sys/dsl_destroy.h>
|
2015-05-06 19:07:55 +03:00
|
|
|
#include <sys/vdev.h>
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
#include <sys/zfeature.h>
|
2016-06-07 19:16:52 +03:00
|
|
|
#include <sys/policy.h>
|
2016-10-04 21:46:10 +03:00
|
|
|
#include <sys/spa_impl.h>
|
2018-10-10 00:05:13 +03:00
|
|
|
#include <sys/dmu_recv.h>
|
2018-02-14 01:54:54 +03:00
|
|
|
#include <sys/zfs_project.h>
|
2016-09-12 18:15:20 +03:00
|
|
|
#include "zfs_namecheck.h"
|
2021-11-11 23:52:16 +03:00
|
|
|
#include <sys/vdev_impl.h>
|
|
|
|
#include <sys/arc.h>
|
2024-08-27 10:52:33 +03:00
|
|
|
#include <cityhash.h>
|
2010-08-27 01:24:34 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Needed to close a window in dnode_move() that allows the objset to be freed
|
|
|
|
* before it can be safely accessed.
|
|
|
|
*/
|
|
|
|
krwlock_t os_lock;
|
|
|
|
|
2015-05-06 19:07:55 +03:00
|
|
|
/*
|
2017-01-03 20:31:18 +03:00
|
|
|
* Tunable to overwrite the maximum number of threads for the parallelization
|
2015-05-06 19:07:55 +03:00
|
|
|
* of dmu_objset_find_dp, needed to speed up the import of pools with many
|
|
|
|
* datasets.
|
|
|
|
* Default is 4 times the number of leaf vdevs.
|
|
|
|
*/
|
2022-01-15 02:37:55 +03:00
|
|
|
static const int dmu_find_threads = 0;
|
2015-05-06 19:07:55 +03:00
|
|
|
|
2016-05-17 04:02:29 +03:00
|
|
|
/*
|
|
|
|
* Backfill lower metadnode objects after this many have been freed.
|
|
|
|
* Backfilling negatively impacts object creation rates, so only do it
|
|
|
|
* if there are enough holes to fill.
|
|
|
|
*/
|
2022-01-15 02:37:55 +03:00
|
|
|
static const int dmu_rescan_dnode_threshold = 1 << DN_MAX_INDBLKSHIFT;
|
2016-05-17 04:02:29 +03:00
|
|
|
|
2022-01-15 02:37:55 +03:00
|
|
|
static const char *upgrade_tag = "upgrade_tag";
|
2017-11-11 00:37:10 +03:00
|
|
|
|
2015-05-06 19:07:55 +03:00
|
|
|
static void dmu_objset_find_dp_cb(void *arg);
|
|
|
|
|
2016-10-04 21:46:10 +03:00
|
|
|
static void dmu_objset_upgrade(objset_t *os, dmu_objset_upgrade_cb_t cb);
|
|
|
|
static void dmu_objset_upgrade_stop(objset_t *os);
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
void
|
|
|
|
dmu_objset_init(void)
|
|
|
|
{
|
|
|
|
rw_init(&os_lock, NULL, RW_DEFAULT, NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
dmu_objset_fini(void)
|
|
|
|
{
|
|
|
|
rw_destroy(&os_lock);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
spa_t *
|
|
|
|
dmu_objset_spa(objset_t *os)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
return (os->os_spa);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
zilog_t *
|
|
|
|
dmu_objset_zil(objset_t *os)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
return (os->os_zil);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
dsl_pool_t *
|
|
|
|
dmu_objset_pool(objset_t *os)
|
|
|
|
{
|
|
|
|
dsl_dataset_t *ds;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if ((ds = os->os_dsl_dataset) != NULL && ds->ds_dir)
|
2008-11-20 23:01:55 +03:00
|
|
|
return (ds->ds_dir->dd_pool);
|
|
|
|
else
|
2010-05-29 00:45:14 +04:00
|
|
|
return (spa_get_dsl(os->os_spa));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
dsl_dataset_t *
|
|
|
|
dmu_objset_ds(objset_t *os)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
return (os->os_dsl_dataset);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
dmu_objset_type_t
|
|
|
|
dmu_objset_type(objset_t *os)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
return (os->os_phys->os_type);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
dmu_objset_name(objset_t *os, char *buf)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
dsl_dataset_name(os->os_dsl_dataset, buf);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
uint64_t
|
|
|
|
dmu_objset_id(objset_t *os)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
dsl_dataset_t *ds = os->os_dsl_dataset;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
return (ds ? ds->ds_object : 0);
|
|
|
|
}
|
|
|
|
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
uint64_t
|
|
|
|
dmu_objset_dnodesize(objset_t *os)
|
|
|
|
{
|
|
|
|
return (os->os_dnodesize);
|
|
|
|
}
|
|
|
|
|
2014-05-23 20:21:07 +04:00
|
|
|
zfs_sync_type_t
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_objset_syncprop(objset_t *os)
|
|
|
|
{
|
|
|
|
return (os->os_sync);
|
|
|
|
}
|
|
|
|
|
2014-05-23 20:21:07 +04:00
|
|
|
zfs_logbias_op_t
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_objset_logbias(objset_t *os)
|
|
|
|
{
|
|
|
|
return (os->os_logbias);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
checksum_changed_cb(void *arg, uint64_t newval)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t *os = arg;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Inheritance should have been done by now.
|
|
|
|
*/
|
|
|
|
ASSERT(newval != ZIO_CHECKSUM_INHERIT);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
os->os_checksum = zio_checksum_select(newval, ZIO_CHECKSUM_ON_VALUE);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
compression_changed_cb(void *arg, uint64_t newval)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t *os = arg;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Inheritance and range checking should have been done by now.
|
|
|
|
*/
|
|
|
|
ASSERT(newval != ZIO_COMPRESS_INHERIT);
|
|
|
|
|
Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm Available compression.
Levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.
All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-18 20:10:17 +03:00
|
|
|
os->os_compress = zio_compress_select(os->os_spa,
|
|
|
|
ZIO_COMPRESS_ALGO(newval), ZIO_COMPRESS_ON);
|
|
|
|
os->os_complevel = zio_complevel_select(os->os_spa, os->os_compress,
|
|
|
|
ZIO_COMPRESS_LEVEL(newval), ZIO_COMPLEVEL_DEFAULT);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
copies_changed_cb(void *arg, uint64_t newval)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t *os = arg;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Inheritance and range checking should have been done by now.
|
|
|
|
*/
|
|
|
|
ASSERT(newval > 0);
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(newval <= spa_max_replication(os->os_spa));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
os->os_copies = newval;
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
dedup_changed_cb(void *arg, uint64_t newval)
|
|
|
|
{
|
|
|
|
objset_t *os = arg;
|
|
|
|
spa_t *spa = os->os_spa;
|
|
|
|
enum zio_checksum checksum;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Inheritance should have been done by now.
|
|
|
|
*/
|
|
|
|
ASSERT(newval != ZIO_CHECKSUM_INHERIT);
|
|
|
|
|
|
|
|
checksum = zio_checksum_dedup_select(spa, newval, ZIO_CHECKSUM_OFF);
|
|
|
|
|
|
|
|
os->os_dedup_checksum = checksum & ZIO_CHECKSUM_MASK;
|
|
|
|
os->os_dedup_verify = !!(checksum & ZIO_CHECKSUM_VERIFY);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
static void
|
|
|
|
primary_cache_changed_cb(void *arg, uint64_t newval)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t *os = arg;
|
2008-12-03 23:09:06 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Inheritance and range checking should have been done by now.
|
|
|
|
*/
|
|
|
|
ASSERT(newval == ZFS_CACHE_ALL || newval == ZFS_CACHE_NONE ||
|
|
|
|
newval == ZFS_CACHE_METADATA);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
os->os_primary_cache = newval;
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
secondary_cache_changed_cb(void *arg, uint64_t newval)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t *os = arg;
|
2008-12-03 23:09:06 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Inheritance and range checking should have been done by now.
|
|
|
|
*/
|
|
|
|
ASSERT(newval == ZFS_CACHE_ALL || newval == ZFS_CACHE_NONE ||
|
|
|
|
newval == ZFS_CACHE_METADATA);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
os->os_secondary_cache = newval;
|
|
|
|
}
|
|
|
|
|
2023-10-24 21:00:07 +03:00
|
|
|
static void
|
|
|
|
prefetch_changed_cb(void *arg, uint64_t newval)
|
|
|
|
{
|
|
|
|
objset_t *os = arg;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Inheritance should have been done by now.
|
|
|
|
*/
|
|
|
|
ASSERT(newval == ZFS_PREFETCH_ALL || newval == ZFS_PREFETCH_NONE ||
|
|
|
|
newval == ZFS_PREFETCH_METADATA);
|
|
|
|
os->os_prefetch = newval;
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
sync_changed_cb(void *arg, uint64_t newval)
|
|
|
|
{
|
|
|
|
objset_t *os = arg;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Inheritance and range checking should have been done by now.
|
|
|
|
*/
|
|
|
|
ASSERT(newval == ZFS_SYNC_STANDARD || newval == ZFS_SYNC_ALWAYS ||
|
|
|
|
newval == ZFS_SYNC_DISABLED);
|
|
|
|
|
|
|
|
os->os_sync = newval;
|
|
|
|
if (os->os_zil)
|
|
|
|
zil_set_sync(os->os_zil, newval);
|
|
|
|
}
|
|
|
|
|
2014-05-23 20:21:07 +04:00
|
|
|
static void
|
|
|
|
redundant_metadata_changed_cb(void *arg, uint64_t newval)
|
|
|
|
{
|
|
|
|
objset_t *os = arg;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Inheritance and range checking should have been done by now.
|
|
|
|
*/
|
|
|
|
ASSERT(newval == ZFS_REDUNDANT_METADATA_ALL ||
|
2022-10-20 03:07:51 +03:00
|
|
|
newval == ZFS_REDUNDANT_METADATA_MOST ||
|
|
|
|
newval == ZFS_REDUNDANT_METADATA_SOME ||
|
|
|
|
newval == ZFS_REDUNDANT_METADATA_NONE);
|
2014-05-23 20:21:07 +04:00
|
|
|
|
|
|
|
os->os_redundant_metadata = newval;
|
|
|
|
}
|
|
|
|
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
static void
|
|
|
|
dnodesize_changed_cb(void *arg, uint64_t newval)
|
|
|
|
{
|
|
|
|
objset_t *os = arg;
|
|
|
|
|
|
|
|
switch (newval) {
|
|
|
|
case ZFS_DNSIZE_LEGACY:
|
|
|
|
os->os_dnodesize = DNODE_MIN_SIZE;
|
|
|
|
break;
|
|
|
|
case ZFS_DNSIZE_AUTO:
|
|
|
|
/*
|
|
|
|
* Choose a dnode size that will work well for most
|
|
|
|
* workloads if the user specified "auto". Future code
|
|
|
|
* improvements could dynamically select a dnode size
|
|
|
|
* based on observed workload patterns.
|
|
|
|
*/
|
|
|
|
os->os_dnodesize = DNODE_MIN_SIZE * 2;
|
|
|
|
break;
|
|
|
|
case ZFS_DNSIZE_1K:
|
|
|
|
case ZFS_DNSIZE_2K:
|
|
|
|
case ZFS_DNSIZE_4K:
|
|
|
|
case ZFS_DNSIZE_8K:
|
|
|
|
case ZFS_DNSIZE_16K:
|
|
|
|
os->os_dnodesize = newval;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-09-06 04:33:36 +03:00
|
|
|
static void
|
|
|
|
smallblk_changed_cb(void *arg, uint64_t newval)
|
|
|
|
{
|
|
|
|
objset_t *os = arg;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Inheritance and range checking should have been done by now.
|
|
|
|
*/
|
2021-01-24 02:45:27 +03:00
|
|
|
ASSERT(newval <= SPA_MAXBLOCKSIZE);
|
2018-09-06 04:33:36 +03:00
|
|
|
ASSERT(ISP2(newval));
|
|
|
|
|
|
|
|
os->os_zpl_special_smallblock = newval;
|
|
|
|
}
|
|
|
|
|
Adding Direct IO Support
Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads.
O_DIRECT support in ZFS will always ensure there is coherency between
buffered and O_DIRECT IO requests. This ensures that all IO requests,
whether buffered or direct, will see the same file contents at all
times. Just as in other FS's , O_DIRECT does not imply O_SYNC. While
data is written directly to VDEV disks, metadata will not be synced
until the associated TXG is synced.
For both O_DIRECT read and write request the offset and request sizes,
at a minimum, must be PAGE_SIZE aligned. In the event they are not,
then EINVAL is returned unless the direct property is set to always (see
below).
For O_DIRECT writes:
The request also must be block aligned (recordsize) or the write
request will take the normal (buffered) write path. In the event that
request is block aligned and a cached copy of the buffer in the ARC,
then it will be discarded from the ARC forcing all further reads to
retrieve the data from disk.
For O_DIRECT reads:
The only alignment restrictions are PAGE_SIZE alignment. In the event
that the requested data is in buffered (in the ARC) it will just be
copied from the ARC into the user buffer.
For both O_DIRECT writes and reads the O_DIRECT flag will be ignored in
the event that file contents are mmap'ed. In this case, all requests
that are at least PAGE_SIZE aligned will just fall back to the buffered
paths. If the request however is not PAGE_SIZE aligned, EINVAL will
be returned as always regardless if the file's contents are mmap'ed.
Since O_DIRECT writes go through the normal ZIO pipeline, the
following operations are supported just as with normal buffered writes:
Checksum
Compression
Encryption
Erasure Coding
There is one caveat for the data integrity of O_DIRECT writes that is
distinct for each of the OS's supported by ZFS.
FreeBSD - FreeBSD is able to place user pages under write protection so
any data in the user buffers and written directly down to the
VDEV disks is guaranteed to not change. There is no concern
with data integrity and O_DIRECT writes.
Linux - Linux is not able to place anonymous user pages under write
protection. Because of this, if the user decides to manipulate
the page contents while the write operation is occurring, data
integrity can not be guaranteed. However, there is a module
parameter `zfs_vdev_direct_write_verify` that controls the
if a O_DIRECT writes that can occur to a top-level VDEV before
a checksum verify is run before the contents of the I/O buffer
are committed to disk. In the event of a checksum verification
failure the write will return EIO. The number of O_DIRECT write
checksum verification errors can be observed by doing
`zpool status -d`, which will list all verification errors that
have occurred on a top-level VDEV. Along with `zpool status`, a
ZED event will be issues as `dio_verify` when a checksum
verification error occurs.
ZVOLs and dedup is not currently supported with Direct I/O.
A new dataset property `direct` has been added with the following 3
allowable values:
disabled - Accepts O_DIRECT flag, but silently ignores it and treats
the request as a buffered IO request.
standard - Follows the alignment restrictions outlined above for
write/read IO requests when the O_DIRECT flag is used.
always - Treats every write/read IO request as though it passed
O_DIRECT and will do O_DIRECT if the alignment restrictions
are met otherwise will redirect through the ARC. This
property will not allow a request to fail.
There is also a module parameter zfs_dio_enabled that can be used to
force all reads and writes through the ARC. By setting this module
parameter to 0, it mimics as if the direct dataset property is set to
disabled.
Reviewed-by: Brian Behlendorf <behlendorf@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Co-authored-by: Mark Maybee <mark.maybee@delphix.com>
Co-authored-by: Matt Macy <mmacy@FreeBSD.org>
Co-authored-by: Brian Behlendorf <behlendorf@llnl.gov>
Closes #10018
2024-09-14 23:47:59 +03:00
|
|
|
static void
|
|
|
|
direct_changed_cb(void *arg, uint64_t newval)
|
|
|
|
{
|
|
|
|
objset_t *os = arg;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Inheritance and range checking should have been done by now.
|
|
|
|
*/
|
|
|
|
ASSERT(newval == ZFS_DIRECT_DISABLED || newval == ZFS_DIRECT_STANDARD ||
|
|
|
|
newval == ZFS_DIRECT_ALWAYS);
|
|
|
|
|
|
|
|
os->os_direct = newval;
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
|
|
|
logbias_changed_cb(void *arg, uint64_t newval)
|
|
|
|
{
|
|
|
|
objset_t *os = arg;
|
|
|
|
|
|
|
|
ASSERT(newval == ZFS_LOGBIAS_LATENCY ||
|
|
|
|
newval == ZFS_LOGBIAS_THROUGHPUT);
|
|
|
|
os->os_logbias = newval;
|
|
|
|
if (os->os_zil)
|
|
|
|
zil_set_logbias(os->os_zil, newval);
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
|
|
|
|
2014-11-03 23:15:08 +03:00
|
|
|
static void
|
|
|
|
recordsize_changed_cb(void *arg, uint64_t newval)
|
|
|
|
{
|
|
|
|
objset_t *os = arg;
|
|
|
|
|
|
|
|
os->os_recordsize = newval;
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
void
|
|
|
|
dmu_objset_byteswap(void *buf, size_t size)
|
|
|
|
{
|
|
|
|
objset_phys_t *osp = buf;
|
|
|
|
|
2018-02-14 01:54:54 +03:00
|
|
|
ASSERT(size == OBJSET_PHYS_SIZE_V1 || size == OBJSET_PHYS_SIZE_V2 ||
|
|
|
|
size == sizeof (objset_phys_t));
|
2008-11-20 23:01:55 +03:00
|
|
|
dnode_byteswap(&osp->os_meta_dnode);
|
|
|
|
byteswap_uint64_array(&osp->os_zil_header, sizeof (zil_header_t));
|
|
|
|
osp->os_type = BSWAP_64(osp->os_type);
|
2009-07-03 02:44:48 +04:00
|
|
|
osp->os_flags = BSWAP_64(osp->os_flags);
|
2018-02-14 01:54:54 +03:00
|
|
|
if (size >= OBJSET_PHYS_SIZE_V2) {
|
2009-07-03 02:44:48 +04:00
|
|
|
dnode_byteswap(&osp->os_userused_dnode);
|
|
|
|
dnode_byteswap(&osp->os_groupused_dnode);
|
2018-02-14 01:54:54 +03:00
|
|
|
if (size >= sizeof (objset_phys_t))
|
|
|
|
dnode_byteswap(&osp->os_projectused_dnode);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2017-03-21 04:36:00 +03:00
|
|
|
/*
|
2024-09-07 17:07:14 +03:00
|
|
|
* Runs cityhash on the objset_t pointer and the object number.
|
2017-03-21 04:36:00 +03:00
|
|
|
*/
|
|
|
|
static uint64_t
|
|
|
|
dnode_hash(const objset_t *os, uint64_t obj)
|
|
|
|
{
|
|
|
|
uintptr_t osv = (uintptr_t)os;
|
2024-09-07 17:07:14 +03:00
|
|
|
return (cityhash2((uint64_t)osv, obj));
|
2017-03-21 04:36:00 +03:00
|
|
|
}
|
|
|
|
|
2020-06-15 21:30:37 +03:00
|
|
|
static unsigned int
|
2017-03-21 04:36:00 +03:00
|
|
|
dnode_multilist_index_func(multilist_t *ml, void *obj)
|
|
|
|
{
|
|
|
|
dnode_t *dn = obj;
|
2021-06-29 15:59:14 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The low order bits of the hash value are thought to be
|
|
|
|
* distributed evenly. Otherwise, in the case that the multilist
|
|
|
|
* has a power of two number of sublists, each sublists' usage
|
|
|
|
* would not be evenly distributed. In this context full 64bit
|
|
|
|
* division would be a waste of time, so limit it to 32 bits.
|
|
|
|
*/
|
|
|
|
return ((unsigned int)dnode_hash(dn->dn_objset, dn->dn_object) %
|
2017-03-21 04:36:00 +03:00
|
|
|
multilist_get_num_sublists(ml));
|
|
|
|
}
|
|
|
|
|
2021-11-11 23:52:16 +03:00
|
|
|
static inline boolean_t
|
|
|
|
dmu_os_is_l2cacheable(objset_t *os)
|
|
|
|
{
|
2023-03-07 03:13:05 +03:00
|
|
|
if (os->os_secondary_cache == ZFS_CACHE_ALL ||
|
|
|
|
os->os_secondary_cache == ZFS_CACHE_METADATA) {
|
|
|
|
if (l2arc_exclude_special == 0)
|
|
|
|
return (B_TRUE);
|
|
|
|
|
|
|
|
blkptr_t *bp = os->os_rootbp;
|
|
|
|
if (bp == NULL || BP_IS_HOLE(bp))
|
|
|
|
return (B_FALSE);
|
2021-11-11 23:52:16 +03:00
|
|
|
uint64_t vdev = DVA_GET_VDEV(bp->blk_dva);
|
|
|
|
vdev_t *rvd = os->os_spa->spa_root_vdev;
|
2023-03-07 03:13:05 +03:00
|
|
|
vdev_t *vd = NULL;
|
2021-11-11 23:52:16 +03:00
|
|
|
|
|
|
|
if (vdev < rvd->vdev_children)
|
|
|
|
vd = rvd->vdev_child[vdev];
|
|
|
|
|
2023-03-07 03:13:05 +03:00
|
|
|
if (vd == NULL)
|
|
|
|
return (B_TRUE);
|
2021-11-11 23:52:16 +03:00
|
|
|
|
2023-03-07 03:13:05 +03:00
|
|
|
if (vd->vdev_alloc_bias != VDEV_BIAS_SPECIAL &&
|
|
|
|
vd->vdev_alloc_bias != VDEV_BIAS_DEDUP)
|
|
|
|
return (B_TRUE);
|
2021-11-11 23:52:16 +03:00
|
|
|
}
|
|
|
|
return (B_FALSE);
|
|
|
|
}
|
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
/*
|
|
|
|
* Instantiates the objset_t in-memory structure corresponding to the
|
|
|
|
* objset_phys_t that's pointed to by the specified blkptr_t.
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
int
|
|
|
|
dmu_objset_open_impl(spa_t *spa, dsl_dataset_t *ds, blkptr_t *bp,
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t **osp)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t *os;
|
2008-12-03 23:09:06 +03:00
|
|
|
int i, err;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
ASSERT(ds == NULL || MUTEX_HELD(&ds->ds_opening_lock));
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
ASSERT(!BP_IS_REDACTED(bp));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
dmu_objset_from_ds must be called with dp_config_rwlock held
The normal lock order is that the dp_config_rwlock must be held before
the ds_opening_lock. For example, dmu_objset_hold() does this.
However, dmu_objset_open_impl() is called with the ds_opening_lock held,
and if the dp_config_rwlock is not already held, it will attempt to
acquire it. This may lead to deadlock, since the lock order is
reversed.
Looking at all the callers of dmu_objset_open_impl() (which is
principally the callers of dmu_objset_from_ds()), almost all callers
already have the dp_config_rwlock. However, there are a few places in
the send and receive code paths that do not. For example:
dsl_crypto_populate_key_nvlist, send_cb, dmu_recv_stream,
receive_write_byref, redact_traverse_thread.
This commit resolves the problem by requiring all callers ot
dmu_objset_from_ds() to hold the dp_config_rwlock. In most cases, the
code has been restructured such that we call dmu_objset_from_ds()
earlier on in the send and receive processes, when we already have the
dp_config_rwlock, and save the objset_t until we need it in the middle
of the send or receive (similar to what we already do with the
dsl_dataset_t). Thus we do not need to acquire the dp_config_rwlock in
many new places.
I also cleaned up code in dmu_redact_snap() and send_traverse_thread().
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #9662
Closes #10115
2020-03-12 20:55:02 +03:00
|
|
|
/*
|
|
|
|
* We need the pool config lock to get properties.
|
|
|
|
*/
|
|
|
|
ASSERT(ds == NULL || dsl_pool_config_held(ds->ds_dir->dd_pool));
|
|
|
|
|
OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2016-09-22 19:30:13 +03:00
|
|
|
/*
|
|
|
|
* The $ORIGIN dataset (if it exists) doesn't have an associated
|
|
|
|
* objset, so there's no reason to open it. The $ORIGIN dataset
|
|
|
|
* will not exist on pools older than SPA_VERSION_ORIGIN.
|
|
|
|
*/
|
|
|
|
if (ds != NULL && spa_get_dsl(spa) != NULL &&
|
|
|
|
spa_get_dsl(spa)->dp_origin_snap != NULL) {
|
|
|
|
ASSERT3P(ds->ds_dir, !=,
|
|
|
|
spa_get_dsl(spa)->dp_origin_snap->ds_dir);
|
|
|
|
}
|
|
|
|
|
2014-11-21 03:09:39 +03:00
|
|
|
os = kmem_zalloc(sizeof (objset_t), KM_SLEEP);
|
2010-05-29 00:45:14 +04:00
|
|
|
os->os_dsl_dataset = ds;
|
|
|
|
os->os_spa = spa;
|
|
|
|
os->os_rootbp = bp;
|
|
|
|
if (!BP_IS_HOLE(os->os_rootbp)) {
|
2014-12-06 20:24:32 +03:00
|
|
|
arc_flags_t aflags = ARC_FLAG_WAIT;
|
2014-06-25 22:37:59 +04:00
|
|
|
zbookmark_phys_t zb;
|
2018-02-14 01:54:54 +03:00
|
|
|
int size;
|
2022-10-27 19:54:54 +03:00
|
|
|
zio_flag_t zio_flags = ZIO_FLAG_CANFAIL;
|
2010-05-29 00:45:14 +04:00
|
|
|
SET_BOOKMARK(&zb, ds ? ds->ds_object : DMU_META_OBJSET,
|
|
|
|
ZB_ROOT_OBJECT, ZB_ROOT_LEVEL, ZB_ROOT_BLKID);
|
|
|
|
|
2021-11-11 23:52:16 +03:00
|
|
|
if (dmu_os_is_l2cacheable(os))
|
2014-12-06 20:24:32 +03:00
|
|
|
aflags |= ARC_FLAG_L2CACHE;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (ds != NULL && ds->ds_dir->dd_crypto_obj != 0) {
|
|
|
|
ASSERT3U(BP_GET_COMPRESS(bp), ==, ZIO_COMPRESS_OFF);
|
|
|
|
ASSERT(BP_IS_AUTHENTICATED(bp));
|
|
|
|
zio_flags |= ZIO_FLAG_RAW;
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
dprintf_bp(os->os_rootbp, "reading %s", "");
|
2013-07-03 00:26:24 +04:00
|
|
|
err = arc_read(NULL, spa, os->os_rootbp,
|
2010-05-29 00:45:14 +04:00
|
|
|
arc_getbuf_func, &os->os_phys_buf,
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ZIO_PRIORITY_SYNC_READ, zio_flags, &aflags, &zb);
|
2013-09-04 16:00:57 +04:00
|
|
|
if (err != 0) {
|
2010-05-29 00:45:14 +04:00
|
|
|
kmem_free(os, sizeof (objset_t));
|
2008-12-03 23:09:06 +03:00
|
|
|
/* convert checksum errors into IO errors */
|
|
|
|
if (err == ECKSUM)
|
2013-03-08 22:41:28 +04:00
|
|
|
err = SET_ERROR(EIO);
|
2008-11-20 23:01:55 +03:00
|
|
|
return (err);
|
|
|
|
}
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2018-02-14 01:54:54 +03:00
|
|
|
if (spa_version(spa) < SPA_VERSION_USERSPACE)
|
|
|
|
size = OBJSET_PHYS_SIZE_V1;
|
|
|
|
else if (!spa_feature_is_enabled(spa,
|
|
|
|
SPA_FEATURE_PROJECT_QUOTA))
|
|
|
|
size = OBJSET_PHYS_SIZE_V2;
|
|
|
|
else
|
|
|
|
size = sizeof (objset_phys_t);
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
/* Increase the blocksize if we are permitted. */
|
2018-02-14 01:54:54 +03:00
|
|
|
if (arc_buf_size(os->os_phys_buf) < size) {
|
2016-07-11 20:45:52 +03:00
|
|
|
arc_buf_t *buf = arc_alloc_buf(spa, &os->os_phys_buf,
|
2018-02-14 01:54:54 +03:00
|
|
|
ARC_BUFC_METADATA, size);
|
2022-02-25 16:26:54 +03:00
|
|
|
memset(buf->b_data, 0, size);
|
|
|
|
memcpy(buf->b_data, os->os_phys_buf->b_data,
|
2010-05-29 00:45:14 +04:00
|
|
|
arc_buf_size(os->os_phys_buf));
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_buf_destroy(os->os_phys_buf, &os->os_phys_buf);
|
2010-05-29 00:45:14 +04:00
|
|
|
os->os_phys_buf = buf;
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
os->os_phys = os->os_phys_buf->b_data;
|
|
|
|
os->os_flags = os->os_phys->os_flags;
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
2009-07-03 02:44:48 +04:00
|
|
|
int size = spa_version(spa) >= SPA_VERSION_USERSPACE ?
|
2018-02-14 01:54:54 +03:00
|
|
|
sizeof (objset_phys_t) : OBJSET_PHYS_SIZE_V1;
|
2016-07-11 20:45:52 +03:00
|
|
|
os->os_phys_buf = arc_alloc_buf(spa, &os->os_phys_buf,
|
|
|
|
ARC_BUFC_METADATA, size);
|
2010-05-29 00:45:14 +04:00
|
|
|
os->os_phys = os->os_phys_buf->b_data;
|
2022-02-25 16:26:54 +03:00
|
|
|
memset(os->os_phys, 0, size);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2018-07-10 20:49:50 +03:00
|
|
|
/*
|
|
|
|
* These properties will be filled in by the logic in zfs_get_zplprop()
|
|
|
|
* when they are queried for the first time.
|
|
|
|
*/
|
|
|
|
os->os_version = OBJSET_PROP_UNINITIALIZED;
|
|
|
|
os->os_normalization = OBJSET_PROP_UNINITIALIZED;
|
|
|
|
os->os_utf8only = OBJSET_PROP_UNINITIALIZED;
|
|
|
|
os->os_casesensitivity = OBJSET_PROP_UNINITIALIZED;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Note: the changed_cb will be called once before the register
|
|
|
|
* func returns, thus changing the checksum/compression from the
|
2008-12-03 23:09:06 +03:00
|
|
|
* default (fletcher2/off). Snapshots don't need to know about
|
|
|
|
* checksum/compression/copies.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2014-06-06 01:19:08 +04:00
|
|
|
if (ds != NULL) {
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
os->os_encrypted = (ds->ds_dir->dd_crypto_obj != 0);
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
err = dsl_prop_register(ds,
|
|
|
|
zfs_prop_to_name(ZFS_PROP_PRIMARYCACHE),
|
2010-05-29 00:45:14 +04:00
|
|
|
primary_cache_changed_cb, os);
|
2013-09-04 16:00:57 +04:00
|
|
|
if (err == 0) {
|
|
|
|
err = dsl_prop_register(ds,
|
|
|
|
zfs_prop_to_name(ZFS_PROP_SECONDARYCACHE),
|
2010-05-29 00:45:14 +04:00
|
|
|
secondary_cache_changed_cb, os);
|
2013-09-04 16:00:57 +04:00
|
|
|
}
|
2023-10-24 21:00:07 +03:00
|
|
|
if (err == 0) {
|
|
|
|
err = dsl_prop_register(ds,
|
|
|
|
zfs_prop_to_name(ZFS_PROP_PREFETCH),
|
|
|
|
prefetch_changed_cb, os);
|
|
|
|
}
|
2015-04-02 06:44:32 +03:00
|
|
|
if (!ds->ds_is_snapshot) {
|
2013-09-04 16:00:57 +04:00
|
|
|
if (err == 0) {
|
|
|
|
err = dsl_prop_register(ds,
|
|
|
|
zfs_prop_to_name(ZFS_PROP_CHECKSUM),
|
2010-05-29 00:45:14 +04:00
|
|
|
checksum_changed_cb, os);
|
2013-09-04 16:00:57 +04:00
|
|
|
}
|
|
|
|
if (err == 0) {
|
|
|
|
err = dsl_prop_register(ds,
|
|
|
|
zfs_prop_to_name(ZFS_PROP_COMPRESSION),
|
2010-05-29 00:45:14 +04:00
|
|
|
compression_changed_cb, os);
|
2013-09-04 16:00:57 +04:00
|
|
|
}
|
|
|
|
if (err == 0) {
|
|
|
|
err = dsl_prop_register(ds,
|
|
|
|
zfs_prop_to_name(ZFS_PROP_COPIES),
|
2010-05-29 00:45:14 +04:00
|
|
|
copies_changed_cb, os);
|
2013-09-04 16:00:57 +04:00
|
|
|
}
|
|
|
|
if (err == 0) {
|
|
|
|
err = dsl_prop_register(ds,
|
|
|
|
zfs_prop_to_name(ZFS_PROP_DEDUP),
|
2010-05-29 00:45:14 +04:00
|
|
|
dedup_changed_cb, os);
|
2013-09-04 16:00:57 +04:00
|
|
|
}
|
|
|
|
if (err == 0) {
|
|
|
|
err = dsl_prop_register(ds,
|
|
|
|
zfs_prop_to_name(ZFS_PROP_LOGBIAS),
|
2010-05-29 00:45:14 +04:00
|
|
|
logbias_changed_cb, os);
|
2013-09-04 16:00:57 +04:00
|
|
|
}
|
|
|
|
if (err == 0) {
|
|
|
|
err = dsl_prop_register(ds,
|
|
|
|
zfs_prop_to_name(ZFS_PROP_SYNC),
|
2010-05-29 00:45:14 +04:00
|
|
|
sync_changed_cb, os);
|
2013-09-04 16:00:57 +04:00
|
|
|
}
|
2014-05-23 20:21:07 +04:00
|
|
|
if (err == 0) {
|
|
|
|
err = dsl_prop_register(ds,
|
|
|
|
zfs_prop_to_name(
|
|
|
|
ZFS_PROP_REDUNDANT_METADATA),
|
|
|
|
redundant_metadata_changed_cb, os);
|
|
|
|
}
|
2014-11-03 23:15:08 +03:00
|
|
|
if (err == 0) {
|
|
|
|
err = dsl_prop_register(ds,
|
|
|
|
zfs_prop_to_name(ZFS_PROP_RECORDSIZE),
|
|
|
|
recordsize_changed_cb, os);
|
|
|
|
}
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
if (err == 0) {
|
|
|
|
err = dsl_prop_register(ds,
|
|
|
|
zfs_prop_to_name(ZFS_PROP_DNODESIZE),
|
|
|
|
dnodesize_changed_cb, os);
|
|
|
|
}
|
2018-09-06 04:33:36 +03:00
|
|
|
if (err == 0) {
|
|
|
|
err = dsl_prop_register(ds,
|
|
|
|
zfs_prop_to_name(
|
|
|
|
ZFS_PROP_SPECIAL_SMALL_BLOCKS),
|
|
|
|
smallblk_changed_cb, os);
|
|
|
|
}
|
Adding Direct IO Support
Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads.
O_DIRECT support in ZFS will always ensure there is coherency between
buffered and O_DIRECT IO requests. This ensures that all IO requests,
whether buffered or direct, will see the same file contents at all
times. Just as in other FS's , O_DIRECT does not imply O_SYNC. While
data is written directly to VDEV disks, metadata will not be synced
until the associated TXG is synced.
For both O_DIRECT read and write request the offset and request sizes,
at a minimum, must be PAGE_SIZE aligned. In the event they are not,
then EINVAL is returned unless the direct property is set to always (see
below).
For O_DIRECT writes:
The request also must be block aligned (recordsize) or the write
request will take the normal (buffered) write path. In the event that
request is block aligned and a cached copy of the buffer in the ARC,
then it will be discarded from the ARC forcing all further reads to
retrieve the data from disk.
For O_DIRECT reads:
The only alignment restrictions are PAGE_SIZE alignment. In the event
that the requested data is in buffered (in the ARC) it will just be
copied from the ARC into the user buffer.
For both O_DIRECT writes and reads the O_DIRECT flag will be ignored in
the event that file contents are mmap'ed. In this case, all requests
that are at least PAGE_SIZE aligned will just fall back to the buffered
paths. If the request however is not PAGE_SIZE aligned, EINVAL will
be returned as always regardless if the file's contents are mmap'ed.
Since O_DIRECT writes go through the normal ZIO pipeline, the
following operations are supported just as with normal buffered writes:
Checksum
Compression
Encryption
Erasure Coding
There is one caveat for the data integrity of O_DIRECT writes that is
distinct for each of the OS's supported by ZFS.
FreeBSD - FreeBSD is able to place user pages under write protection so
any data in the user buffers and written directly down to the
VDEV disks is guaranteed to not change. There is no concern
with data integrity and O_DIRECT writes.
Linux - Linux is not able to place anonymous user pages under write
protection. Because of this, if the user decides to manipulate
the page contents while the write operation is occurring, data
integrity can not be guaranteed. However, there is a module
parameter `zfs_vdev_direct_write_verify` that controls the
if a O_DIRECT writes that can occur to a top-level VDEV before
a checksum verify is run before the contents of the I/O buffer
are committed to disk. In the event of a checksum verification
failure the write will return EIO. The number of O_DIRECT write
checksum verification errors can be observed by doing
`zpool status -d`, which will list all verification errors that
have occurred on a top-level VDEV. Along with `zpool status`, a
ZED event will be issues as `dio_verify` when a checksum
verification error occurs.
ZVOLs and dedup is not currently supported with Direct I/O.
A new dataset property `direct` has been added with the following 3
allowable values:
disabled - Accepts O_DIRECT flag, but silently ignores it and treats
the request as a buffered IO request.
standard - Follows the alignment restrictions outlined above for
write/read IO requests when the O_DIRECT flag is used.
always - Treats every write/read IO request as though it passed
O_DIRECT and will do O_DIRECT if the alignment restrictions
are met otherwise will redirect through the ARC. This
property will not allow a request to fail.
There is also a module parameter zfs_dio_enabled that can be used to
force all reads and writes through the ARC. By setting this module
parameter to 0, it mimics as if the direct dataset property is set to
disabled.
Reviewed-by: Brian Behlendorf <behlendorf@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Co-authored-by: Mark Maybee <mark.maybee@delphix.com>
Co-authored-by: Matt Macy <mmacy@FreeBSD.org>
Co-authored-by: Brian Behlendorf <behlendorf@llnl.gov>
Closes #10018
2024-09-14 23:47:59 +03:00
|
|
|
if (err == 0) {
|
|
|
|
err = dsl_prop_register(ds,
|
|
|
|
zfs_prop_to_name(ZFS_PROP_DIRECT),
|
|
|
|
direct_changed_cb, os);
|
|
|
|
}
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
2013-09-04 16:00:57 +04:00
|
|
|
if (err != 0) {
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_buf_destroy(os->os_phys_buf, &os->os_phys_buf);
|
2010-05-29 00:45:14 +04:00
|
|
|
kmem_free(os, sizeof (objset_t));
|
2008-11-20 23:01:55 +03:00
|
|
|
return (err);
|
|
|
|
}
|
2014-06-06 01:19:08 +04:00
|
|
|
} else {
|
2008-11-20 23:01:55 +03:00
|
|
|
/* It's the meta-objset. */
|
2010-05-29 00:45:14 +04:00
|
|
|
os->os_checksum = ZIO_CHECKSUM_FLETCHER_4;
|
2015-07-06 04:55:32 +03:00
|
|
|
os->os_compress = ZIO_COMPRESS_ON;
|
Add zstd support to zfs
This PR adds two new compression types, based on ZStandard:
- zstd: A basic ZStandard compression algorithm Available compression.
Levels for zstd are zstd-1 through zstd-19, where the compression
increases with every level, but speed decreases.
- zstd-fast: A faster version of the ZStandard compression algorithm
zstd-fast is basically a "negative" level of zstd. The compression
decreases with every level, but speed increases.
Available compression levels for zstd-fast:
- zstd-fast-1 through zstd-fast-10
- zstd-fast-20 through zstd-fast-100 (in increments of 10)
- zstd-fast-500 and zstd-fast-1000
For more information check the man page.
Implementation details:
Rather than treat each level of zstd as a different algorithm (as was
done historically with gzip), the block pointer `enum zio_compress`
value is simply zstd for all levels, including zstd-fast, since they all
use the same decompression function.
The compress= property (a 64bit unsigned integer) uses the lower 7 bits
to store the compression algorithm (matching the number of bits used in
a block pointer, as the 8th bit was borrowed for embedded block
pointers). The upper bits are used to store the compression level.
It is necessary to be able to determine what compression level was used
when later reading a block back, so the concept used in LZ4, where the
first 32bits of the on-disk value are the size of the compressed data
(since the allocation is rounded up to the nearest ashift), was
extended, and we store the version of ZSTD and the level as well as the
compressed size. This value is returned when decompressing a block, so
that if the block needs to be recompressed (L2ARC, nop-write, etc), that
the same parameters will be used to result in the matching checksum.
All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`,
`zio_prop_t`, etc.) uses the separated _compress and _complevel
variables. Only the properties ZAP contains the combined/bit-shifted
value. The combined value is split when the compression_changed_cb()
callback is called, and sets both objset members (os_compress and
os_complevel).
The userspace tools all use the combined/bit-shifted value.
Additional notes:
zdb can now also decode the ZSTD compression header (flag -Z) and
inspect the size, version and compression level saved in that header.
For each record, if it is ZSTD compressed, the parameters of the decoded
compression header get printed.
ZSTD is included with all current tests and new tests are added
as-needed.
Per-dataset feature flags now get activated when the property is set.
If a compression algorithm requires a feature flag, zfs activates the
feature when the property is set, rather than waiting for the first
block to be born. This is currently only used by zstd but can be
extended as needed.
Portions-Sponsored-By: The FreeBSD Foundation
Co-authored-by: Allan Jude <allanjude@freebsd.org>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Co-authored-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #6247
Closes #9024
Closes #10277
Closes #10278
2020-08-18 20:10:17 +03:00
|
|
|
os->os_complevel = ZIO_COMPLEVEL_DEFAULT;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
os->os_encrypted = B_FALSE;
|
2010-05-29 00:45:14 +04:00
|
|
|
os->os_copies = spa_max_replication(spa);
|
|
|
|
os->os_dedup_checksum = ZIO_CHECKSUM_OFF;
|
2014-05-23 20:21:07 +04:00
|
|
|
os->os_dedup_verify = B_FALSE;
|
|
|
|
os->os_logbias = ZFS_LOGBIAS_LATENCY;
|
|
|
|
os->os_sync = ZFS_SYNC_STANDARD;
|
2010-05-29 00:45:14 +04:00
|
|
|
os->os_primary_cache = ZFS_CACHE_ALL;
|
|
|
|
os->os_secondary_cache = ZFS_CACHE_ALL;
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
os->os_dnodesize = DNODE_MIN_SIZE;
|
2023-10-24 21:00:07 +03:00
|
|
|
os->os_prefetch = ZFS_PREFETCH_ALL;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2015-04-02 06:44:32 +03:00
|
|
|
if (ds == NULL || !ds->ds_is_snapshot)
|
2010-08-27 01:24:34 +04:00
|
|
|
os->os_zil_header = os->os_phys->os_zil_header;
|
2010-05-29 00:45:14 +04:00
|
|
|
os->os_zil = zil_alloc(os, &os->os_zil_header);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
for (i = 0; i < TXG_SIZE; i++) {
|
2021-06-10 19:42:31 +03:00
|
|
|
multilist_create(&os->os_dirty_dnodes[i], sizeof (dnode_t),
|
2017-03-21 04:36:00 +03:00
|
|
|
offsetof(dnode_t, dn_dirty_link[i]),
|
|
|
|
dnode_multilist_index_func);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
list_create(&os->os_dnodes, sizeof (dnode_t),
|
2008-11-20 23:01:55 +03:00
|
|
|
offsetof(dnode_t, dn_link));
|
2010-05-29 00:45:14 +04:00
|
|
|
list_create(&os->os_downgraded_dbufs, sizeof (dmu_buf_impl_t),
|
2008-11-20 23:01:55 +03:00
|
|
|
offsetof(dmu_buf_impl_t, db_link));
|
|
|
|
|
2015-04-02 06:44:32 +03:00
|
|
|
list_link_init(&os->os_evicting_node);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_init(&os->os_lock, NULL, MUTEX_DEFAULT, NULL);
|
2017-03-21 04:36:00 +03:00
|
|
|
mutex_init(&os->os_userused_lock, NULL, MUTEX_DEFAULT, NULL);
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_init(&os->os_obj_lock, NULL, MUTEX_DEFAULT, NULL);
|
|
|
|
mutex_init(&os->os_user_ptr_lock, NULL, MUTEX_DEFAULT, NULL);
|
OpenZFS 8199 - multi-threaded dmu_object_alloc()
dmu_object_alloc() is single-threaded, so when multiple threads are
creating files in a single filesystem, they spend a lot of time waiting
for the os_obj_lock. To improve performance of multi-threaded file
creation, we must make dmu_object_alloc() typically not grab any
filesystem-wide locks.
The solution is to have a "next object to allocate" for each CPU. Each
of these "next object"s is in a different block of the dnode object, so
that concurrent allocation holds dnodes in different dbufs. When a
thread's "next object" reaches the end of a chunk of objects (by default
4 blocks worth -- 128 dnodes), it will be reset to the per-objset
os_obj_next, which will be increased by a chunk of objects (128). Only
when manipulating the os_obj_next will we need to grab the os_obj_lock.
This decreases lock contention dramatically, because each thread only
needs to grab the os_obj_lock briefly, once per 128 allocations.
This results in a 70% performance improvement to multi-threaded object
creation (where each thread is creating objects in its own directory),
from 67,000/sec to 115,000/sec, with 8 CPUs.
Work sponsored by Intel Corp.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
OpenZFS-issue: https://www.illumos.org/issues/8199
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/374
Closes #4703
Closes #6117
2016-05-13 07:16:36 +03:00
|
|
|
os->os_obj_next_percpu_len = boot_ncpus;
|
|
|
|
os->os_obj_next_percpu = kmem_zalloc(os->os_obj_next_percpu_len *
|
|
|
|
sizeof (os->os_obj_next_percpu[0]), KM_SLEEP);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2015-04-02 06:44:32 +03:00
|
|
|
dnode_special_open(os, &os->os_phys->os_meta_dnode,
|
|
|
|
DMU_META_DNODE_OBJECT, &os->os_meta_dnode);
|
2018-02-14 01:54:54 +03:00
|
|
|
if (OBJSET_BUF_HAS_USERUSED(os->os_phys_buf)) {
|
2015-04-02 06:44:32 +03:00
|
|
|
dnode_special_open(os, &os->os_phys->os_userused_dnode,
|
|
|
|
DMU_USERUSED_OBJECT, &os->os_userused_dnode);
|
|
|
|
dnode_special_open(os, &os->os_phys->os_groupused_dnode,
|
|
|
|
DMU_GROUPUSED_OBJECT, &os->os_groupused_dnode);
|
2018-02-14 01:54:54 +03:00
|
|
|
if (OBJSET_BUF_HAS_PROJECTUSED(os->os_phys_buf))
|
|
|
|
dnode_special_open(os,
|
|
|
|
&os->os_phys->os_projectused_dnode,
|
|
|
|
DMU_PROJECTUSED_OBJECT, &os->os_projectused_dnode);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-10-04 21:46:10 +03:00
|
|
|
mutex_init(&os->os_upgrade_lock, NULL, MUTEX_DEFAULT, NULL);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
*osp = os;
|
2008-11-20 23:01:55 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
int
|
|
|
|
dmu_objset_from_ds(dsl_dataset_t *ds, objset_t **osp)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
int err = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-01-07 00:22:48 +03:00
|
|
|
/*
|
dmu_objset_from_ds must be called with dp_config_rwlock held
The normal lock order is that the dp_config_rwlock must be held before
the ds_opening_lock. For example, dmu_objset_hold() does this.
However, dmu_objset_open_impl() is called with the ds_opening_lock held,
and if the dp_config_rwlock is not already held, it will attempt to
acquire it. This may lead to deadlock, since the lock order is
reversed.
Looking at all the callers of dmu_objset_open_impl() (which is
principally the callers of dmu_objset_from_ds()), almost all callers
already have the dp_config_rwlock. However, there are a few places in
the send and receive code paths that do not. For example:
dsl_crypto_populate_key_nvlist, send_cb, dmu_recv_stream,
receive_write_byref, redact_traverse_thread.
This commit resolves the problem by requiring all callers ot
dmu_objset_from_ds() to hold the dp_config_rwlock. In most cases, the
code has been restructured such that we call dmu_objset_from_ds()
earlier on in the send and receive processes, when we already have the
dp_config_rwlock, and save the objset_t until we need it in the middle
of the send or receive (similar to what we already do with the
dsl_dataset_t). Thus we do not need to acquire the dp_config_rwlock in
many new places.
I also cleaned up code in dmu_redact_snap() and send_traverse_thread().
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #9662
Closes #10115
2020-03-12 20:55:02 +03:00
|
|
|
* We need the pool_config lock to manipulate the dsl_dataset_t.
|
|
|
|
* Even if the dataset is long-held, we need the pool_config lock
|
|
|
|
* to open the objset, as it needs to get properties.
|
2016-01-07 00:22:48 +03:00
|
|
|
*/
|
dmu_objset_from_ds must be called with dp_config_rwlock held
The normal lock order is that the dp_config_rwlock must be held before
the ds_opening_lock. For example, dmu_objset_hold() does this.
However, dmu_objset_open_impl() is called with the ds_opening_lock held,
and if the dp_config_rwlock is not already held, it will attempt to
acquire it. This may lead to deadlock, since the lock order is
reversed.
Looking at all the callers of dmu_objset_open_impl() (which is
principally the callers of dmu_objset_from_ds()), almost all callers
already have the dp_config_rwlock. However, there are a few places in
the send and receive code paths that do not. For example:
dsl_crypto_populate_key_nvlist, send_cb, dmu_recv_stream,
receive_write_byref, redact_traverse_thread.
This commit resolves the problem by requiring all callers ot
dmu_objset_from_ds() to hold the dp_config_rwlock. In most cases, the
code has been restructured such that we call dmu_objset_from_ds()
earlier on in the send and receive processes, when we already have the
dp_config_rwlock, and save the objset_t until we need it in the middle
of the send or receive (similar to what we already do with the
dsl_dataset_t). Thus we do not need to acquire the dp_config_rwlock in
many new places.
I also cleaned up code in dmu_redact_snap() and send_traverse_thread().
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #9662
Closes #10115
2020-03-12 20:55:02 +03:00
|
|
|
ASSERT(dsl_pool_config_held(ds->ds_dir->dd_pool));
|
2016-01-07 00:22:48 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_enter(&ds->ds_opening_lock);
|
2014-06-06 01:19:08 +04:00
|
|
|
if (ds->ds_objset == NULL) {
|
|
|
|
objset_t *os;
|
2017-01-27 22:43:42 +03:00
|
|
|
rrw_enter(&ds->ds_bp_rwlock, RW_READER, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
err = dmu_objset_open_impl(dsl_dataset_get_spa(ds),
|
2014-06-06 01:19:08 +04:00
|
|
|
ds, dsl_dataset_get_blkptr(ds), &os);
|
2017-01-27 22:43:42 +03:00
|
|
|
rrw_exit(&ds->ds_bp_rwlock, FTAG);
|
2014-06-06 01:19:08 +04:00
|
|
|
|
|
|
|
if (err == 0) {
|
|
|
|
mutex_enter(&ds->ds_lock);
|
|
|
|
ASSERT(ds->ds_objset == NULL);
|
|
|
|
ds->ds_objset = os;
|
|
|
|
mutex_exit(&ds->ds_lock);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2014-06-06 01:19:08 +04:00
|
|
|
*osp = ds->ds_objset;
|
2008-11-20 23:01:55 +03:00
|
|
|
mutex_exit(&ds->ds_opening_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
return (err);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
/*
|
|
|
|
* Holds the pool while the objset is held. Therefore only one objset
|
|
|
|
* can be held at a time.
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
int
|
2022-04-19 21:38:30 +03:00
|
|
|
dmu_objset_hold_flags(const char *name, boolean_t decrypt, const void *tag,
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
objset_t **osp)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2013-09-04 16:00:57 +04:00
|
|
|
dsl_pool_t *dp;
|
2010-05-29 00:45:14 +04:00
|
|
|
dsl_dataset_t *ds;
|
2008-11-20 23:01:55 +03:00
|
|
|
int err;
|
2020-12-28 03:31:02 +03:00
|
|
|
ds_hold_flags_t flags;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2020-12-28 03:31:02 +03:00
|
|
|
flags = (decrypt) ? DS_HOLD_FLAG_DECRYPT : DS_HOLD_FLAG_NONE;
|
2013-09-04 16:00:57 +04:00
|
|
|
err = dsl_pool_hold(name, tag, &dp);
|
|
|
|
if (err != 0)
|
|
|
|
return (err);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
err = dsl_dataset_hold_flags(dp, name, flags, tag, &ds);
|
2013-09-04 16:00:57 +04:00
|
|
|
if (err != 0) {
|
|
|
|
dsl_pool_rele(dp, tag);
|
2010-05-29 00:45:14 +04:00
|
|
|
return (err);
|
2013-09-04 16:00:57 +04:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
err = dmu_objset_from_ds(ds, osp);
|
2013-09-04 16:00:57 +04:00
|
|
|
if (err != 0) {
|
2010-05-29 00:45:14 +04:00
|
|
|
dsl_dataset_rele(ds, tag);
|
2013-09-04 16:00:57 +04:00
|
|
|
dsl_pool_rele(dp, tag);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
return (err);
|
|
|
|
}
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
int
|
2022-04-19 21:38:30 +03:00
|
|
|
dmu_objset_hold(const char *name, const void *tag, objset_t **osp)
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
{
|
|
|
|
return (dmu_objset_hold_flags(name, B_FALSE, tag, osp));
|
|
|
|
}
|
|
|
|
|
2015-05-06 19:07:55 +03:00
|
|
|
static int
|
|
|
|
dmu_objset_own_impl(dsl_dataset_t *ds, dmu_objset_type_t type,
|
2022-04-19 21:38:30 +03:00
|
|
|
boolean_t readonly, boolean_t decrypt, const void *tag, objset_t **osp)
|
2015-05-06 19:07:55 +03:00
|
|
|
{
|
2021-12-12 18:06:44 +03:00
|
|
|
(void) tag;
|
2015-05-06 19:07:55 +03:00
|
|
|
|
2021-12-12 18:06:44 +03:00
|
|
|
int err = dmu_objset_from_ds(ds, osp);
|
2015-05-06 19:07:55 +03:00
|
|
|
if (err != 0) {
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
return (err);
|
2015-05-06 19:07:55 +03:00
|
|
|
} else if (type != DMU_OST_ANY && type != (*osp)->os_phys->os_type) {
|
|
|
|
return (SET_ERROR(EINVAL));
|
|
|
|
} else if (!readonly && dsl_dataset_is_snapshot(ds)) {
|
|
|
|
return (SET_ERROR(EROFS));
|
2017-11-08 22:12:59 +03:00
|
|
|
} else if (!readonly && decrypt &&
|
|
|
|
dsl_dir_incompatible_encryption_version(ds->ds_dir)) {
|
|
|
|
return (SET_ERROR(EROFS));
|
2015-05-06 19:07:55 +03:00
|
|
|
}
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
|
|
|
|
/* if we are decrypting, we can now check MACs in os->os_phys_buf */
|
|
|
|
if (decrypt && arc_is_unauthenticated((*osp)->os_phys_buf)) {
|
2018-03-31 21:12:51 +03:00
|
|
|
zbookmark_phys_t zb;
|
|
|
|
|
|
|
|
SET_BOOKMARK(&zb, ds->ds_object, ZB_ROOT_OBJECT,
|
|
|
|
ZB_ROOT_LEVEL, ZB_ROOT_BLKID);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
err = arc_untransform((*osp)->os_phys_buf, (*osp)->os_spa,
|
2018-03-31 21:12:51 +03:00
|
|
|
&zb, B_FALSE);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (err != 0)
|
|
|
|
return (err);
|
|
|
|
|
|
|
|
ASSERT0(arc_is_unauthenticated((*osp)->os_phys_buf));
|
|
|
|
}
|
|
|
|
|
|
|
|
return (0);
|
2015-05-06 19:07:55 +03:00
|
|
|
}
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
/*
|
|
|
|
* dsl_pool must not be held when this is called.
|
|
|
|
* Upon successful return, there will be a longhold on the dataset,
|
|
|
|
* and the dsl_pool will not be held.
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
int
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_objset_own(const char *name, dmu_objset_type_t type,
|
2022-04-19 21:38:30 +03:00
|
|
|
boolean_t readonly, boolean_t decrypt, const void *tag, objset_t **osp)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2013-09-04 16:00:57 +04:00
|
|
|
dsl_pool_t *dp;
|
2008-11-20 23:01:55 +03:00
|
|
|
dsl_dataset_t *ds;
|
|
|
|
int err;
|
2020-12-28 03:31:02 +03:00
|
|
|
ds_hold_flags_t flags;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2020-12-28 03:31:02 +03:00
|
|
|
flags = (decrypt) ? DS_HOLD_FLAG_DECRYPT : DS_HOLD_FLAG_NONE;
|
2013-09-04 16:00:57 +04:00
|
|
|
err = dsl_pool_hold(name, FTAG, &dp);
|
|
|
|
if (err != 0)
|
|
|
|
return (err);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
err = dsl_dataset_own(dp, name, flags, tag, &ds);
|
2013-09-04 16:00:57 +04:00
|
|
|
if (err != 0) {
|
|
|
|
dsl_pool_rele(dp, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
return (err);
|
2013-09-04 16:00:57 +04:00
|
|
|
}
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
err = dmu_objset_own_impl(ds, type, readonly, decrypt, tag, osp);
|
|
|
|
if (err != 0) {
|
|
|
|
dsl_dataset_disown(ds, flags, tag);
|
|
|
|
dsl_pool_rele(dp, FTAG);
|
|
|
|
return (err);
|
|
|
|
}
|
|
|
|
|
2018-02-21 03:27:31 +03:00
|
|
|
/*
|
|
|
|
* User accounting requires the dataset to be decrypted and rw.
|
|
|
|
* We also don't begin user accounting during claiming to help
|
|
|
|
* speed up pool import times and to keep this txg reserved
|
|
|
|
* completely for recovery work.
|
|
|
|
*/
|
2020-11-16 20:10:29 +03:00
|
|
|
if (!readonly && !dp->dp_spa->spa_claiming &&
|
|
|
|
(ds->ds_dir->dd_crypto_obj == 0 || decrypt)) {
|
|
|
|
if (dmu_objset_userobjspace_upgradable(*osp) ||
|
|
|
|
dmu_objset_projectquota_upgradable(*osp)) {
|
|
|
|
dmu_objset_id_quota_upgrade(*osp);
|
|
|
|
} else if (dmu_objset_userused_enabled(*osp)) {
|
|
|
|
dmu_objset_userspace_upgrade(*osp);
|
|
|
|
}
|
|
|
|
}
|
2016-10-04 21:46:10 +03:00
|
|
|
|
2017-11-11 00:37:10 +03:00
|
|
|
dsl_pool_rele(dp, FTAG);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
return (0);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2015-05-06 19:07:55 +03:00
|
|
|
int
|
|
|
|
dmu_objset_own_obj(dsl_pool_t *dp, uint64_t obj, dmu_objset_type_t type,
|
2022-04-19 21:38:30 +03:00
|
|
|
boolean_t readonly, boolean_t decrypt, const void *tag, objset_t **osp)
|
2015-05-06 19:07:55 +03:00
|
|
|
{
|
|
|
|
dsl_dataset_t *ds;
|
|
|
|
int err;
|
2020-12-28 03:31:02 +03:00
|
|
|
ds_hold_flags_t flags;
|
2015-05-06 19:07:55 +03:00
|
|
|
|
2020-12-28 03:31:02 +03:00
|
|
|
flags = (decrypt) ? DS_HOLD_FLAG_DECRYPT : DS_HOLD_FLAG_NONE;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
err = dsl_dataset_own_obj(dp, obj, flags, tag, &ds);
|
2015-05-06 19:07:55 +03:00
|
|
|
if (err != 0)
|
|
|
|
return (err);
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
err = dmu_objset_own_impl(ds, type, readonly, decrypt, tag, osp);
|
|
|
|
if (err != 0) {
|
|
|
|
dsl_dataset_disown(ds, flags, tag);
|
|
|
|
return (err);
|
|
|
|
}
|
|
|
|
|
|
|
|
return (0);
|
2015-05-06 19:07:55 +03:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
void
|
2022-04-19 21:38:30 +03:00
|
|
|
dmu_objset_rele_flags(objset_t *os, boolean_t decrypt, const void *tag)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2020-12-28 03:31:02 +03:00
|
|
|
ds_hold_flags_t flags;
|
2013-09-04 16:00:57 +04:00
|
|
|
dsl_pool_t *dp = dmu_objset_pool(os);
|
2020-12-28 03:31:02 +03:00
|
|
|
|
|
|
|
flags = (decrypt) ? DS_HOLD_FLAG_DECRYPT : DS_HOLD_FLAG_NONE;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
dsl_dataset_rele_flags(os->os_dsl_dataset, flags, tag);
|
2013-09-04 16:00:57 +04:00
|
|
|
dsl_pool_rele(dp, tag);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2008-12-03 23:09:06 +03:00
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
void
|
2022-04-19 21:38:30 +03:00
|
|
|
dmu_objset_rele(objset_t *os, const void *tag)
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
{
|
|
|
|
dmu_objset_rele_flags(os, B_FALSE, tag);
|
|
|
|
}
|
|
|
|
|
2013-07-27 21:50:07 +04:00
|
|
|
/*
|
|
|
|
* When we are called, os MUST refer to an objset associated with a dataset
|
|
|
|
* that is owned by 'tag'; that is, is held and long held by 'tag' and ds_owner
|
|
|
|
* == tag. We will then release and reacquire ownership of the dataset while
|
|
|
|
* holding the pool config_rwlock to avoid intervening namespace or ownership
|
|
|
|
* changes may occur.
|
|
|
|
*
|
|
|
|
* This exists solely to accommodate zfs_ioc_userspace_upgrade()'s desire to
|
|
|
|
* release the hold on its dataset and acquire a new one on the dataset of the
|
|
|
|
* same name so that it can be partially torn down and reconstructed.
|
|
|
|
*/
|
|
|
|
void
|
2018-02-21 15:55:55 +03:00
|
|
|
dmu_objset_refresh_ownership(dsl_dataset_t *ds, dsl_dataset_t **newds,
|
2022-04-19 21:38:30 +03:00
|
|
|
boolean_t decrypt, const void *tag)
|
2013-07-27 21:50:07 +04:00
|
|
|
{
|
|
|
|
dsl_pool_t *dp;
|
2016-06-16 00:28:36 +03:00
|
|
|
char name[ZFS_MAX_DATASET_NAME_LEN];
|
2020-12-28 03:31:02 +03:00
|
|
|
ds_hold_flags_t flags;
|
2013-07-27 21:50:07 +04:00
|
|
|
|
2020-12-28 03:31:02 +03:00
|
|
|
flags = (decrypt) ? DS_HOLD_FLAG_DECRYPT : DS_HOLD_FLAG_NONE;
|
2013-07-27 21:50:07 +04:00
|
|
|
VERIFY3P(ds, !=, NULL);
|
|
|
|
VERIFY3P(ds->ds_owner, ==, tag);
|
|
|
|
VERIFY(dsl_dataset_long_held(ds));
|
|
|
|
|
|
|
|
dsl_dataset_name(ds, name);
|
2018-02-21 15:55:55 +03:00
|
|
|
dp = ds->ds_dir->dd_pool;
|
2013-07-27 21:50:07 +04:00
|
|
|
dsl_pool_config_enter(dp, FTAG);
|
2020-12-28 03:31:02 +03:00
|
|
|
dsl_dataset_disown(ds, flags, tag);
|
|
|
|
VERIFY0(dsl_dataset_own(dp, name, flags, tag, newds));
|
2013-07-27 21:50:07 +04:00
|
|
|
dsl_pool_config_exit(dp, FTAG);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
void
|
2022-04-19 21:38:30 +03:00
|
|
|
dmu_objset_disown(objset_t *os, boolean_t decrypt, const void *tag)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2020-12-28 03:31:02 +03:00
|
|
|
ds_hold_flags_t flags;
|
|
|
|
|
|
|
|
flags = (decrypt) ? DS_HOLD_FLAG_DECRYPT : DS_HOLD_FLAG_NONE;
|
2016-10-04 21:46:10 +03:00
|
|
|
/*
|
|
|
|
* Stop upgrading thread
|
|
|
|
*/
|
|
|
|
dmu_objset_upgrade_stop(os);
|
2020-12-28 03:31:02 +03:00
|
|
|
dsl_dataset_disown(os->os_dsl_dataset, flags, tag);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
void
|
2008-11-20 23:01:55 +03:00
|
|
|
dmu_objset_evict_dbufs(objset_t *os)
|
|
|
|
{
|
2015-04-02 06:44:32 +03:00
|
|
|
dnode_t *dn_marker;
|
2008-11-20 23:01:55 +03:00
|
|
|
dnode_t *dn;
|
|
|
|
|
2015-04-02 06:44:32 +03:00
|
|
|
dn_marker = kmem_alloc(sizeof (dnode_t), KM_SLEEP);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-04-02 06:44:32 +03:00
|
|
|
mutex_enter(&os->os_lock);
|
|
|
|
dn = list_head(&os->os_dnodes);
|
|
|
|
while (dn != NULL) {
|
|
|
|
/*
|
|
|
|
* Skip dnodes without holds. We have to do this dance
|
|
|
|
* because dnode_add_ref() only works if there is already a
|
|
|
|
* hold. If the dnode has no holds, then it has no dbufs.
|
|
|
|
*/
|
|
|
|
if (dnode_add_ref(dn, FTAG)) {
|
|
|
|
list_insert_after(&os->os_dnodes, dn, dn_marker);
|
|
|
|
mutex_exit(&os->os_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-04-02 06:44:32 +03:00
|
|
|
dnode_evict_dbufs(dn);
|
|
|
|
dnode_rele(dn, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-04-02 06:44:32 +03:00
|
|
|
mutex_enter(&os->os_lock);
|
|
|
|
dn = list_next(&os->os_dnodes, dn_marker);
|
|
|
|
list_remove(&os->os_dnodes, dn_marker);
|
|
|
|
} else {
|
|
|
|
dn = list_next(&os->os_dnodes, dn);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
mutex_exit(&os->os_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-04-02 06:44:32 +03:00
|
|
|
kmem_free(dn_marker, sizeof (dnode_t));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-04-02 06:44:32 +03:00
|
|
|
if (DMU_USERUSED_DNODE(os) != NULL) {
|
2018-02-14 01:54:54 +03:00
|
|
|
if (DMU_PROJECTUSED_DNODE(os) != NULL)
|
|
|
|
dnode_evict_dbufs(DMU_PROJECTUSED_DNODE(os));
|
2015-04-02 06:44:32 +03:00
|
|
|
dnode_evict_dbufs(DMU_GROUPUSED_DNODE(os));
|
|
|
|
dnode_evict_dbufs(DMU_USERUSED_DNODE(os));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2015-04-02 06:44:32 +03:00
|
|
|
dnode_evict_dbufs(DMU_META_DNODE(os));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2015-04-02 06:44:32 +03:00
|
|
|
/*
|
|
|
|
* Objset eviction processing is split into into two pieces.
|
|
|
|
* The first marks the objset as evicting, evicts any dbufs that
|
|
|
|
* have a refcount of zero, and then queues up the objset for the
|
|
|
|
* second phase of eviction. Once os->os_dnodes has been cleared by
|
|
|
|
* dnode_buf_pageout()->dnode_destroy(), the second phase is executed.
|
|
|
|
* The second phase closes the special dnodes, dequeues the objset from
|
|
|
|
* the list of those undergoing eviction, and finally frees the objset.
|
|
|
|
*
|
|
|
|
* NOTE: Due to asynchronous eviction processing (invocation of
|
|
|
|
* dnode_buf_pageout()), it is possible for the meta dnode for the
|
|
|
|
* objset to have no holds even though os->os_dnodes is not empty.
|
|
|
|
*/
|
2008-11-20 23:01:55 +03:00
|
|
|
void
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_objset_evict(objset_t *os)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2013-08-28 15:45:09 +04:00
|
|
|
dsl_dataset_t *ds = os->os_dsl_dataset;
|
|
|
|
|
2017-11-04 23:25:13 +03:00
|
|
|
for (int t = 0; t < TXG_SIZE; t++)
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(!dmu_objset_is_dirty(os, t));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-11-05 02:00:58 +03:00
|
|
|
if (ds)
|
|
|
|
dsl_prop_unregister_all(ds, os);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (os->os_sa)
|
|
|
|
sa_tear_down(os);
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
dmu_objset_evict_dbufs(os);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-04-02 06:44:32 +03:00
|
|
|
mutex_enter(&os->os_lock);
|
|
|
|
spa_evicting_os_register(os->os_spa, os);
|
|
|
|
if (list_is_empty(&os->os_dnodes)) {
|
|
|
|
mutex_exit(&os->os_lock);
|
|
|
|
dmu_objset_evict_done(os);
|
|
|
|
} else {
|
|
|
|
mutex_exit(&os->os_lock);
|
|
|
|
}
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
|
|
|
|
|
2015-04-02 06:44:32 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
dmu_objset_evict_done(objset_t *os)
|
|
|
|
{
|
|
|
|
ASSERT3P(list_head(&os->os_dnodes), ==, NULL);
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
dnode_special_close(&os->os_meta_dnode);
|
|
|
|
if (DMU_USERUSED_DNODE(os)) {
|
2018-02-14 01:54:54 +03:00
|
|
|
if (DMU_PROJECTUSED_DNODE(os))
|
|
|
|
dnode_special_close(&os->os_projectused_dnode);
|
2010-08-27 01:24:34 +04:00
|
|
|
dnode_special_close(&os->os_userused_dnode);
|
|
|
|
dnode_special_close(&os->os_groupused_dnode);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
zil_free(os->os_zil);
|
|
|
|
|
2016-06-02 07:04:53 +03:00
|
|
|
arc_buf_destroy(os->os_phys_buf, &os->os_phys_buf);
|
2010-08-27 01:24:34 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* This is a barrier to prevent the objset from going away in
|
|
|
|
* dnode_move() until we can safely ensure that the objset is still in
|
|
|
|
* use. We consider the objset valid before the barrier and invalid
|
|
|
|
* after the barrier.
|
|
|
|
*/
|
|
|
|
rw_enter(&os_lock, RW_READER);
|
|
|
|
rw_exit(&os_lock);
|
|
|
|
|
OpenZFS 8199 - multi-threaded dmu_object_alloc()
dmu_object_alloc() is single-threaded, so when multiple threads are
creating files in a single filesystem, they spend a lot of time waiting
for the os_obj_lock. To improve performance of multi-threaded file
creation, we must make dmu_object_alloc() typically not grab any
filesystem-wide locks.
The solution is to have a "next object to allocate" for each CPU. Each
of these "next object"s is in a different block of the dnode object, so
that concurrent allocation holds dnodes in different dbufs. When a
thread's "next object" reaches the end of a chunk of objects (by default
4 blocks worth -- 128 dnodes), it will be reset to the per-objset
os_obj_next, which will be increased by a chunk of objects (128). Only
when manipulating the os_obj_next will we need to grab the os_obj_lock.
This decreases lock contention dramatically, because each thread only
needs to grab the os_obj_lock briefly, once per 128 allocations.
This results in a 70% performance improvement to multi-threaded object
creation (where each thread is creating objects in its own directory),
from 67,000/sec to 115,000/sec, with 8 CPUs.
Work sponsored by Intel Corp.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
OpenZFS-issue: https://www.illumos.org/issues/8199
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/374
Closes #4703
Closes #6117
2016-05-13 07:16:36 +03:00
|
|
|
kmem_free(os->os_obj_next_percpu,
|
|
|
|
os->os_obj_next_percpu_len * sizeof (os->os_obj_next_percpu[0]));
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_destroy(&os->os_lock);
|
2017-03-21 04:36:00 +03:00
|
|
|
mutex_destroy(&os->os_userused_lock);
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_destroy(&os->os_obj_lock);
|
|
|
|
mutex_destroy(&os->os_user_ptr_lock);
|
2016-11-26 23:30:44 +03:00
|
|
|
mutex_destroy(&os->os_upgrade_lock);
|
2021-06-10 19:42:31 +03:00
|
|
|
for (int i = 0; i < TXG_SIZE; i++)
|
|
|
|
multilist_destroy(&os->os_dirty_dnodes[i]);
|
2015-04-02 06:44:32 +03:00
|
|
|
spa_evicting_os_deregister(os->os_spa, os);
|
2010-05-29 00:45:14 +04:00
|
|
|
kmem_free(os, sizeof (objset_t));
|
|
|
|
}
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2018-06-20 07:51:18 +03:00
|
|
|
inode_timespec_t
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_objset_snap_cmtime(objset_t *os)
|
|
|
|
{
|
|
|
|
return (dsl_dir_snap_cmtime(os->os_dsl_dataset->ds_dir));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t *
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
dmu_objset_create_impl_dnstats(spa_t *spa, dsl_dataset_t *ds, blkptr_t *bp,
|
|
|
|
dmu_objset_type_t type, int levels, int blksz, int ibs, dmu_tx_t *tx)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t *os;
|
2008-11-20 23:01:55 +03:00
|
|
|
dnode_t *mdn;
|
|
|
|
|
|
|
|
ASSERT(dmu_tx_is_syncing(tx));
|
2013-09-04 16:00:57 +04:00
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (blksz == 0)
|
|
|
|
blksz = DNODE_BLOCK_SIZE;
|
2017-09-12 23:15:11 +03:00
|
|
|
if (ibs == 0)
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ibs = DN_MAX_INDBLKSHIFT;
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
if (ds != NULL)
|
2013-09-04 16:00:57 +04:00
|
|
|
VERIFY0(dmu_objset_from_ds(ds, &os));
|
2010-08-27 01:24:34 +04:00
|
|
|
else
|
2013-09-04 16:00:57 +04:00
|
|
|
VERIFY0(dmu_objset_open_impl(spa, NULL, bp, &os));
|
2010-08-27 01:24:34 +04:00
|
|
|
|
|
|
|
mdn = DMU_META_DNODE(os);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
dnode_allocate(mdn, DMU_OT_DNODE, blksz, ibs, DMU_OT_NONE, 0,
|
|
|
|
DNODE_MIN_SLOTS, tx);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* We don't want to have to increase the meta-dnode's nlevels
|
2019-09-03 03:56:41 +03:00
|
|
|
* later, because then we could do it in quiescing context while
|
2008-11-20 23:01:55 +03:00
|
|
|
* we are also accessing it in open context.
|
|
|
|
*
|
|
|
|
* This precaution is not necessary for the MOS (ds == NULL),
|
|
|
|
* because the MOS is only updated in syncing context.
|
|
|
|
* This is most fortunate: the MOS is the only objset that
|
|
|
|
* needs to be synced multiple times as spa_sync() iterates
|
|
|
|
* to convergence, so minimizing its dn_nlevels matters.
|
|
|
|
*/
|
|
|
|
if (ds != NULL) {
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (levels == 0) {
|
|
|
|
levels = 1;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Determine the number of levels necessary for the
|
|
|
|
* meta-dnode to contain DN_MAX_OBJECT dnodes. Note
|
|
|
|
* that in order to ensure that we do not overflow
|
|
|
|
* 64 bits, there has to be a nlevels that gives us a
|
|
|
|
* number of blocks > DN_MAX_OBJECT but < 2^64.
|
|
|
|
* Therefore, (mdn->dn_indblkshift - SPA_BLKPTRSHIFT)
|
|
|
|
* (10) must be less than (64 - log2(DN_MAX_OBJECT))
|
|
|
|
* (16).
|
|
|
|
*/
|
|
|
|
while ((uint64_t)mdn->dn_nblkptr <<
|
|
|
|
(mdn->dn_datablkshift - DNODE_SHIFT + (levels - 1) *
|
|
|
|
(mdn->dn_indblkshift - SPA_BLKPTRSHIFT)) <
|
|
|
|
DN_MAX_OBJECT)
|
|
|
|
levels++;
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
mdn->dn_next_nlevels[tx->tx_txg & TXG_MASK] =
|
|
|
|
mdn->dn_nlevels = levels;
|
|
|
|
}
|
|
|
|
|
|
|
|
ASSERT(type != DMU_OST_NONE);
|
|
|
|
ASSERT(type != DMU_OST_ANY);
|
|
|
|
ASSERT(type < DMU_OST_NUMTYPES);
|
2010-05-29 00:45:14 +04:00
|
|
|
os->os_phys->os_type = type;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Enable user accounting if it is enabled and this is not an
|
|
|
|
* encrypted receive.
|
|
|
|
*/
|
|
|
|
if (dmu_objset_userused_enabled(os) &&
|
|
|
|
(!os->os_encrypted || !dmu_objset_is_receiving(os))) {
|
2010-05-29 00:45:14 +04:00
|
|
|
os->os_phys->os_flags |= OBJSET_FLAG_USERACCOUNTING_COMPLETE;
|
2016-10-04 21:46:10 +03:00
|
|
|
if (dmu_objset_userobjused_enabled(os)) {
|
2023-02-07 13:43:05 +03:00
|
|
|
ASSERT3P(ds, !=, NULL);
|
2018-10-16 21:15:04 +03:00
|
|
|
ds->ds_feature_activation[
|
|
|
|
SPA_FEATURE_USEROBJ_ACCOUNTING] = (void *)B_TRUE;
|
2016-10-04 21:46:10 +03:00
|
|
|
os->os_phys->os_flags |=
|
|
|
|
OBJSET_FLAG_USEROBJACCOUNTING_COMPLETE;
|
|
|
|
}
|
2018-02-14 01:54:54 +03:00
|
|
|
if (dmu_objset_projectquota_enabled(os)) {
|
2023-03-06 19:15:37 +03:00
|
|
|
ASSERT3P(ds, !=, NULL);
|
2018-10-16 21:15:04 +03:00
|
|
|
ds->ds_feature_activation[
|
|
|
|
SPA_FEATURE_PROJECT_QUOTA] = (void *)B_TRUE;
|
2018-02-14 01:54:54 +03:00
|
|
|
os->os_phys->os_flags |=
|
|
|
|
OBJSET_FLAG_PROJECTQUOTA_COMPLETE;
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
os->os_flags = os->os_phys->os_flags;
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
dsl_dataset_dirty(ds, tx);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
return (os);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
/* called from dsl for meta-objset */
|
|
|
|
objset_t *
|
|
|
|
dmu_objset_create_impl(spa_t *spa, dsl_dataset_t *ds, blkptr_t *bp,
|
|
|
|
dmu_objset_type_t type, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
return (dmu_objset_create_impl_dnstats(spa, ds, bp, type, 0, 0, 0, tx));
|
|
|
|
}
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
typedef struct dmu_objset_create_arg {
|
|
|
|
const char *doca_name;
|
|
|
|
cred_t *doca_cred;
|
2020-07-12 03:18:02 +03:00
|
|
|
proc_t *doca_proc;
|
2013-09-04 16:00:57 +04:00
|
|
|
void (*doca_userfunc)(objset_t *os, void *arg,
|
|
|
|
cred_t *cr, dmu_tx_t *tx);
|
|
|
|
void *doca_userarg;
|
|
|
|
dmu_objset_type_t doca_type;
|
|
|
|
uint64_t doca_flags;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
dsl_crypto_params_t *doca_dcp;
|
2013-09-04 16:00:57 +04:00
|
|
|
} dmu_objset_create_arg_t;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
static int
|
2013-09-04 16:00:57 +04:00
|
|
|
dmu_objset_create_check(void *arg, dmu_tx_t *tx)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2013-09-04 16:00:57 +04:00
|
|
|
dmu_objset_create_arg_t *doca = arg;
|
|
|
|
dsl_pool_t *dp = dmu_tx_pool(tx);
|
|
|
|
dsl_dir_t *pdd;
|
2019-02-09 02:44:15 +03:00
|
|
|
dsl_dataset_t *parentds;
|
|
|
|
objset_t *parentos;
|
2013-09-04 16:00:57 +04:00
|
|
|
const char *tail;
|
|
|
|
int error;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
if (strchr(doca->doca_name, '@') != NULL)
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(EINVAL));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2016-06-16 00:28:36 +03:00
|
|
|
if (strlen(doca->doca_name) >= ZFS_MAX_DATASET_NAME_LEN)
|
|
|
|
return (SET_ERROR(ENAMETOOLONG));
|
|
|
|
|
2016-09-12 18:15:20 +03:00
|
|
|
if (dataset_nestcheck(doca->doca_name) != 0)
|
|
|
|
return (SET_ERROR(ENAMETOOLONG));
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
error = dsl_dir_hold(dp, doca->doca_name, FTAG, &pdd, &tail);
|
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
if (tail == NULL) {
|
|
|
|
dsl_dir_rele(pdd, FTAG);
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(EEXIST));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
|
2018-06-18 22:47:12 +03:00
|
|
|
error = dmu_objset_create_crypt_check(pdd, doca->doca_dcp, NULL);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (error != 0) {
|
|
|
|
dsl_dir_rele(pdd, FTAG);
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
2015-04-01 16:07:48 +03:00
|
|
|
error = dsl_fs_ss_limit_check(pdd, 1, ZFS_PROP_FILESYSTEM_LIMIT, NULL,
|
2020-07-12 03:18:02 +03:00
|
|
|
doca->doca_cred, doca->doca_proc);
|
2019-02-09 02:44:15 +03:00
|
|
|
if (error != 0) {
|
|
|
|
dsl_dir_rele(pdd, FTAG);
|
|
|
|
return (error);
|
|
|
|
}
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
|
2019-02-09 02:44:15 +03:00
|
|
|
/* can't create below anything but filesystems (eg. no ZVOLs) */
|
|
|
|
error = dsl_dataset_hold_obj(pdd->dd_pool,
|
|
|
|
dsl_dir_phys(pdd)->dd_head_dataset_obj, FTAG, &parentds);
|
|
|
|
if (error != 0) {
|
|
|
|
dsl_dir_rele(pdd, FTAG);
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
error = dmu_objset_from_ds(parentds, &parentos);
|
|
|
|
if (error != 0) {
|
|
|
|
dsl_dataset_rele(parentds, FTAG);
|
|
|
|
dsl_dir_rele(pdd, FTAG);
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
if (dmu_objset_type(parentos) != DMU_OST_ZFS) {
|
|
|
|
dsl_dataset_rele(parentds, FTAG);
|
|
|
|
dsl_dir_rele(pdd, FTAG);
|
|
|
|
return (SET_ERROR(ZFS_ERR_WRONG_PARENT));
|
|
|
|
}
|
|
|
|
dsl_dataset_rele(parentds, FTAG);
|
2013-09-04 16:00:57 +04:00
|
|
|
dsl_dir_rele(pdd, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-04-01 16:07:48 +03:00
|
|
|
return (error);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
2013-09-04 16:00:57 +04:00
|
|
|
dmu_objset_create_sync(void *arg, dmu_tx_t *tx)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2013-09-04 16:00:57 +04:00
|
|
|
dmu_objset_create_arg_t *doca = arg;
|
|
|
|
dsl_pool_t *dp = dmu_tx_pool(tx);
|
2018-10-03 19:47:11 +03:00
|
|
|
spa_t *spa = dp->dp_spa;
|
2013-09-04 16:00:57 +04:00
|
|
|
dsl_dir_t *pdd;
|
|
|
|
const char *tail;
|
2013-08-28 15:45:09 +04:00
|
|
|
dsl_dataset_t *ds;
|
2013-09-04 16:00:57 +04:00
|
|
|
uint64_t obj;
|
2013-08-28 15:45:09 +04:00
|
|
|
blkptr_t *bp;
|
2013-09-04 16:00:57 +04:00
|
|
|
objset_t *os;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
zio_t *rzio;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
VERIFY0(dsl_dir_hold(dp, doca->doca_name, FTAG, &pdd, &tail));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
obj = dsl_dataset_create_sync(pdd, tail, NULL, doca->doca_flags,
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
doca->doca_cred, doca->doca_dcp, tx);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
VERIFY0(dsl_dataset_hold_obj_flags(pdd->dd_pool, obj,
|
|
|
|
DS_HOLD_FLAG_DECRYPT, FTAG, &ds));
|
2017-01-27 22:43:42 +03:00
|
|
|
rrw_enter(&ds->ds_bp_rwlock, RW_READER, FTAG);
|
2013-08-28 15:45:09 +04:00
|
|
|
bp = dsl_dataset_get_blkptr(ds);
|
2018-10-03 19:47:11 +03:00
|
|
|
os = dmu_objset_create_impl(spa, ds, bp, doca->doca_type, tx);
|
2017-01-27 22:43:42 +03:00
|
|
|
rrw_exit(&ds->ds_bp_rwlock, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
if (doca->doca_userfunc != NULL) {
|
|
|
|
doca->doca_userfunc(os, doca->doca_userarg,
|
|
|
|
doca->doca_cred, tx);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
/*
|
2017-09-12 23:15:11 +03:00
|
|
|
* The doca_userfunc() may write out some data that needs to be
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
* encrypted if the dataset is encrypted (specifically the root
|
|
|
|
* directory). This data must be written out before the encryption
|
|
|
|
* key mapping is removed by dsl_dataset_rele_flags(). Force the
|
|
|
|
* I/O to occur immediately by invoking the relevant sections of
|
|
|
|
* dsl_pool_sync().
|
|
|
|
*/
|
|
|
|
if (os->os_encrypted) {
|
|
|
|
dsl_dataset_t *tmpds = NULL;
|
|
|
|
boolean_t need_sync_done = B_FALSE;
|
|
|
|
|
2017-09-12 23:15:11 +03:00
|
|
|
mutex_enter(&ds->ds_lock);
|
|
|
|
ds->ds_owner = FTAG;
|
|
|
|
mutex_exit(&ds->ds_lock);
|
|
|
|
|
2018-10-03 19:47:11 +03:00
|
|
|
rzio = zio_root(spa, NULL, NULL, ZIO_FLAG_MUSTSUCCEED);
|
2017-09-12 23:15:11 +03:00
|
|
|
tmpds = txg_list_remove_this(&dp->dp_dirty_datasets, ds,
|
|
|
|
tx->tx_txg);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (tmpds != NULL) {
|
|
|
|
dsl_dataset_sync(ds, rzio, tx);
|
|
|
|
need_sync_done = B_TRUE;
|
|
|
|
}
|
|
|
|
VERIFY0(zio_wait(rzio));
|
|
|
|
|
Improve zfs receive performance with lightweight write
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes. Much of the CPU is spent creating and
destroying dbuf's and arc buf's, one for each `WRITE` record in the send
stream.
This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this
"dirty leaf block" are not instantiated.
Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data). Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.
Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.
Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.
Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
The code is also restructured in a few ways:
Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines. The new field is needed by the lightweight-type dirty record.
To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that the `dnode_move()` doesn't relocate the
dnode_t. To do this we keep a hold on the dnode until it's zio's have
completed. This is already done by the user-accounting code
(`userquota_updates_task()`), this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).
`dn_dirty_txg` was previously zeroed when the dnode was synced. This
was not necessary, since its meaning can be "when was this dnode last
dirtied". This change simplifies the new `dnode_rele_task()` code.
Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
2020-12-11 21:26:02 +03:00
|
|
|
dmu_objset_sync_done(os, tx);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
taskq_wait(dp->dp_sync_taskq);
|
2018-10-03 19:47:11 +03:00
|
|
|
if (txg_list_member(&dp->dp_dirty_datasets, ds, tx->tx_txg)) {
|
|
|
|
ASSERT3P(ds->ds_key_mapping, !=, NULL);
|
|
|
|
key_mapping_rele(spa, ds->ds_key_mapping, ds);
|
|
|
|
}
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
|
2018-10-03 19:47:11 +03:00
|
|
|
rzio = zio_root(spa, NULL, NULL, ZIO_FLAG_MUSTSUCCEED);
|
2017-09-12 23:15:11 +03:00
|
|
|
tmpds = txg_list_remove_this(&dp->dp_dirty_datasets, ds,
|
|
|
|
tx->tx_txg);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (tmpds != NULL) {
|
|
|
|
dmu_buf_rele(ds->ds_dbuf, ds);
|
|
|
|
dsl_dataset_sync(ds, rzio, tx);
|
|
|
|
}
|
|
|
|
VERIFY0(zio_wait(rzio));
|
|
|
|
|
2018-10-03 19:47:11 +03:00
|
|
|
if (need_sync_done) {
|
|
|
|
ASSERT3P(ds->ds_key_mapping, !=, NULL);
|
|
|
|
key_mapping_rele(spa, ds->ds_key_mapping, ds);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
dsl_dataset_sync_done(ds, tx);
|
2023-02-24 04:14:52 +03:00
|
|
|
dmu_buf_rele(ds->ds_dbuf, ds);
|
2018-10-03 19:47:11 +03:00
|
|
|
}
|
2017-09-12 23:15:11 +03:00
|
|
|
|
|
|
|
mutex_enter(&ds->ds_lock);
|
|
|
|
ds->ds_owner = NULL;
|
|
|
|
mutex_exit(&ds->ds_lock);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
}
|
|
|
|
|
2019-09-12 23:28:26 +03:00
|
|
|
spa_history_log_internal_ds(ds, "create", tx, " ");
|
2014-03-22 13:07:14 +04:00
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
dsl_dataset_rele_flags(ds, DS_HOLD_FLAG_DECRYPT, FTAG);
|
2013-09-04 16:00:57 +04:00
|
|
|
dsl_dir_rele(pdd, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
int
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_objset_create(const char *name, dmu_objset_type_t type, uint64_t flags,
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
dsl_crypto_params_t *dcp, dmu_objset_create_sync_func_t func, void *arg)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2013-09-04 16:00:57 +04:00
|
|
|
dmu_objset_create_arg_t doca;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
dsl_crypto_params_t tmp_dcp = { 0 };
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
doca.doca_name = name;
|
|
|
|
doca.doca_cred = CRED();
|
2020-07-12 03:18:02 +03:00
|
|
|
doca.doca_proc = curproc;
|
2013-09-04 16:00:57 +04:00
|
|
|
doca.doca_flags = flags;
|
|
|
|
doca.doca_userfunc = func;
|
|
|
|
doca.doca_userarg = arg;
|
|
|
|
doca.doca_type = type;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
/*
|
|
|
|
* Some callers (mostly for testing) do not provide a dcp on their
|
|
|
|
* own but various code inside the sync task will require it to be
|
|
|
|
* allocated. Rather than adding NULL checks throughout this code
|
|
|
|
* or adding dummy dcp's to all of the callers we simply create a
|
|
|
|
* dummy one here and use that. This zero dcp will have the same
|
Adopt pyzfs from ClusterHQ
This commit introduces several changes:
* Update LICENSE and project information
* Give a good PEP8 talk to existing Python source code
* Add RPM/DEB packaging for pyzfs
* Fix some outstanding issues with the existing pyzfs code caused by
changes in the ABI since the last time the code was updated
* Integrate pyzfs Python unittest with the ZFS Test Suite
* Add missing libzfs_core functions: lzc_change_key,
lzc_channel_program, lzc_channel_program_nosync, lzc_load_key,
lzc_receive_one, lzc_receive_resumable, lzc_receive_with_cmdprops,
lzc_receive_with_header, lzc_reopen, lzc_send_resume, lzc_sync,
lzc_unload_key, lzc_remap
Note: this commit slightly changes zfs_ioc_unload_key() ABI. This allow
to differentiate the case where we tried to unload a key on a
non-existing dataset (ENOENT) from the situation where a dataset has
no key loaded: this is consistent with the "change" case where trying
to zfs_ioc_change_key() from a dataset with no key results in EACCES.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7230
2018-03-18 11:34:45 +03:00
|
|
|
* effect as asking for inheritance of all encryption params.
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
*/
|
|
|
|
doca.doca_dcp = (dcp != NULL) ? dcp : &tmp_dcp;
|
|
|
|
|
async zvol minor node creation interferes with receive
When we finish a zfs receive, dmu_recv_end_sync() calls
zvol_create_minors(async=TRUE). This kicks off some other threads that
create the minor device nodes (in /dev/zvol/poolname/...). These async
threads call zvol_prefetch_minors_impl() and zvol_create_minor(), which
both call dmu_objset_own(), which puts a "long hold" on the dataset.
Since the zvol minor node creation is asynchronous, this can happen
after the `ZFS_IOC_RECV[_NEW]` ioctl and `zfs receive` process have
completed.
After the first receive ioctl has completed, userland may attempt to do
another receive into the same dataset (e.g. the next incremental
stream). This second receive and the asynchronous minor node creation
can interfere with one another in several different ways, because they
both require exclusive access to the dataset:
1. When the second receive is finishing up, dmu_recv_end_check() does
dsl_dataset_handoff_check(), which can fail with EBUSY if the async
minor node creation already has a "long hold" on this dataset. This
causes the 2nd receive to fail.
2. The async udev rule can fail if zvol_id and/or systemd-udevd try to
open the device while the the second receive's async attempt at minor
node creation owns the dataset (via zvol_prefetch_minors_impl). This
causes the minor node (/dev/zd*) to exist, but the udev-generated
/dev/zvol/... to not exist.
3. The async minor node creation can silently fail with EBUSY if the
first receive's zvol_create_minor() trys to own the dataset while the
second receive's zvol_prefetch_minors_impl already owns the dataset.
To address these problems, this change synchronously creates the minor
node. To avoid the lock ordering problems that the asynchrony was
introduced to fix (see #3681), we create the minor nodes from open
context, with no locks held, rather than from syncing contex as was
originally done.
Implementation notes:
We generally do not need to traverse children or prefetch anything (e.g.
when running the recv, snapshot, create, or clone subcommands of zfs).
We only need recursion when importing/opening a pool and when loading
encryption keys. The existing recursive, asynchronous, prefetching code
is preserved for use in these cases.
Channel programs may need to create zvol minor nodes, when creating a
snapshot of a zvol with the snapdev property set. We figure out what
snapshots are created when running the LUA program in syncing context.
In this case we need to remember what snapshots were created, and then
try to create their minor nodes from open context, after the LUA code
has completed.
There are additional zvol use cases that asynchronously own the dataset,
which can cause similar problems. E.g. changing the volmode or snapdev
properties. These are less problematic because they are not recursive
and don't touch datasets that are not involved in the operation, there
is still potential for interference with subsequent operations. In the
future, these cases should be similarly converted to create the zvol
minor node synchronously from open context.
The async tasks of removing and renaming minors do not own the objset,
so they do not have this problem. However, it may make sense to also
convert these operations to happen synchronously from open context, in
the future.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-65948
Closes #7863
Closes #9885
2020-02-03 20:33:14 +03:00
|
|
|
int rv = dsl_sync_task(name,
|
2014-11-03 23:28:43 +03:00
|
|
|
dmu_objset_create_check, dmu_objset_create_sync, &doca,
|
async zvol minor node creation interferes with receive
When we finish a zfs receive, dmu_recv_end_sync() calls
zvol_create_minors(async=TRUE). This kicks off some other threads that
create the minor device nodes (in /dev/zvol/poolname/...). These async
threads call zvol_prefetch_minors_impl() and zvol_create_minor(), which
both call dmu_objset_own(), which puts a "long hold" on the dataset.
Since the zvol minor node creation is asynchronous, this can happen
after the `ZFS_IOC_RECV[_NEW]` ioctl and `zfs receive` process have
completed.
After the first receive ioctl has completed, userland may attempt to do
another receive into the same dataset (e.g. the next incremental
stream). This second receive and the asynchronous minor node creation
can interfere with one another in several different ways, because they
both require exclusive access to the dataset:
1. When the second receive is finishing up, dmu_recv_end_check() does
dsl_dataset_handoff_check(), which can fail with EBUSY if the async
minor node creation already has a "long hold" on this dataset. This
causes the 2nd receive to fail.
2. The async udev rule can fail if zvol_id and/or systemd-udevd try to
open the device while the the second receive's async attempt at minor
node creation owns the dataset (via zvol_prefetch_minors_impl). This
causes the minor node (/dev/zd*) to exist, but the udev-generated
/dev/zvol/... to not exist.
3. The async minor node creation can silently fail with EBUSY if the
first receive's zvol_create_minor() trys to own the dataset while the
second receive's zvol_prefetch_minors_impl already owns the dataset.
To address these problems, this change synchronously creates the minor
node. To avoid the lock ordering problems that the asynchrony was
introduced to fix (see #3681), we create the minor nodes from open
context, with no locks held, rather than from syncing contex as was
originally done.
Implementation notes:
We generally do not need to traverse children or prefetch anything (e.g.
when running the recv, snapshot, create, or clone subcommands of zfs).
We only need recursion when importing/opening a pool and when loading
encryption keys. The existing recursive, asynchronous, prefetching code
is preserved for use in these cases.
Channel programs may need to create zvol minor nodes, when creating a
snapshot of a zvol with the snapdev property set. We figure out what
snapshots are created when running the LUA program in syncing context.
In this case we need to remember what snapshots were created, and then
try to create their minor nodes from open context, after the LUA code
has completed.
There are additional zvol use cases that asynchronously own the dataset,
which can cause similar problems. E.g. changing the volmode or snapdev
properties. These are less problematic because they are not recursive
and don't touch datasets that are not involved in the operation, there
is still potential for interference with subsequent operations. In the
future, these cases should be similarly converted to create the zvol
minor node synchronously from open context.
The async tasks of removing and renaming minors do not own the objset,
so they do not have this problem. However, it may make sense to also
convert these operations to happen synchronously from open context, in
the future.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-65948
Closes #7863
Closes #9885
2020-02-03 20:33:14 +03:00
|
|
|
6, ZFS_SPACE_CHECK_NORMAL);
|
|
|
|
|
|
|
|
if (rv == 0)
|
|
|
|
zvol_create_minor(name);
|
|
|
|
return (rv);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
typedef struct dmu_objset_clone_arg {
|
|
|
|
const char *doca_clone;
|
|
|
|
const char *doca_origin;
|
|
|
|
cred_t *doca_cred;
|
2020-07-12 03:18:02 +03:00
|
|
|
proc_t *doca_proc;
|
2013-09-04 16:00:57 +04:00
|
|
|
} dmu_objset_clone_arg_t;
|
|
|
|
|
|
|
|
static int
|
|
|
|
dmu_objset_clone_check(void *arg, dmu_tx_t *tx)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2013-09-04 16:00:57 +04:00
|
|
|
dmu_objset_clone_arg_t *doca = arg;
|
2010-05-29 00:45:14 +04:00
|
|
|
dsl_dir_t *pdd;
|
|
|
|
const char *tail;
|
2013-09-04 16:00:57 +04:00
|
|
|
int error;
|
|
|
|
dsl_dataset_t *origin;
|
|
|
|
dsl_pool_t *dp = dmu_tx_pool(tx);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
if (strchr(doca->doca_clone, '@') != NULL)
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(EINVAL));
|
2013-09-04 16:00:57 +04:00
|
|
|
|
2016-06-16 00:28:36 +03:00
|
|
|
if (strlen(doca->doca_clone) >= ZFS_MAX_DATASET_NAME_LEN)
|
|
|
|
return (SET_ERROR(ENAMETOOLONG));
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
error = dsl_dir_hold(dp, doca->doca_clone, FTAG, &pdd, &tail);
|
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
2010-05-29 00:45:14 +04:00
|
|
|
if (tail == NULL) {
|
2013-09-04 16:00:57 +04:00
|
|
|
dsl_dir_rele(pdd, FTAG);
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(EEXIST));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2015-07-11 02:45:01 +03:00
|
|
|
|
2015-04-01 16:07:48 +03:00
|
|
|
error = dsl_fs_ss_limit_check(pdd, 1, ZFS_PROP_FILESYSTEM_LIMIT, NULL,
|
2020-07-12 03:18:02 +03:00
|
|
|
doca->doca_cred, doca->doca_proc);
|
2015-04-01 16:07:48 +03:00
|
|
|
if (error != 0) {
|
|
|
|
dsl_dir_rele(pdd, FTAG);
|
|
|
|
return (SET_ERROR(EDQUOT));
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
error = dsl_dataset_hold(dp, doca->doca_origin, FTAG, &origin);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (error != 0) {
|
|
|
|
dsl_dir_rele(pdd, FTAG);
|
2010-08-27 01:24:34 +04:00
|
|
|
return (error);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
}
|
2010-08-27 01:24:34 +04:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
/* You can only clone snapshots, not the head datasets. */
|
2015-04-02 06:44:32 +03:00
|
|
|
if (!origin->ds_is_snapshot) {
|
2013-09-04 16:00:57 +04:00
|
|
|
dsl_dataset_rele(origin, FTAG);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
dsl_dir_rele(pdd, FTAG);
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(EINVAL));
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
dsl_dataset_rele(origin, FTAG);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
dsl_dir_rele(pdd, FTAG);
|
2010-08-27 01:24:34 +04:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
return (0);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
static void
|
|
|
|
dmu_objset_clone_sync(void *arg, dmu_tx_t *tx)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2013-09-04 16:00:57 +04:00
|
|
|
dmu_objset_clone_arg_t *doca = arg;
|
|
|
|
dsl_pool_t *dp = dmu_tx_pool(tx);
|
|
|
|
dsl_dir_t *pdd;
|
|
|
|
const char *tail;
|
|
|
|
dsl_dataset_t *origin, *ds;
|
|
|
|
uint64_t obj;
|
2016-06-16 00:28:36 +03:00
|
|
|
char namebuf[ZFS_MAX_DATASET_NAME_LEN];
|
2013-08-28 15:45:09 +04:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
VERIFY0(dsl_dir_hold(dp, doca->doca_clone, FTAG, &pdd, &tail));
|
|
|
|
VERIFY0(dsl_dataset_hold(dp, doca->doca_origin, FTAG, &origin));
|
2013-08-28 15:45:09 +04:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
obj = dsl_dataset_create_sync(pdd, tail, origin, 0,
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
doca->doca_cred, NULL, tx);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
VERIFY0(dsl_dataset_hold_obj(pdd->dd_pool, obj, FTAG, &ds));
|
|
|
|
dsl_dataset_name(origin, namebuf);
|
|
|
|
spa_history_log_internal_ds(ds, "clone", tx,
|
2019-09-12 23:28:26 +03:00
|
|
|
"origin=%s (%llu)", namebuf, (u_longlong_t)origin->ds_object);
|
2013-09-04 16:00:57 +04:00
|
|
|
dsl_dataset_rele(ds, FTAG);
|
|
|
|
dsl_dataset_rele(origin, FTAG);
|
|
|
|
dsl_dir_rele(pdd, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
int
|
2013-09-04 16:00:57 +04:00
|
|
|
dmu_objset_clone(const char *clone, const char *origin)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2013-09-04 16:00:57 +04:00
|
|
|
dmu_objset_clone_arg_t doca;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
doca.doca_clone = clone;
|
|
|
|
doca.doca_origin = origin;
|
|
|
|
doca.doca_cred = CRED();
|
2020-07-12 03:18:02 +03:00
|
|
|
doca.doca_proc = curproc;
|
2010-08-27 01:24:34 +04:00
|
|
|
|
async zvol minor node creation interferes with receive
When we finish a zfs receive, dmu_recv_end_sync() calls
zvol_create_minors(async=TRUE). This kicks off some other threads that
create the minor device nodes (in /dev/zvol/poolname/...). These async
threads call zvol_prefetch_minors_impl() and zvol_create_minor(), which
both call dmu_objset_own(), which puts a "long hold" on the dataset.
Since the zvol minor node creation is asynchronous, this can happen
after the `ZFS_IOC_RECV[_NEW]` ioctl and `zfs receive` process have
completed.
After the first receive ioctl has completed, userland may attempt to do
another receive into the same dataset (e.g. the next incremental
stream). This second receive and the asynchronous minor node creation
can interfere with one another in several different ways, because they
both require exclusive access to the dataset:
1. When the second receive is finishing up, dmu_recv_end_check() does
dsl_dataset_handoff_check(), which can fail with EBUSY if the async
minor node creation already has a "long hold" on this dataset. This
causes the 2nd receive to fail.
2. The async udev rule can fail if zvol_id and/or systemd-udevd try to
open the device while the the second receive's async attempt at minor
node creation owns the dataset (via zvol_prefetch_minors_impl). This
causes the minor node (/dev/zd*) to exist, but the udev-generated
/dev/zvol/... to not exist.
3. The async minor node creation can silently fail with EBUSY if the
first receive's zvol_create_minor() trys to own the dataset while the
second receive's zvol_prefetch_minors_impl already owns the dataset.
To address these problems, this change synchronously creates the minor
node. To avoid the lock ordering problems that the asynchrony was
introduced to fix (see #3681), we create the minor nodes from open
context, with no locks held, rather than from syncing contex as was
originally done.
Implementation notes:
We generally do not need to traverse children or prefetch anything (e.g.
when running the recv, snapshot, create, or clone subcommands of zfs).
We only need recursion when importing/opening a pool and when loading
encryption keys. The existing recursive, asynchronous, prefetching code
is preserved for use in these cases.
Channel programs may need to create zvol minor nodes, when creating a
snapshot of a zvol with the snapdev property set. We figure out what
snapshots are created when running the LUA program in syncing context.
In this case we need to remember what snapshots were created, and then
try to create their minor nodes from open context, after the LUA code
has completed.
There are additional zvol use cases that asynchronously own the dataset,
which can cause similar problems. E.g. changing the volmode or snapdev
properties. These are less problematic because they are not recursive
and don't touch datasets that are not involved in the operation, there
is still potential for interference with subsequent operations. In the
future, these cases should be similarly converted to create the zvol
minor node synchronously from open context.
The async tasks of removing and renaming minors do not own the objset,
so they do not have this problem. However, it may make sense to also
convert these operations to happen synchronously from open context, in
the future.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-65948
Closes #7863
Closes #9885
2020-02-03 20:33:14 +03:00
|
|
|
int rv = dsl_sync_task(clone,
|
2014-11-03 23:28:43 +03:00
|
|
|
dmu_objset_clone_check, dmu_objset_clone_sync, &doca,
|
async zvol minor node creation interferes with receive
When we finish a zfs receive, dmu_recv_end_sync() calls
zvol_create_minors(async=TRUE). This kicks off some other threads that
create the minor device nodes (in /dev/zvol/poolname/...). These async
threads call zvol_prefetch_minors_impl() and zvol_create_minor(), which
both call dmu_objset_own(), which puts a "long hold" on the dataset.
Since the zvol minor node creation is asynchronous, this can happen
after the `ZFS_IOC_RECV[_NEW]` ioctl and `zfs receive` process have
completed.
After the first receive ioctl has completed, userland may attempt to do
another receive into the same dataset (e.g. the next incremental
stream). This second receive and the asynchronous minor node creation
can interfere with one another in several different ways, because they
both require exclusive access to the dataset:
1. When the second receive is finishing up, dmu_recv_end_check() does
dsl_dataset_handoff_check(), which can fail with EBUSY if the async
minor node creation already has a "long hold" on this dataset. This
causes the 2nd receive to fail.
2. The async udev rule can fail if zvol_id and/or systemd-udevd try to
open the device while the the second receive's async attempt at minor
node creation owns the dataset (via zvol_prefetch_minors_impl). This
causes the minor node (/dev/zd*) to exist, but the udev-generated
/dev/zvol/... to not exist.
3. The async minor node creation can silently fail with EBUSY if the
first receive's zvol_create_minor() trys to own the dataset while the
second receive's zvol_prefetch_minors_impl already owns the dataset.
To address these problems, this change synchronously creates the minor
node. To avoid the lock ordering problems that the asynchrony was
introduced to fix (see #3681), we create the minor nodes from open
context, with no locks held, rather than from syncing contex as was
originally done.
Implementation notes:
We generally do not need to traverse children or prefetch anything (e.g.
when running the recv, snapshot, create, or clone subcommands of zfs).
We only need recursion when importing/opening a pool and when loading
encryption keys. The existing recursive, asynchronous, prefetching code
is preserved for use in these cases.
Channel programs may need to create zvol minor nodes, when creating a
snapshot of a zvol with the snapdev property set. We figure out what
snapshots are created when running the LUA program in syncing context.
In this case we need to remember what snapshots were created, and then
try to create their minor nodes from open context, after the LUA code
has completed.
There are additional zvol use cases that asynchronously own the dataset,
which can cause similar problems. E.g. changing the volmode or snapdev
properties. These are less problematic because they are not recursive
and don't touch datasets that are not involved in the operation, there
is still potential for interference with subsequent operations. In the
future, these cases should be similarly converted to create the zvol
minor node synchronously from open context.
The async tasks of removing and renaming minors do not own the objset,
so they do not have this problem. However, it may make sense to also
convert these operations to happen synchronously from open context, in
the future.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-65948
Closes #7863
Closes #9885
2020-02-03 20:33:14 +03:00
|
|
|
6, ZFS_SPACE_CHECK_NORMAL);
|
|
|
|
|
|
|
|
if (rv == 0)
|
|
|
|
zvol_create_minor(clone);
|
|
|
|
|
|
|
|
return (rv);
|
2013-08-28 15:45:09 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
int
|
|
|
|
dmu_objset_snapshot_one(const char *fsname, const char *snapname)
|
|
|
|
{
|
|
|
|
int err;
|
|
|
|
char *longsnap = kmem_asprintf("%s@%s", fsname, snapname);
|
|
|
|
nvlist_t *snaps = fnvlist_alloc();
|
|
|
|
|
|
|
|
fnvlist_add_boolean(snaps, longsnap);
|
2019-10-10 19:47:06 +03:00
|
|
|
kmem_strfree(longsnap);
|
2013-09-04 16:00:57 +04:00
|
|
|
err = dsl_dataset_snapshot(snaps, NULL, NULL);
|
|
|
|
fnvlist_free(snaps);
|
2013-08-28 15:45:09 +04:00
|
|
|
return (err);
|
|
|
|
}
|
|
|
|
|
2016-10-04 21:46:10 +03:00
|
|
|
static void
|
|
|
|
dmu_objset_upgrade_task_cb(void *data)
|
|
|
|
{
|
|
|
|
objset_t *os = data;
|
|
|
|
|
|
|
|
mutex_enter(&os->os_upgrade_lock);
|
|
|
|
os->os_upgrade_status = EINTR;
|
|
|
|
if (!os->os_upgrade_exit) {
|
2020-12-21 21:13:23 +03:00
|
|
|
int status;
|
|
|
|
|
2016-10-04 21:46:10 +03:00
|
|
|
mutex_exit(&os->os_upgrade_lock);
|
|
|
|
|
2020-12-21 21:13:23 +03:00
|
|
|
status = os->os_upgrade_cb(os);
|
|
|
|
|
2016-10-04 21:46:10 +03:00
|
|
|
mutex_enter(&os->os_upgrade_lock);
|
2020-12-21 21:13:23 +03:00
|
|
|
|
|
|
|
os->os_upgrade_status = status;
|
2016-10-04 21:46:10 +03:00
|
|
|
}
|
|
|
|
os->os_upgrade_exit = B_TRUE;
|
|
|
|
os->os_upgrade_id = 0;
|
|
|
|
mutex_exit(&os->os_upgrade_lock);
|
2017-11-11 00:37:10 +03:00
|
|
|
dsl_dataset_long_rele(dmu_objset_ds(os), upgrade_tag);
|
2016-10-04 21:46:10 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
dmu_objset_upgrade(objset_t *os, dmu_objset_upgrade_cb_t cb)
|
|
|
|
{
|
|
|
|
if (os->os_upgrade_id != 0)
|
|
|
|
return;
|
|
|
|
|
2017-11-11 00:37:10 +03:00
|
|
|
ASSERT(dsl_pool_config_held(dmu_objset_pool(os)));
|
|
|
|
dsl_dataset_long_hold(dmu_objset_ds(os), upgrade_tag);
|
|
|
|
|
2016-10-04 21:46:10 +03:00
|
|
|
mutex_enter(&os->os_upgrade_lock);
|
|
|
|
if (os->os_upgrade_id == 0 && os->os_upgrade_status == 0) {
|
|
|
|
os->os_upgrade_exit = B_FALSE;
|
|
|
|
os->os_upgrade_cb = cb;
|
|
|
|
os->os_upgrade_id = taskq_dispatch(
|
|
|
|
os->os_spa->spa_upgrade_taskq,
|
|
|
|
dmu_objset_upgrade_task_cb, os, TQ_SLEEP);
|
2017-11-11 00:37:10 +03:00
|
|
|
if (os->os_upgrade_id == TASKQID_INVALID) {
|
|
|
|
dsl_dataset_long_rele(dmu_objset_ds(os), upgrade_tag);
|
2016-10-04 21:46:10 +03:00
|
|
|
os->os_upgrade_status = ENOMEM;
|
2017-11-11 00:37:10 +03:00
|
|
|
}
|
2020-12-21 21:13:23 +03:00
|
|
|
} else {
|
|
|
|
dsl_dataset_long_rele(dmu_objset_ds(os), upgrade_tag);
|
2016-10-04 21:46:10 +03:00
|
|
|
}
|
|
|
|
mutex_exit(&os->os_upgrade_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
dmu_objset_upgrade_stop(objset_t *os)
|
|
|
|
{
|
|
|
|
mutex_enter(&os->os_upgrade_lock);
|
|
|
|
os->os_upgrade_exit = B_TRUE;
|
|
|
|
if (os->os_upgrade_id != 0) {
|
|
|
|
taskqid_t id = os->os_upgrade_id;
|
|
|
|
|
|
|
|
os->os_upgrade_id = 0;
|
|
|
|
mutex_exit(&os->os_upgrade_lock);
|
|
|
|
|
2017-11-11 00:37:10 +03:00
|
|
|
if ((taskq_cancel_id(os->os_spa->spa_upgrade_taskq, id)) == 0) {
|
|
|
|
dsl_dataset_long_rele(dmu_objset_ds(os), upgrade_tag);
|
|
|
|
}
|
2017-09-12 23:15:11 +03:00
|
|
|
txg_wait_synced(os->os_spa->spa_dsl_pool, 0);
|
2016-10-04 21:46:10 +03:00
|
|
|
} else {
|
|
|
|
mutex_exit(&os->os_upgrade_lock);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
2017-03-21 04:36:00 +03:00
|
|
|
dmu_objset_sync_dnodes(multilist_sublist_t *list, dmu_tx_t *tx)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
dnode_t *dn;
|
|
|
|
|
2017-03-21 04:36:00 +03:00
|
|
|
while ((dn = multilist_sublist_head(list)) != NULL) {
|
2008-11-20 23:01:55 +03:00
|
|
|
ASSERT(dn->dn_object != DMU_META_DNODE_OBJECT);
|
|
|
|
ASSERT(dn->dn_dbuf->db_data_pending);
|
|
|
|
/*
|
2009-07-03 02:44:48 +04:00
|
|
|
* Initialize dn_zio outside dnode_sync() because the
|
Log Spacemap Project
= Motivation
At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.
The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.
We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.
Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.
= About this patch
This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).
Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.
= Testing
The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.
= Performance Analysis (Linux Specific)
All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.
After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).
Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.
Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]
Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png
= Porting to Other Platforms
For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:
Make zdb results for checkpoint tests consistent
db587941c5ff6dea01932bb78f70db63cf7f38ba
Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba5914552c6185afbe1dd17b3ed4b0d526547
Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b1e4d7ebc1420ea30e51c6541f1d834
Factor metaslab_load_wait() in metaslab_load()
b194fab0fb6caad18711abccaff3c69ad8b3f6d3
Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe0ebac0b20e0750984bad182cb6564a
Change target size of metaslabs from 256GB to 16GB
c853f382db731e15a87512f4ef1101d14d778a55
zdb -L should skip leak detection altogether
21e7cf5da89f55ce98ec1115726b150e19eefe89
vs_alloc can underflow in L2ARC vdevs
7558997d2f808368867ca7e5234e5793446e8f3f
Simplify log vdev removal code
6c926f426a26ffb6d7d8e563e33fc176164175cb
Get rid of space_map_update() for ms_synced_length
425d3237ee88abc53d8522a7139c926d278b4b7f
Introduce auxiliary metaslab histograms
928e8ad47d3478a3d5d01f0dd6ae74a9371af65e
Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679ba54547f7d361553d21b3291f41ae7
= References
Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project
Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm
Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 20:11:49 +03:00
|
|
|
* meta-dnode needs to set it outside dnode_sync().
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
dn->dn_zio = dn->dn_dbuf->db_data_pending->dr_zio;
|
|
|
|
ASSERT(dn->dn_zio);
|
|
|
|
|
|
|
|
ASSERT3U(dn->dn_nlevels, <=, DN_MAX_LEVELS);
|
2017-03-21 04:36:00 +03:00
|
|
|
multilist_sublist_remove(list, dn);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2018-04-10 21:15:05 +03:00
|
|
|
/*
|
Improve zfs receive performance with lightweight write
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes. Much of the CPU is spent creating and
destroying dbuf's and arc buf's, one for each `WRITE` record in the send
stream.
This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this
"dirty leaf block" are not instantiated.
Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data). Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.
Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.
Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.
Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
The code is also restructured in a few ways:
Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines. The new field is needed by the lightweight-type dirty record.
To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that the `dnode_move()` doesn't relocate the
dnode_t. To do this we keep a hold on the dnode until it's zio's have
completed. This is already done by the user-accounting code
(`userquota_updates_task()`), this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).
`dn_dirty_txg` was previously zeroed when the dnode was synced. This
was not necessary, since its meaning can be "when was this dnode last
dirtied". This change simplifies the new `dnode_rele_task()` code.
Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
2020-12-11 21:26:02 +03:00
|
|
|
* See the comment above dnode_rele_task() for an explanation
|
|
|
|
* of why this dnode hold is always needed (even when not
|
|
|
|
* doing user accounting).
|
2018-04-10 21:15:05 +03:00
|
|
|
*/
|
2021-06-10 19:42:31 +03:00
|
|
|
multilist_t *newlist = &dn->dn_objset->os_synced_dnodes;
|
Improve zfs receive performance with lightweight write
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes. Much of the CPU is spent creating and
destroying dbuf's and arc buf's, one for each `WRITE` record in the send
stream.
This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this
"dirty leaf block" are not instantiated.
Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data). Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.
Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.
Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.
Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
The code is also restructured in a few ways:
Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines. The new field is needed by the lightweight-type dirty record.
To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that the `dnode_move()` doesn't relocate the
dnode_t. To do this we keep a hold on the dnode until it's zio's have
completed. This is already done by the user-accounting code
(`userquota_updates_task()`), this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).
`dn_dirty_txg` was previously zeroed when the dnode was synced. This
was not necessary, since its meaning can be "when was this dnode last
dirtied". This change simplifies the new `dnode_rele_task()` code.
Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
2020-12-11 21:26:02 +03:00
|
|
|
(void) dnode_add_ref(dn, newlist);
|
|
|
|
multilist_insert(newlist, dn);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
dnode_sync(dn, tx);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_objset_write_ready(zio_t *zio, arc_buf_t *abuf, void *arg)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2021-12-12 18:06:44 +03:00
|
|
|
(void) abuf;
|
2008-12-03 23:09:06 +03:00
|
|
|
blkptr_t *bp = zio->io_bp;
|
2010-05-29 00:45:14 +04:00
|
|
|
objset_t *os = arg;
|
2008-11-20 23:01:55 +03:00
|
|
|
dnode_phys_t *dnp = &os->os_phys->os_meta_dnode;
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
uint64_t fill = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2014-06-06 01:19:08 +04:00
|
|
|
ASSERT(!BP_IS_EMBEDDED(bp));
|
2013-09-04 16:00:57 +04:00
|
|
|
ASSERT3U(BP_GET_TYPE(bp), ==, DMU_OT_OBJSET);
|
|
|
|
ASSERT0(BP_GET_LEVEL(bp));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2009-07-03 02:44:48 +04:00
|
|
|
* Update rootbp fill count: it should be the number of objects
|
|
|
|
* allocated in the object set (not counting the "special"
|
|
|
|
* objects that are stored in the objset_phys_t -- the meta
|
2018-02-14 01:54:54 +03:00
|
|
|
* dnode and user/group/project accounting objects).
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2017-11-04 23:25:13 +03:00
|
|
|
for (int i = 0; i < dnp->dn_nblkptr; i++)
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
fill += BP_GET_FILL(&dnp->dn_blkptr[i]);
|
|
|
|
|
|
|
|
BP_SET_FILL(bp, fill);
|
|
|
|
|
2017-01-27 22:43:42 +03:00
|
|
|
if (os->os_dsl_dataset != NULL)
|
|
|
|
rrw_enter(&os->os_dsl_dataset->ds_bp_rwlock, RW_WRITER, FTAG);
|
|
|
|
*os->os_rootbp = *bp;
|
|
|
|
if (os->os_dsl_dataset != NULL)
|
|
|
|
rrw_exit(&os->os_dsl_dataset->ds_bp_rwlock, FTAG);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
dmu_objset_write_done(zio_t *zio, arc_buf_t *abuf, void *arg)
|
|
|
|
{
|
2021-12-12 18:06:44 +03:00
|
|
|
(void) abuf;
|
2010-05-29 00:45:14 +04:00
|
|
|
blkptr_t *bp = zio->io_bp;
|
|
|
|
blkptr_t *bp_orig = &zio->io_bp_orig;
|
|
|
|
objset_t *os = arg;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
if (zio->io_flags & ZIO_FLAG_IO_REWRITE) {
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(BP_EQUAL(bp, bp_orig));
|
2008-12-03 23:09:06 +03:00
|
|
|
} else {
|
2010-05-29 00:45:14 +04:00
|
|
|
dsl_dataset_t *ds = os->os_dsl_dataset;
|
|
|
|
dmu_tx_t *tx = os->os_synctx;
|
|
|
|
|
|
|
|
(void) dsl_dataset_block_kill(ds, bp_orig, tx, B_TRUE);
|
|
|
|
dsl_dataset_block_born(ds, bp, tx);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2017-01-27 22:43:42 +03:00
|
|
|
kmem_free(bp, sizeof (*bp));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2023-11-06 21:38:42 +03:00
|
|
|
typedef struct sync_objset_arg {
|
|
|
|
zio_t *soa_zio;
|
|
|
|
objset_t *soa_os;
|
|
|
|
dmu_tx_t *soa_tx;
|
|
|
|
kmutex_t soa_mutex;
|
|
|
|
int soa_count;
|
|
|
|
taskq_ent_t soa_tq_ent;
|
|
|
|
} sync_objset_arg_t;
|
|
|
|
|
2017-03-21 04:36:00 +03:00
|
|
|
typedef struct sync_dnodes_arg {
|
2023-11-06 21:38:42 +03:00
|
|
|
multilist_t *sda_list;
|
|
|
|
int sda_sublist_idx;
|
|
|
|
multilist_t *sda_newlist;
|
|
|
|
sync_objset_arg_t *sda_soa;
|
2017-03-21 04:36:00 +03:00
|
|
|
} sync_dnodes_arg_t;
|
|
|
|
|
2023-11-06 21:38:42 +03:00
|
|
|
static void sync_meta_dnode_task(void *arg);
|
|
|
|
|
2017-03-21 04:36:00 +03:00
|
|
|
static void
|
|
|
|
sync_dnodes_task(void *arg)
|
|
|
|
{
|
|
|
|
sync_dnodes_arg_t *sda = arg;
|
2023-11-06 21:38:42 +03:00
|
|
|
sync_objset_arg_t *soa = sda->sda_soa;
|
|
|
|
objset_t *os = soa->soa_os;
|
2017-03-21 04:36:00 +03:00
|
|
|
|
2024-05-01 21:07:20 +03:00
|
|
|
uint_t allocator = spa_acq_allocator(os->os_spa);
|
2017-03-21 04:36:00 +03:00
|
|
|
multilist_sublist_t *ms =
|
2024-04-10 02:23:19 +03:00
|
|
|
multilist_sublist_lock_idx(sda->sda_list, sda->sda_sublist_idx);
|
2017-03-21 04:36:00 +03:00
|
|
|
|
2023-11-06 21:38:42 +03:00
|
|
|
dmu_objset_sync_dnodes(ms, soa->soa_tx);
|
2017-03-21 04:36:00 +03:00
|
|
|
|
|
|
|
multilist_sublist_unlock(ms);
|
2024-05-01 21:07:20 +03:00
|
|
|
spa_rel_allocator(os->os_spa, allocator);
|
2017-03-21 04:36:00 +03:00
|
|
|
|
|
|
|
kmem_free(sda, sizeof (*sda));
|
2023-11-06 21:38:42 +03:00
|
|
|
|
|
|
|
mutex_enter(&soa->soa_mutex);
|
|
|
|
ASSERT(soa->soa_count != 0);
|
|
|
|
if (--soa->soa_count != 0) {
|
|
|
|
mutex_exit(&soa->soa_mutex);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
mutex_exit(&soa->soa_mutex);
|
|
|
|
|
|
|
|
taskq_dispatch_ent(dmu_objset_pool(os)->dp_sync_taskq,
|
|
|
|
sync_meta_dnode_task, soa, TQ_FRONT, &soa->soa_tq_ent);
|
2017-03-21 04:36:00 +03:00
|
|
|
}
|
|
|
|
|
2023-11-06 21:38:42 +03:00
|
|
|
/*
|
|
|
|
* Issue the zio_nowait() for all dirty record zios on the meta dnode,
|
|
|
|
* then trigger the callback for the zil_sync. This runs once for each
|
|
|
|
* objset, only after any/all sublists in the objset have been synced.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
sync_meta_dnode_task(void *arg)
|
|
|
|
{
|
|
|
|
sync_objset_arg_t *soa = arg;
|
|
|
|
objset_t *os = soa->soa_os;
|
|
|
|
dmu_tx_t *tx = soa->soa_tx;
|
|
|
|
int txgoff = tx->tx_txg & TXG_MASK;
|
|
|
|
dbuf_dirty_record_t *dr;
|
|
|
|
|
|
|
|
ASSERT0(soa->soa_count);
|
|
|
|
|
|
|
|
list_t *list = &DMU_META_DNODE(os)->dn_dirty_records[txgoff];
|
|
|
|
while ((dr = list_remove_head(list)) != NULL) {
|
|
|
|
ASSERT0(dr->dr_dbuf->db_level);
|
|
|
|
zio_nowait(dr->dr_zio);
|
|
|
|
}
|
|
|
|
|
|
|
|
/* Enable dnode backfill if enough objects have been freed. */
|
|
|
|
if (os->os_freed_dnodes >= dmu_rescan_dnode_threshold) {
|
|
|
|
os->os_rescan_dnodes = B_TRUE;
|
|
|
|
os->os_freed_dnodes = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Free intent log blocks up to this tx.
|
|
|
|
*/
|
|
|
|
zil_sync(os->os_zil, tx);
|
|
|
|
os->os_phys->os_zil_header = os->os_zil_header;
|
|
|
|
zio_nowait(soa->soa_zio);
|
|
|
|
|
|
|
|
mutex_destroy(&soa->soa_mutex);
|
|
|
|
kmem_free(soa, sizeof (*soa));
|
|
|
|
}
|
2017-03-21 04:36:00 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/* called from dsl */
|
|
|
|
void
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_objset_sync(objset_t *os, zio_t *pio, dmu_tx_t *tx)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
int txgoff;
|
2014-06-25 22:37:59 +04:00
|
|
|
zbookmark_phys_t zb;
|
2010-05-29 00:45:14 +04:00
|
|
|
zio_prop_t zp;
|
2008-11-20 23:01:55 +03:00
|
|
|
zio_t *zio;
|
2019-06-25 22:03:38 +03:00
|
|
|
int num_sublists;
|
|
|
|
multilist_t *ml;
|
2017-01-27 22:43:42 +03:00
|
|
|
blkptr_t *blkptr_copy = kmem_alloc(sizeof (*os->os_rootbp), KM_SLEEP);
|
|
|
|
*blkptr_copy = *os->os_rootbp;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-06-23 07:53:45 +03:00
|
|
|
dprintf_ds(os->os_dsl_dataset, "txg=%llu\n", (u_longlong_t)tx->tx_txg);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
ASSERT(dmu_tx_is_syncing(tx));
|
|
|
|
/* XXX the write_done callback should really give us the tx... */
|
|
|
|
os->os_synctx = tx;
|
|
|
|
|
|
|
|
if (os->os_dsl_dataset == NULL) {
|
|
|
|
/*
|
|
|
|
* This is the MOS. If we have upgraded,
|
|
|
|
* spa_max_replication() could change, so reset
|
|
|
|
* os_copies here.
|
|
|
|
*/
|
|
|
|
os->os_copies = spa_max_replication(os->os_spa);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Create the root block IO
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
SET_BOOKMARK(&zb, os->os_dsl_dataset ?
|
|
|
|
os->os_dsl_dataset->ds_object : DMU_META_OBJSET,
|
|
|
|
ZB_ROOT_OBJECT, ZB_ROOT_LEVEL, ZB_ROOT_BLKID);
|
2013-07-03 00:26:24 +04:00
|
|
|
arc_release(os->os_phys_buf, &os->os_phys_buf);
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2017-03-23 19:07:27 +03:00
|
|
|
dmu_write_policy(os, NULL, 0, 0, &zp);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
/*
|
2018-04-17 21:06:54 +03:00
|
|
|
* If we are either claiming the ZIL or doing a raw receive, write
|
|
|
|
* out the os_phys_buf raw. Neither of these actions will effect the
|
|
|
|
* MAC at this point.
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
*/
|
2018-04-17 21:06:54 +03:00
|
|
|
if (os->os_raw_receive ||
|
|
|
|
os->os_next_write_raw[tx->tx_txg & TXG_MASK]) {
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
ASSERT(os->os_encrypted);
|
|
|
|
arc_convert_to_raw(os->os_phys_buf,
|
|
|
|
os->os_dsl_dataset->ds_object, ZFS_HOST_BYTEORDER,
|
|
|
|
DMU_OT_OBJSET, NULL, NULL, NULL);
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
zio = arc_write(pio, os->os_spa, tx->tx_txg,
|
Implement uncached prefetch
Previously the primarycache property was handled only in the dbuf
layer. Since the speculative prefetcher is implemented in the ARC,
it had to be disabled for uncacheable buffers.
This change gives the ARC knowledge about uncacheable buffers
via arc_read() and arc_write(). So when remove_reference() drops
the last reference on the ARC header, it can either immediately destroy
it, or if it is marked as prefetch, put it into a new arc_uncached state.
That state is scanned every second, evicting stale buffers that were
not demand read.
This change also tracks dbufs that were read from the beginning,
but not to the end. It is assumed that such buffers may receive further
reads, and so they are stored in dbuf cache. If a following
reads reaches the end of the buffer, it is immediately evicted.
Otherwise it will follow regular dbuf cache eviction. Since the dbuf
layer does not know actual file sizes, this logic is not applied to
the final buffer of a dnode.
Since uncacheable buffers should no longer stay in the ARC for long,
this patch also tries to optimize I/O by allocating ARC physical
buffers as linear to allow buffer sharing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14243
2023-01-05 03:29:54 +03:00
|
|
|
blkptr_copy, os->os_phys_buf, B_FALSE, dmu_os_is_l2cacheable(os),
|
Remove ARC/ZIO physdone callbacks.
Those callbacks were introduced many years ago as part of a bigger
patch to smoothen the write throttling within a txg. They allow to
account completion of individual physical writes within a logical
one, improving cases when some of physical writes complete much
sooner than others, gradually opening the write throttle.
Few years after that ZFS got allocation throttling, working on a
level of logical writes and limiting number of writes queued to
vdevs at any point, and so limiting latency distribution between
the physical writes and especially writes of multiple copies.
The addition of scheduling deadline I proposed in #14925 should
further reduce the latency distribution. Grown memory sizes over
the past 10 years should also reduce importance of the smoothing.
While the use of physdone callback may still in theory provide
some smoother throttling, there are cases where we simply can not
afford it. Since dirty data accounting is protected by pool-wide
lock, in case of 6-wide RAIDZ, for example, it requires us to take
it 8 times per logical block write, creating huge lock contention.
My tests of this patch show radical reduction of the lock spinning
time on workloads when smaller blocks are written to RAIDZ pools,
when each of the disks receives 8-16KB chunks, but the total rate
reaching 100K+ blocks per second. Same time attempts to measure
any write time fluctuations didn't show anything noticeable.
While there, remove also io_child_count/io_parent_count counters.
They are used only for couple assertions that can be avoided.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14948
2023-06-15 20:49:03 +03:00
|
|
|
&zp, dmu_objset_write_ready, NULL, dmu_objset_write_done,
|
2016-05-15 18:02:28 +03:00
|
|
|
os, ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_MUSTSUCCEED, &zb);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2009-07-03 02:44:48 +04:00
|
|
|
* Sync special dnodes - the parent IO for the sync is the root block
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2010-08-27 01:24:34 +04:00
|
|
|
DMU_META_DNODE(os)->dn_zio = zio;
|
|
|
|
dnode_sync(DMU_META_DNODE(os), tx);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
os->os_phys->os_flags = os->os_flags;
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
if (DMU_USERUSED_DNODE(os) &&
|
|
|
|
DMU_USERUSED_DNODE(os)->dn_type != DMU_OT_NONE) {
|
|
|
|
DMU_USERUSED_DNODE(os)->dn_zio = zio;
|
|
|
|
dnode_sync(DMU_USERUSED_DNODE(os), tx);
|
|
|
|
DMU_GROUPUSED_DNODE(os)->dn_zio = zio;
|
|
|
|
dnode_sync(DMU_GROUPUSED_DNODE(os), tx);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
|
|
|
|
2018-02-14 01:54:54 +03:00
|
|
|
if (DMU_PROJECTUSED_DNODE(os) &&
|
|
|
|
DMU_PROJECTUSED_DNODE(os)->dn_type != DMU_OT_NONE) {
|
|
|
|
DMU_PROJECTUSED_DNODE(os)->dn_zio = zio;
|
|
|
|
dnode_sync(DMU_PROJECTUSED_DNODE(os), tx);
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
txgoff = tx->tx_txg & TXG_MASK;
|
|
|
|
|
Improve zfs receive performance with lightweight write
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes. Much of the CPU is spent creating and
destroying dbuf's and arc buf's, one for each `WRITE` record in the send
stream.
This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this
"dirty leaf block" are not instantiated.
Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data). Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.
Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.
Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.
Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
The code is also restructured in a few ways:
Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines. The new field is needed by the lightweight-type dirty record.
To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that the `dnode_move()` doesn't relocate the
dnode_t. To do this we keep a hold on the dnode until it's zio's have
completed. This is already done by the user-accounting code
(`userquota_updates_task()`), this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).
`dn_dirty_txg` was previously zeroed when the dnode was synced. This
was not necessary, since its meaning can be "when was this dnode last
dirtied". This change simplifies the new `dnode_rele_task()` code.
Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
2020-12-11 21:26:02 +03:00
|
|
|
/*
|
|
|
|
* We must create the list here because it uses the
|
|
|
|
* dn_dirty_link[] of this txg. But it may already
|
|
|
|
* exist because we call dsl_dataset_sync() twice per txg.
|
|
|
|
*/
|
2021-06-10 19:42:31 +03:00
|
|
|
if (os->os_synced_dnodes.ml_sublists == NULL) {
|
|
|
|
multilist_create(&os->os_synced_dnodes, sizeof (dnode_t),
|
Improve zfs receive performance with lightweight write
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes. Much of the CPU is spent creating and
destroying dbuf's and arc buf's, one for each `WRITE` record in the send
stream.
This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this
"dirty leaf block" are not instantiated.
Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data). Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.
Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.
Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.
Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
The code is also restructured in a few ways:
Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines. The new field is needed by the lightweight-type dirty record.
To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that the `dnode_move()` doesn't relocate the
dnode_t. To do this we keep a hold on the dnode until it's zio's have
completed. This is already done by the user-accounting code
(`userquota_updates_task()`), this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).
`dn_dirty_txg` was previously zeroed when the dnode was synced. This
was not necessary, since its meaning can be "when was this dnode last
dirtied". This change simplifies the new `dnode_rele_task()` code.
Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
2020-12-11 21:26:02 +03:00
|
|
|
offsetof(dnode_t, dn_dirty_link[txgoff]),
|
|
|
|
dnode_multilist_index_func);
|
|
|
|
} else {
|
2021-06-10 19:42:31 +03:00
|
|
|
ASSERT3U(os->os_synced_dnodes.ml_offset, ==,
|
Improve zfs receive performance with lightweight write
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes. Much of the CPU is spent creating and
destroying dbuf's and arc buf's, one for each `WRITE` record in the send
stream.
This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this
"dirty leaf block" are not instantiated.
Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data). Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.
Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.
Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.
Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
The code is also restructured in a few ways:
Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines. The new field is needed by the lightweight-type dirty record.
To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that the `dnode_move()` doesn't relocate the
dnode_t. To do this we keep a hold on the dnode until it's zio's have
completed. This is already done by the user-accounting code
(`userquota_updates_task()`), this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).
`dn_dirty_txg` was previously zeroed when the dnode was synced. This
was not necessary, since its meaning can be "when was this dnode last
dirtied". This change simplifies the new `dnode_rele_task()` code.
Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
2020-12-11 21:26:02 +03:00
|
|
|
offsetof(dnode_t, dn_dirty_link[txgoff]));
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
|
|
|
|
2023-11-06 21:38:42 +03:00
|
|
|
/*
|
|
|
|
* zio_nowait(zio) is done after any/all sublist and meta dnode
|
|
|
|
* zios have been nowaited, and the zil_sync() has been performed.
|
|
|
|
* The soa is freed at the end of sync_meta_dnode_task.
|
|
|
|
*/
|
|
|
|
sync_objset_arg_t *soa = kmem_alloc(sizeof (*soa), KM_SLEEP);
|
|
|
|
soa->soa_zio = zio;
|
|
|
|
soa->soa_os = os;
|
|
|
|
soa->soa_tx = tx;
|
|
|
|
taskq_init_ent(&soa->soa_tq_ent);
|
|
|
|
mutex_init(&soa->soa_mutex, NULL, MUTEX_DEFAULT, NULL);
|
|
|
|
|
2021-06-10 19:42:31 +03:00
|
|
|
ml = &os->os_dirty_dnodes[txgoff];
|
2023-11-06 21:38:42 +03:00
|
|
|
soa->soa_count = num_sublists = multilist_get_num_sublists(ml);
|
|
|
|
|
2019-06-25 22:03:38 +03:00
|
|
|
for (int i = 0; i < num_sublists; i++) {
|
|
|
|
if (multilist_sublist_is_empty_idx(ml, i))
|
2023-11-06 21:38:42 +03:00
|
|
|
soa->soa_count--;
|
2017-03-21 04:36:00 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2023-11-06 21:38:42 +03:00
|
|
|
if (soa->soa_count == 0) {
|
|
|
|
taskq_dispatch_ent(dmu_objset_pool(os)->dp_sync_taskq,
|
|
|
|
sync_meta_dnode_task, soa, TQ_FRONT, &soa->soa_tq_ent);
|
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Sync sublists in parallel. The last to finish
|
|
|
|
* (i.e., when soa->soa_count reaches zero) must
|
|
|
|
* dispatch sync_meta_dnode_task.
|
|
|
|
*/
|
|
|
|
for (int i = 0; i < num_sublists; i++) {
|
|
|
|
if (multilist_sublist_is_empty_idx(ml, i))
|
|
|
|
continue;
|
|
|
|
sync_dnodes_arg_t *sda =
|
|
|
|
kmem_alloc(sizeof (*sda), KM_SLEEP);
|
|
|
|
sda->sda_list = ml;
|
|
|
|
sda->sda_sublist_idx = i;
|
|
|
|
sda->sda_soa = soa;
|
|
|
|
(void) taskq_dispatch(
|
|
|
|
dmu_objset_pool(os)->dp_sync_taskq,
|
|
|
|
sync_dnodes_task, sda, 0);
|
|
|
|
/* sync_dnodes_task frees sda */
|
|
|
|
}
|
2016-05-17 04:02:29 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
boolean_t
|
|
|
|
dmu_objset_is_dirty(objset_t *os, uint64_t txg)
|
|
|
|
{
|
2021-06-10 19:42:31 +03:00
|
|
|
return (!multilist_is_empty(&os->os_dirty_dnodes[txg & TXG_MASK]));
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
File incorrectly zeroed when receiving incremental stream that toggles -L
Background:
By increasing the recordsize property above the default of 128KB, a
filesystem may have "large" blocks. By default, a send stream of such a
filesystem does not contain large WRITE records, instead it decreases
objects' block sizes to 128KB and splits the large blocks into 128KB
blocks, allowing the large-block filesystem to be received by a system
that does not support the `large_blocks` feature. A send stream
generated by `zfs send -L` (or `--large-block`) preserves the large
block size on the receiving system, by using large WRITE records.
When receiving an incremental send stream for a filesystem with large
blocks, if the send stream's -L flag was toggled, a bug is encountered
in which the file's contents are incorrectly zeroed out. The contents
of any blocks that were not modified by this send stream will be lost.
"Toggled" means that the previous send used `-L`, but this incremental
does not use `-L` (-L to no-L); or that the previous send did not use
`-L`, but this incremental does use `-L` (no-L to -L).
Changes:
This commit addresses the problem with several changes to the semantics
of zfs send/receive:
1. "-L to no-L" incrementals are rejected. If the previous send used
`-L`, but this incremental does not use `-L`, the `zfs receive` will
fail with this error message:
incremental send stream requires -L (--large-block), to match
previous receive.
2. "no-L to -L" incrementals are handled correctly, preserving the
smaller (128KB) block size of any already-received files that used large
blocks on the sending system but were split by `zfs send` without the
`-L` flag.
3. A new send stream format flag is added, `SWITCH_TO_LARGE_BLOCKS`.
This feature indicates that we can correctly handle "no-L to -L"
incrementals. This flag is currently not set on any send streams. In
the future, we intend for incremental send streams of snapshots that
have large blocks to use `-L` by default, and these streams will also
have the `SWITCH_TO_LARGE_BLOCKS` feature set. This ensures that streams
from the default use of `zfs send` won't encounter the bug mentioned
above, because they can't be received by software with the bug.
Implementation notes:
To facilitate accessing the ZPL's generation number,
`zfs_space_delta_cb()` has been renamed to `zpl_get_file_info()` and
restructured to fill in a struct with ZPL-specific info including owner
and generation.
In the "no-L to -L" case, if this is a compressed send stream (from
`zfs send -cL`), large WRITE records that are being written to small
(128KB) blocksize files need to be decompressed so that they can be
written split up into multiple blocks. The zio pipeline will recompress
each smaller block individually.
A new test case, `send-L_toggle`, is added, which tests the "no-L to -L"
case and verifies that we get an error for the "-L to no-L" case.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #6224
Closes #10383
2020-06-09 20:41:01 +03:00
|
|
|
static file_info_cb_t *file_cbs[DMU_OST_NUMTYPES];
|
2009-07-03 02:44:48 +04:00
|
|
|
|
|
|
|
void
|
File incorrectly zeroed when receiving incremental stream that toggles -L
Background:
By increasing the recordsize property above the default of 128KB, a
filesystem may have "large" blocks. By default, a send stream of such a
filesystem does not contain large WRITE records, instead it decreases
objects' block sizes to 128KB and splits the large blocks into 128KB
blocks, allowing the large-block filesystem to be received by a system
that does not support the `large_blocks` feature. A send stream
generated by `zfs send -L` (or `--large-block`) preserves the large
block size on the receiving system, by using large WRITE records.
When receiving an incremental send stream for a filesystem with large
blocks, if the send stream's -L flag was toggled, a bug is encountered
in which the file's contents are incorrectly zeroed out. The contents
of any blocks that were not modified by this send stream will be lost.
"Toggled" means that the previous send used `-L`, but this incremental
does not use `-L` (-L to no-L); or that the previous send did not use
`-L`, but this incremental does use `-L` (no-L to -L).
Changes:
This commit addresses the problem with several changes to the semantics
of zfs send/receive:
1. "-L to no-L" incrementals are rejected. If the previous send used
`-L`, but this incremental does not use `-L`, the `zfs receive` will
fail with this error message:
incremental send stream requires -L (--large-block), to match
previous receive.
2. "no-L to -L" incrementals are handled correctly, preserving the
smaller (128KB) block size of any already-received files that used large
blocks on the sending system but were split by `zfs send` without the
`-L` flag.
3. A new send stream format flag is added, `SWITCH_TO_LARGE_BLOCKS`.
This feature indicates that we can correctly handle "no-L to -L"
incrementals. This flag is currently not set on any send streams. In
the future, we intend for incremental send streams of snapshots that
have large blocks to use `-L` by default, and these streams will also
have the `SWITCH_TO_LARGE_BLOCKS` feature set. This ensures that streams
from the default use of `zfs send` won't encounter the bug mentioned
above, because they can't be received by software with the bug.
Implementation notes:
To facilitate accessing the ZPL's generation number,
`zfs_space_delta_cb()` has been renamed to `zpl_get_file_info()` and
restructured to fill in a struct with ZPL-specific info including owner
and generation.
In the "no-L to -L" case, if this is a compressed send stream (from
`zfs send -cL`), large WRITE records that are being written to small
(128KB) blocksize files need to be decompressed so that they can be
written split up into multiple blocks. The zio pipeline will recompress
each smaller block individually.
A new test case, `send-L_toggle`, is added, which tests the "no-L to -L"
case and verifies that we get an error for the "-L to no-L" case.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #6224
Closes #10383
2020-06-09 20:41:01 +03:00
|
|
|
dmu_objset_register_type(dmu_objset_type_t ost, file_info_cb_t *cb)
|
2009-07-03 02:44:48 +04:00
|
|
|
{
|
File incorrectly zeroed when receiving incremental stream that toggles -L
Background:
By increasing the recordsize property above the default of 128KB, a
filesystem may have "large" blocks. By default, a send stream of such a
filesystem does not contain large WRITE records, instead it decreases
objects' block sizes to 128KB and splits the large blocks into 128KB
blocks, allowing the large-block filesystem to be received by a system
that does not support the `large_blocks` feature. A send stream
generated by `zfs send -L` (or `--large-block`) preserves the large
block size on the receiving system, by using large WRITE records.
When receiving an incremental send stream for a filesystem with large
blocks, if the send stream's -L flag was toggled, a bug is encountered
in which the file's contents are incorrectly zeroed out. The contents
of any blocks that were not modified by this send stream will be lost.
"Toggled" means that the previous send used `-L`, but this incremental
does not use `-L` (-L to no-L); or that the previous send did not use
`-L`, but this incremental does use `-L` (no-L to -L).
Changes:
This commit addresses the problem with several changes to the semantics
of zfs send/receive:
1. "-L to no-L" incrementals are rejected. If the previous send used
`-L`, but this incremental does not use `-L`, the `zfs receive` will
fail with this error message:
incremental send stream requires -L (--large-block), to match
previous receive.
2. "no-L to -L" incrementals are handled correctly, preserving the
smaller (128KB) block size of any already-received files that used large
blocks on the sending system but were split by `zfs send` without the
`-L` flag.
3. A new send stream format flag is added, `SWITCH_TO_LARGE_BLOCKS`.
This feature indicates that we can correctly handle "no-L to -L"
incrementals. This flag is currently not set on any send streams. In
the future, we intend for incremental send streams of snapshots that
have large blocks to use `-L` by default, and these streams will also
have the `SWITCH_TO_LARGE_BLOCKS` feature set. This ensures that streams
from the default use of `zfs send` won't encounter the bug mentioned
above, because they can't be received by software with the bug.
Implementation notes:
To facilitate accessing the ZPL's generation number,
`zfs_space_delta_cb()` has been renamed to `zpl_get_file_info()` and
restructured to fill in a struct with ZPL-specific info including owner
and generation.
In the "no-L to -L" case, if this is a compressed send stream (from
`zfs send -cL`), large WRITE records that are being written to small
(128KB) blocksize files need to be decompressed so that they can be
written split up into multiple blocks. The zio pipeline will recompress
each smaller block individually.
A new test case, `send-L_toggle`, is added, which tests the "no-L to -L"
case and verifies that we get an error for the "-L to no-L" case.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #6224
Closes #10383
2020-06-09 20:41:01 +03:00
|
|
|
file_cbs[ost] = cb;
|
|
|
|
}
|
|
|
|
|
|
|
|
int
|
|
|
|
dmu_get_file_info(objset_t *os, dmu_object_type_t bonustype, const void *data,
|
|
|
|
zfs_file_info_t *zfi)
|
|
|
|
{
|
|
|
|
file_info_cb_t *cb = file_cbs[os->os_phys->os_type];
|
|
|
|
if (cb == NULL)
|
|
|
|
return (EINVAL);
|
|
|
|
return (cb(bonustype, data, zfi));
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
boolean_t
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_objset_userused_enabled(objset_t *os)
|
2009-07-03 02:44:48 +04:00
|
|
|
{
|
|
|
|
return (spa_version(os->os_spa) >= SPA_VERSION_USERSPACE &&
|
File incorrectly zeroed when receiving incremental stream that toggles -L
Background:
By increasing the recordsize property above the default of 128KB, a
filesystem may have "large" blocks. By default, a send stream of such a
filesystem does not contain large WRITE records, instead it decreases
objects' block sizes to 128KB and splits the large blocks into 128KB
blocks, allowing the large-block filesystem to be received by a system
that does not support the `large_blocks` feature. A send stream
generated by `zfs send -L` (or `--large-block`) preserves the large
block size on the receiving system, by using large WRITE records.
When receiving an incremental send stream for a filesystem with large
blocks, if the send stream's -L flag was toggled, a bug is encountered
in which the file's contents are incorrectly zeroed out. The contents
of any blocks that were not modified by this send stream will be lost.
"Toggled" means that the previous send used `-L`, but this incremental
does not use `-L` (-L to no-L); or that the previous send did not use
`-L`, but this incremental does use `-L` (no-L to -L).
Changes:
This commit addresses the problem with several changes to the semantics
of zfs send/receive:
1. "-L to no-L" incrementals are rejected. If the previous send used
`-L`, but this incremental does not use `-L`, the `zfs receive` will
fail with this error message:
incremental send stream requires -L (--large-block), to match
previous receive.
2. "no-L to -L" incrementals are handled correctly, preserving the
smaller (128KB) block size of any already-received files that used large
blocks on the sending system but were split by `zfs send` without the
`-L` flag.
3. A new send stream format flag is added, `SWITCH_TO_LARGE_BLOCKS`.
This feature indicates that we can correctly handle "no-L to -L"
incrementals. This flag is currently not set on any send streams. In
the future, we intend for incremental send streams of snapshots that
have large blocks to use `-L` by default, and these streams will also
have the `SWITCH_TO_LARGE_BLOCKS` feature set. This ensures that streams
from the default use of `zfs send` won't encounter the bug mentioned
above, because they can't be received by software with the bug.
Implementation notes:
To facilitate accessing the ZPL's generation number,
`zfs_space_delta_cb()` has been renamed to `zpl_get_file_info()` and
restructured to fill in a struct with ZPL-specific info including owner
and generation.
In the "no-L to -L" case, if this is a compressed send stream (from
`zfs send -cL`), large WRITE records that are being written to small
(128KB) blocksize files need to be decompressed so that they can be
written split up into multiple blocks. The zio pipeline will recompress
each smaller block individually.
A new test case, `send-L_toggle`, is added, which tests the "no-L to -L"
case and verifies that we get an error for the "-L to no-L" case.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #6224
Closes #10383
2020-06-09 20:41:01 +03:00
|
|
|
file_cbs[os->os_phys->os_type] != NULL &&
|
2010-08-27 01:24:34 +04:00
|
|
|
DMU_USERUSED_DNODE(os) != NULL);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
|
|
|
|
2016-10-04 21:46:10 +03:00
|
|
|
boolean_t
|
|
|
|
dmu_objset_userobjused_enabled(objset_t *os)
|
|
|
|
{
|
|
|
|
return (dmu_objset_userused_enabled(os) &&
|
|
|
|
spa_feature_is_enabled(os->os_spa, SPA_FEATURE_USEROBJ_ACCOUNTING));
|
|
|
|
}
|
|
|
|
|
2018-02-14 01:54:54 +03:00
|
|
|
boolean_t
|
|
|
|
dmu_objset_projectquota_enabled(objset_t *os)
|
|
|
|
{
|
File incorrectly zeroed when receiving incremental stream that toggles -L
Background:
By increasing the recordsize property above the default of 128KB, a
filesystem may have "large" blocks. By default, a send stream of such a
filesystem does not contain large WRITE records, instead it decreases
objects' block sizes to 128KB and splits the large blocks into 128KB
blocks, allowing the large-block filesystem to be received by a system
that does not support the `large_blocks` feature. A send stream
generated by `zfs send -L` (or `--large-block`) preserves the large
block size on the receiving system, by using large WRITE records.
When receiving an incremental send stream for a filesystem with large
blocks, if the send stream's -L flag was toggled, a bug is encountered
in which the file's contents are incorrectly zeroed out. The contents
of any blocks that were not modified by this send stream will be lost.
"Toggled" means that the previous send used `-L`, but this incremental
does not use `-L` (-L to no-L); or that the previous send did not use
`-L`, but this incremental does use `-L` (no-L to -L).
Changes:
This commit addresses the problem with several changes to the semantics
of zfs send/receive:
1. "-L to no-L" incrementals are rejected. If the previous send used
`-L`, but this incremental does not use `-L`, the `zfs receive` will
fail with this error message:
incremental send stream requires -L (--large-block), to match
previous receive.
2. "no-L to -L" incrementals are handled correctly, preserving the
smaller (128KB) block size of any already-received files that used large
blocks on the sending system but were split by `zfs send` without the
`-L` flag.
3. A new send stream format flag is added, `SWITCH_TO_LARGE_BLOCKS`.
This feature indicates that we can correctly handle "no-L to -L"
incrementals. This flag is currently not set on any send streams. In
the future, we intend for incremental send streams of snapshots that
have large blocks to use `-L` by default, and these streams will also
have the `SWITCH_TO_LARGE_BLOCKS` feature set. This ensures that streams
from the default use of `zfs send` won't encounter the bug mentioned
above, because they can't be received by software with the bug.
Implementation notes:
To facilitate accessing the ZPL's generation number,
`zfs_space_delta_cb()` has been renamed to `zpl_get_file_info()` and
restructured to fill in a struct with ZPL-specific info including owner
and generation.
In the "no-L to -L" case, if this is a compressed send stream (from
`zfs send -cL`), large WRITE records that are being written to small
(128KB) blocksize files need to be decompressed so that they can be
written split up into multiple blocks. The zio pipeline will recompress
each smaller block individually.
A new test case, `send-L_toggle`, is added, which tests the "no-L to -L"
case and verifies that we get an error for the "-L to no-L" case.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #6224
Closes #10383
2020-06-09 20:41:01 +03:00
|
|
|
return (file_cbs[os->os_phys->os_type] != NULL &&
|
2018-02-14 01:54:54 +03:00
|
|
|
DMU_PROJECTUSED_DNODE(os) != NULL &&
|
|
|
|
spa_feature_is_enabled(os->os_spa, SPA_FEATURE_PROJECT_QUOTA));
|
|
|
|
}
|
|
|
|
|
2016-09-21 23:49:47 +03:00
|
|
|
typedef struct userquota_node {
|
|
|
|
/* must be in the first filed, see userquota_update_cache() */
|
|
|
|
char uqn_id[20 + DMU_OBJACCT_PREFIX_LEN];
|
|
|
|
int64_t uqn_delta;
|
|
|
|
avl_node_t uqn_node;
|
|
|
|
} userquota_node_t;
|
|
|
|
|
|
|
|
typedef struct userquota_cache {
|
|
|
|
avl_tree_t uqc_user_deltas;
|
|
|
|
avl_tree_t uqc_group_deltas;
|
2018-02-14 01:54:54 +03:00
|
|
|
avl_tree_t uqc_project_deltas;
|
2016-09-21 23:49:47 +03:00
|
|
|
} userquota_cache_t;
|
|
|
|
|
|
|
|
static int
|
|
|
|
userquota_compare(const void *l, const void *r)
|
|
|
|
{
|
|
|
|
const userquota_node_t *luqn = l;
|
|
|
|
const userquota_node_t *ruqn = r;
|
2016-10-21 18:23:27 +03:00
|
|
|
int rv;
|
2016-09-21 23:49:47 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* NB: can only access uqn_id because userquota_update_cache() doesn't
|
|
|
|
* pass in an entire userquota_node_t.
|
|
|
|
*/
|
2016-10-21 18:23:27 +03:00
|
|
|
rv = strcmp(luqn->uqn_id, ruqn->uqn_id);
|
|
|
|
|
Reduce loaded range tree memory usage
This patch implements a new tree structure for ZFS, and uses it to
store range trees more efficiently.
The new structure is approximately a B-tree, though there are some
small differences from the usual characterizations. The tree has core
nodes and leaf nodes; each contain data elements, which the elements
in the core nodes acting as separators between its children. The
difference between core and leaf nodes is that the core nodes have an
array of children, while leaf nodes don't. Every node in the tree may
be only partially full; in most cases, they are all at least 50% full
(in terms of element count) except for the root node, which can be
less full. Underfull nodes will steal from their neighbors or merge to
remain full enough, while overfull nodes will split in two. The data
elements are contained in tree-controlled buffers; they are copied
into these on insertion, and overwritten on deletion. This means that
the elements are not independently allocated, which reduces overhead,
but also means they can't be shared between trees (and also that
pointers to them are only valid until a side-effectful tree operation
occurs). The overhead varies based on how dense the tree is, but is
usually on the order of about 50% of the element size; the per-node
overheads are very small, and so don't make a significant difference.
The trees can accept arbitrary records; they accept a size and a
comparator to allow them to be used for a variety of purposes.
The new trees replace the AVL trees used in the range trees today.
Currently, the range_seg_t structure contains three 8 byte integers
of payload and two 24 byte avl_tree_node_ts to handle its storage in
both an offset-sorted tree and a size-sorted tree (total size: 64
bytes). In the new model, the range seg structures are usually two 4
byte integers, but a separate one needs to exist for the size-sorted
and offset-sorted tree. Between the raw size, the 50% overhead, and
the double storage, the new btrees are expected to use 8*1.5*2 = 24
bytes per record, or 33.3% as much memory as the AVL trees (this is
for the purposes of storing metaslab range trees; for other purposes,
like scrubs, they use ~50% as much memory).
We reduced the size of the payload in the range segments by teaching
range trees about starting offsets and shifts; since metaslabs have a
fixed starting offset, and they all operate in terms of disk sectors,
we can store the ranges using 4-byte integers as long as the size of
the metaslab divided by the sector size is less than 2^32. For 512-byte
sectors, this is a 2^41 (or 2TB) metaslab, which with the default
settings corresponds to a 256PB disk. 4k sector disks can handle
metaslabs up to 2^46 bytes, or 2^63 byte disks. Since we do not
anticipate disks of this size in the near future, there should be
almost no cases where metaslabs need 64-byte integers to store their
ranges. We do still have the capability to store 64-byte integer ranges
to account for cases where we are storing per-vdev (or per-dnode) trees,
which could reasonably go above the limits discussed. We also do not
store fill information in the compact version of the node, since it
is only used for sorted scrub.
We also optimized the metaslab loading process in various other ways
to offset some inefficiencies in the btree model. While individual
operations (find, insert, remove_from) are faster for the btree than
they are for the avl tree, remove usually requires a find operation,
while in the AVL tree model the element itself suffices. Some clever
changes actually caused an overall speedup in metaslab loading; we use
approximately 40% less cpu to load metaslabs in our tests on Illumos.
Another memory and performance optimization was achieved by changing
what is stored in the size-sorted trees. When a disk is heavily
fragmented, the df algorithm used by default in ZFS will almost always
find a number of small regions in its initial cursor-based search; it
will usually only fall back to the size-sorted tree to find larger
regions. If we increase the size of the cursor-based search slightly,
and don't store segments that are smaller than a tunable size floor
in the size-sorted tree, we can further cut memory usage down to
below 20% of what the AVL trees store. This also results in further
reductions in CPU time spent loading metaslabs.
The 16KiB size floor was chosen because it results in substantial memory
usage reduction while not usually resulting in situations where we can't
find an appropriate chunk with the cursor and are forced to use an
oversized chunk from the size-sorted tree. In addition, even if we do
have to use an oversized chunk from the size-sorted tree, the chunk
would be too small to use for ZIL allocations, so it isn't as big of a
loss as it might otherwise be. And often, more small allocations will
follow the initial one, and the cursor search will now find the
remainder of the chunk we didn't use all of and use it for subsequent
allocations. Practical testing has shown little or no change in
fragmentation as a result of this change.
If the size-sorted tree becomes empty while the offset sorted one still
has entries, it will load all the entries from the offset sorted tree
and disregard the size floor until it is unloaded again. This operation
occurs rarely with the default setting, only on incredibly thoroughly
fragmented pools.
There are some other small changes to zdb to teach it to handle btrees,
but nothing major.
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sebastien Roy seb@delphix.com
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9181
2019-10-09 20:36:03 +03:00
|
|
|
return (TREE_ISIGN(rv));
|
2016-09-21 23:49:47 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
do_userquota_cacheflush(objset_t *os, userquota_cache_t *cache, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
void *cookie;
|
|
|
|
userquota_node_t *uqn;
|
|
|
|
|
|
|
|
ASSERT(dmu_tx_is_syncing(tx));
|
|
|
|
|
|
|
|
cookie = NULL;
|
|
|
|
while ((uqn = avl_destroy_nodes(&cache->uqc_user_deltas,
|
|
|
|
&cookie)) != NULL) {
|
2017-03-21 04:36:00 +03:00
|
|
|
/*
|
|
|
|
* os_userused_lock protects against concurrent calls to
|
|
|
|
* zap_increment_int(). It's needed because zap_increment_int()
|
|
|
|
* is not thread-safe (i.e. not atomic).
|
|
|
|
*/
|
|
|
|
mutex_enter(&os->os_userused_lock);
|
2016-09-21 23:49:47 +03:00
|
|
|
VERIFY0(zap_increment(os, DMU_USERUSED_OBJECT,
|
|
|
|
uqn->uqn_id, uqn->uqn_delta, tx));
|
2017-03-21 04:36:00 +03:00
|
|
|
mutex_exit(&os->os_userused_lock);
|
2016-09-21 23:49:47 +03:00
|
|
|
kmem_free(uqn, sizeof (*uqn));
|
|
|
|
}
|
|
|
|
avl_destroy(&cache->uqc_user_deltas);
|
|
|
|
|
|
|
|
cookie = NULL;
|
|
|
|
while ((uqn = avl_destroy_nodes(&cache->uqc_group_deltas,
|
|
|
|
&cookie)) != NULL) {
|
2017-03-21 04:36:00 +03:00
|
|
|
mutex_enter(&os->os_userused_lock);
|
2016-09-21 23:49:47 +03:00
|
|
|
VERIFY0(zap_increment(os, DMU_GROUPUSED_OBJECT,
|
|
|
|
uqn->uqn_id, uqn->uqn_delta, tx));
|
2017-03-21 04:36:00 +03:00
|
|
|
mutex_exit(&os->os_userused_lock);
|
2016-09-21 23:49:47 +03:00
|
|
|
kmem_free(uqn, sizeof (*uqn));
|
|
|
|
}
|
|
|
|
avl_destroy(&cache->uqc_group_deltas);
|
2018-02-14 01:54:54 +03:00
|
|
|
|
|
|
|
if (dmu_objset_projectquota_enabled(os)) {
|
|
|
|
cookie = NULL;
|
|
|
|
while ((uqn = avl_destroy_nodes(&cache->uqc_project_deltas,
|
|
|
|
&cookie)) != NULL) {
|
|
|
|
mutex_enter(&os->os_userused_lock);
|
|
|
|
VERIFY0(zap_increment(os, DMU_PROJECTUSED_OBJECT,
|
|
|
|
uqn->uqn_id, uqn->uqn_delta, tx));
|
|
|
|
mutex_exit(&os->os_userused_lock);
|
|
|
|
kmem_free(uqn, sizeof (*uqn));
|
|
|
|
}
|
|
|
|
avl_destroy(&cache->uqc_project_deltas);
|
|
|
|
}
|
2016-09-21 23:49:47 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
userquota_update_cache(avl_tree_t *avl, const char *id, int64_t delta)
|
|
|
|
{
|
|
|
|
userquota_node_t *uqn;
|
|
|
|
avl_index_t idx;
|
|
|
|
|
|
|
|
ASSERT(strlen(id) < sizeof (uqn->uqn_id));
|
|
|
|
/*
|
|
|
|
* Use id directly for searching because uqn_id is the first field of
|
|
|
|
* userquota_node_t and fields after uqn_id won't be accessed in
|
|
|
|
* avl_find().
|
|
|
|
*/
|
|
|
|
uqn = avl_find(avl, (const void *)id, &idx);
|
|
|
|
if (uqn == NULL) {
|
|
|
|
uqn = kmem_zalloc(sizeof (*uqn), KM_SLEEP);
|
2016-10-12 23:24:03 +03:00
|
|
|
strlcpy(uqn->uqn_id, id, sizeof (uqn->uqn_id));
|
2016-09-21 23:49:47 +03:00
|
|
|
avl_insert(avl, uqn, idx);
|
|
|
|
}
|
|
|
|
uqn->uqn_delta += delta;
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
static void
|
2018-02-14 01:54:54 +03:00
|
|
|
do_userquota_update(objset_t *os, userquota_cache_t *cache, uint64_t used,
|
|
|
|
uint64_t flags, uint64_t user, uint64_t group, uint64_t project,
|
|
|
|
boolean_t subtract)
|
2010-05-29 00:45:14 +04:00
|
|
|
{
|
2018-02-14 01:54:54 +03:00
|
|
|
if (flags & DNODE_FLAG_USERUSED_ACCOUNTED) {
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
int64_t delta = DNODE_MIN_SIZE + used;
|
2016-09-21 23:49:47 +03:00
|
|
|
char name[20];
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (subtract)
|
|
|
|
delta = -delta;
|
2016-09-21 23:49:47 +03:00
|
|
|
|
2020-06-07 21:42:12 +03:00
|
|
|
(void) snprintf(name, sizeof (name), "%llx", (longlong_t)user);
|
2016-09-21 23:49:47 +03:00
|
|
|
userquota_update_cache(&cache->uqc_user_deltas, name, delta);
|
|
|
|
|
2020-06-07 21:42:12 +03:00
|
|
|
(void) snprintf(name, sizeof (name), "%llx", (longlong_t)group);
|
2016-09-21 23:49:47 +03:00
|
|
|
userquota_update_cache(&cache->uqc_group_deltas, name, delta);
|
2018-02-14 01:54:54 +03:00
|
|
|
|
|
|
|
if (dmu_objset_projectquota_enabled(os)) {
|
2020-06-07 21:42:12 +03:00
|
|
|
(void) snprintf(name, sizeof (name), "%llx",
|
|
|
|
(longlong_t)project);
|
2018-02-14 01:54:54 +03:00
|
|
|
userquota_update_cache(&cache->uqc_project_deltas,
|
|
|
|
name, delta);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-10-04 21:46:10 +03:00
|
|
|
static void
|
2018-02-14 01:54:54 +03:00
|
|
|
do_userobjquota_update(objset_t *os, userquota_cache_t *cache, uint64_t flags,
|
|
|
|
uint64_t user, uint64_t group, uint64_t project, boolean_t subtract)
|
2016-10-04 21:46:10 +03:00
|
|
|
{
|
|
|
|
if (flags & DNODE_FLAG_USEROBJUSED_ACCOUNTED) {
|
|
|
|
char name[20 + DMU_OBJACCT_PREFIX_LEN];
|
2016-09-21 23:49:47 +03:00
|
|
|
int delta = subtract ? -1 : 1;
|
2016-10-04 21:46:10 +03:00
|
|
|
|
|
|
|
(void) snprintf(name, sizeof (name), DMU_OBJACCT_PREFIX "%llx",
|
|
|
|
(longlong_t)user);
|
2016-09-21 23:49:47 +03:00
|
|
|
userquota_update_cache(&cache->uqc_user_deltas, name, delta);
|
2016-10-04 21:46:10 +03:00
|
|
|
|
|
|
|
(void) snprintf(name, sizeof (name), DMU_OBJACCT_PREFIX "%llx",
|
|
|
|
(longlong_t)group);
|
2016-09-21 23:49:47 +03:00
|
|
|
userquota_update_cache(&cache->uqc_group_deltas, name, delta);
|
2018-02-14 01:54:54 +03:00
|
|
|
|
|
|
|
if (dmu_objset_projectquota_enabled(os)) {
|
|
|
|
(void) snprintf(name, sizeof (name),
|
|
|
|
DMU_OBJACCT_PREFIX "%llx", (longlong_t)project);
|
|
|
|
userquota_update_cache(&cache->uqc_project_deltas,
|
|
|
|
name, delta);
|
|
|
|
}
|
2016-10-04 21:46:10 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2017-03-21 04:36:00 +03:00
|
|
|
typedef struct userquota_updates_arg {
|
|
|
|
objset_t *uua_os;
|
|
|
|
int uua_sublist_idx;
|
|
|
|
dmu_tx_t *uua_tx;
|
|
|
|
} userquota_updates_arg_t;
|
|
|
|
|
|
|
|
static void
|
|
|
|
userquota_updates_task(void *arg)
|
2009-07-03 02:44:48 +04:00
|
|
|
{
|
2017-03-21 04:36:00 +03:00
|
|
|
userquota_updates_arg_t *uua = arg;
|
|
|
|
objset_t *os = uua->uua_os;
|
|
|
|
dmu_tx_t *tx = uua->uua_tx;
|
2009-07-03 02:44:48 +04:00
|
|
|
dnode_t *dn;
|
2016-09-21 23:49:47 +03:00
|
|
|
userquota_cache_t cache = { { 0 } };
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2024-04-10 02:23:19 +03:00
|
|
|
multilist_sublist_t *list = multilist_sublist_lock_idx(
|
|
|
|
&os->os_synced_dnodes, uua->uua_sublist_idx);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2017-03-21 04:36:00 +03:00
|
|
|
ASSERT(multilist_sublist_head(list) == NULL ||
|
|
|
|
dmu_objset_userused_enabled(os));
|
2016-09-21 23:49:47 +03:00
|
|
|
avl_create(&cache.uqc_user_deltas, userquota_compare,
|
|
|
|
sizeof (userquota_node_t), offsetof(userquota_node_t, uqn_node));
|
|
|
|
avl_create(&cache.uqc_group_deltas, userquota_compare,
|
|
|
|
sizeof (userquota_node_t), offsetof(userquota_node_t, uqn_node));
|
2018-02-14 01:54:54 +03:00
|
|
|
if (dmu_objset_projectquota_enabled(os))
|
|
|
|
avl_create(&cache.uqc_project_deltas, userquota_compare,
|
|
|
|
sizeof (userquota_node_t), offsetof(userquota_node_t,
|
|
|
|
uqn_node));
|
2016-09-21 23:49:47 +03:00
|
|
|
|
2017-03-21 04:36:00 +03:00
|
|
|
while ((dn = multilist_sublist_head(list)) != NULL) {
|
2010-08-27 01:24:34 +04:00
|
|
|
int flags;
|
2009-07-03 02:44:48 +04:00
|
|
|
ASSERT(!DMU_OBJECT_IS_SPECIAL(dn->dn_object));
|
|
|
|
ASSERT(dn->dn_phys->dn_type == DMU_OT_NONE ||
|
|
|
|
dn->dn_phys->dn_flags &
|
|
|
|
DNODE_FLAG_USERUSED_ACCOUNTED);
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
flags = dn->dn_id_flags;
|
|
|
|
ASSERT(flags);
|
|
|
|
if (flags & DN_ID_OLD_EXIST) {
|
2018-02-14 01:54:54 +03:00
|
|
|
do_userquota_update(os, &cache, dn->dn_oldused,
|
|
|
|
dn->dn_oldflags, dn->dn_olduid, dn->dn_oldgid,
|
|
|
|
dn->dn_oldprojid, B_TRUE);
|
|
|
|
do_userobjquota_update(os, &cache, dn->dn_oldflags,
|
|
|
|
dn->dn_olduid, dn->dn_oldgid,
|
|
|
|
dn->dn_oldprojid, B_TRUE);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2010-08-27 01:24:34 +04:00
|
|
|
if (flags & DN_ID_NEW_EXIST) {
|
2018-02-14 01:54:54 +03:00
|
|
|
do_userquota_update(os, &cache,
|
2016-09-21 23:49:47 +03:00
|
|
|
DN_USED_BYTES(dn->dn_phys), dn->dn_phys->dn_flags,
|
2018-02-14 01:54:54 +03:00
|
|
|
dn->dn_newuid, dn->dn_newgid,
|
|
|
|
dn->dn_newprojid, B_FALSE);
|
|
|
|
do_userobjquota_update(os, &cache,
|
|
|
|
dn->dn_phys->dn_flags, dn->dn_newuid, dn->dn_newgid,
|
|
|
|
dn->dn_newprojid, B_FALSE);
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
mutex_enter(&dn->dn_mtx);
|
2010-05-29 00:45:14 +04:00
|
|
|
dn->dn_oldused = 0;
|
|
|
|
dn->dn_oldflags = 0;
|
|
|
|
if (dn->dn_id_flags & DN_ID_NEW_EXIST) {
|
|
|
|
dn->dn_olduid = dn->dn_newuid;
|
|
|
|
dn->dn_oldgid = dn->dn_newgid;
|
2018-02-14 01:54:54 +03:00
|
|
|
dn->dn_oldprojid = dn->dn_newprojid;
|
2010-05-29 00:45:14 +04:00
|
|
|
dn->dn_id_flags |= DN_ID_OLD_EXIST;
|
|
|
|
if (dn->dn_bonuslen == 0)
|
|
|
|
dn->dn_id_flags |= DN_ID_CHKED_SPILL;
|
|
|
|
else
|
|
|
|
dn->dn_id_flags |= DN_ID_CHKED_BONUS;
|
|
|
|
}
|
|
|
|
dn->dn_id_flags &= ~(DN_ID_NEW_EXIST);
|
2009-07-03 02:44:48 +04:00
|
|
|
mutex_exit(&dn->dn_mtx);
|
|
|
|
|
2017-03-21 04:36:00 +03:00
|
|
|
multilist_sublist_remove(list, dn);
|
2021-06-10 19:42:31 +03:00
|
|
|
dnode_rele(dn, &os->os_synced_dnodes);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
2016-09-21 23:49:47 +03:00
|
|
|
do_userquota_cacheflush(os, &cache, tx);
|
2017-03-21 04:36:00 +03:00
|
|
|
multilist_sublist_unlock(list);
|
|
|
|
kmem_free(uua, sizeof (*uua));
|
|
|
|
}
|
|
|
|
|
Improve zfs receive performance with lightweight write
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes. Much of the CPU is spent creating and
destroying dbuf's and arc buf's, one for each `WRITE` record in the send
stream.
This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this
"dirty leaf block" are not instantiated.
Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data). Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.
Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.
Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.
Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
The code is also restructured in a few ways:
Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines. The new field is needed by the lightweight-type dirty record.
To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that the `dnode_move()` doesn't relocate the
dnode_t. To do this we keep a hold on the dnode until it's zio's have
completed. This is already done by the user-accounting code
(`userquota_updates_task()`), this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).
`dn_dirty_txg` was previously zeroed when the dnode was synced. This
was not necessary, since its meaning can be "when was this dnode last
dirtied". This change simplifies the new `dnode_rele_task()` code.
Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
2020-12-11 21:26:02 +03:00
|
|
|
/*
|
|
|
|
* Release dnode holds from dmu_objset_sync_dnodes(). When the dnode is being
|
|
|
|
* synced (i.e. we have issued the zio's for blocks in the dnode), it can't be
|
|
|
|
* evicted because the block containing the dnode can't be evicted until it is
|
|
|
|
* written out. However, this hold is necessary to prevent the dnode_t from
|
|
|
|
* being moved (via dnode_move()) while it's still referenced by
|
|
|
|
* dbuf_dirty_record_t:dr_dnode. And dr_dnode is needed for
|
|
|
|
* dirty_lightweight_leaf-type dirty records.
|
|
|
|
*
|
|
|
|
* If we are doing user-object accounting, the dnode_rele() happens from
|
|
|
|
* userquota_updates_task() instead.
|
|
|
|
*/
|
|
|
|
static void
|
|
|
|
dnode_rele_task(void *arg)
|
2017-03-21 04:36:00 +03:00
|
|
|
{
|
Improve zfs receive performance with lightweight write
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes. Much of the CPU is spent creating and
destroying dbuf's and arc buf's, one for each `WRITE` record in the send
stream.
This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this
"dirty leaf block" are not instantiated.
Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data). Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.
Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.
Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.
Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
The code is also restructured in a few ways:
Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines. The new field is needed by the lightweight-type dirty record.
To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that the `dnode_move()` doesn't relocate the
dnode_t. To do this we keep a hold on the dnode until it's zio's have
completed. This is already done by the user-accounting code
(`userquota_updates_task()`), this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).
`dn_dirty_txg` was previously zeroed when the dnode was synced. This
was not necessary, since its meaning can be "when was this dnode last
dirtied". This change simplifies the new `dnode_rele_task()` code.
Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
2020-12-11 21:26:02 +03:00
|
|
|
userquota_updates_arg_t *uua = arg;
|
|
|
|
objset_t *os = uua->uua_os;
|
|
|
|
|
2024-04-10 02:23:19 +03:00
|
|
|
multilist_sublist_t *list = multilist_sublist_lock_idx(
|
|
|
|
&os->os_synced_dnodes, uua->uua_sublist_idx);
|
2019-06-25 22:03:38 +03:00
|
|
|
|
Improve zfs receive performance with lightweight write
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes. Much of the CPU is spent creating and
destroying dbuf's and arc buf's, one for each `WRITE` record in the send
stream.
This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this
"dirty leaf block" are not instantiated.
Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data). Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.
Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.
Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.
Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
The code is also restructured in a few ways:
Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines. The new field is needed by the lightweight-type dirty record.
To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that the `dnode_move()` doesn't relocate the
dnode_t. To do this we keep a hold on the dnode until it's zio's have
completed. This is already done by the user-accounting code
(`userquota_updates_task()`), this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).
`dn_dirty_txg` was previously zeroed when the dnode was synced. This
was not necessary, since its meaning can be "when was this dnode last
dirtied". This change simplifies the new `dnode_rele_task()` code.
Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
2020-12-11 21:26:02 +03:00
|
|
|
dnode_t *dn;
|
|
|
|
while ((dn = multilist_sublist_head(list)) != NULL) {
|
|
|
|
multilist_sublist_remove(list, dn);
|
2021-06-10 19:42:31 +03:00
|
|
|
dnode_rele(dn, &os->os_synced_dnodes);
|
Improve zfs receive performance with lightweight write
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes. Much of the CPU is spent creating and
destroying dbuf's and arc buf's, one for each `WRITE` record in the send
stream.
This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this
"dirty leaf block" are not instantiated.
Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data). Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.
Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.
Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.
Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
The code is also restructured in a few ways:
Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines. The new field is needed by the lightweight-type dirty record.
To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that the `dnode_move()` doesn't relocate the
dnode_t. To do this we keep a hold on the dnode until it's zio's have
completed. This is already done by the user-accounting code
(`userquota_updates_task()`), this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).
`dn_dirty_txg` was previously zeroed when the dnode was synced. This
was not necessary, since its meaning can be "when was this dnode last
dirtied". This change simplifies the new `dnode_rele_task()` code.
Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
2020-12-11 21:26:02 +03:00
|
|
|
}
|
|
|
|
multilist_sublist_unlock(list);
|
|
|
|
kmem_free(uua, sizeof (*uua));
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Return TRUE if userquota updates are needed.
|
|
|
|
*/
|
|
|
|
static boolean_t
|
|
|
|
dmu_objset_do_userquota_updates_prep(objset_t *os, dmu_tx_t *tx)
|
|
|
|
{
|
2017-03-21 04:36:00 +03:00
|
|
|
if (!dmu_objset_userused_enabled(os))
|
Improve zfs receive performance with lightweight write
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes. Much of the CPU is spent creating and
destroying dbuf's and arc buf's, one for each `WRITE` record in the send
stream.
This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this
"dirty leaf block" are not instantiated.
Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data). Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.
Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.
Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.
Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
The code is also restructured in a few ways:
Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines. The new field is needed by the lightweight-type dirty record.
To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that the `dnode_move()` doesn't relocate the
dnode_t. To do this we keep a hold on the dnode until it's zio's have
completed. This is already done by the user-accounting code
(`userquota_updates_task()`), this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).
`dn_dirty_txg` was previously zeroed when the dnode was synced. This
was not necessary, since its meaning can be "when was this dnode last
dirtied". This change simplifies the new `dnode_rele_task()` code.
Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
2020-12-11 21:26:02 +03:00
|
|
|
return (B_FALSE);
|
2017-03-21 04:36:00 +03:00
|
|
|
|
2018-02-21 03:27:31 +03:00
|
|
|
/*
|
|
|
|
* If this is a raw receive just return and handle accounting
|
|
|
|
* later when we have the keys loaded. We also don't do user
|
|
|
|
* accounting during claiming since the datasets are not owned
|
|
|
|
* for the duration of claiming and this txg should only be
|
|
|
|
* used for recovery.
|
|
|
|
*/
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (os->os_encrypted && dmu_objset_is_receiving(os))
|
Improve zfs receive performance with lightweight write
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes. Much of the CPU is spent creating and
destroying dbuf's and arc buf's, one for each `WRITE` record in the send
stream.
This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this
"dirty leaf block" are not instantiated.
Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data). Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.
Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.
Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.
Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
The code is also restructured in a few ways:
Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines. The new field is needed by the lightweight-type dirty record.
To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that the `dnode_move()` doesn't relocate the
dnode_t. To do this we keep a hold on the dnode until it's zio's have
completed. This is already done by the user-accounting code
(`userquota_updates_task()`), this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).
`dn_dirty_txg` was previously zeroed when the dnode was synced. This
was not necessary, since its meaning can be "when was this dnode last
dirtied". This change simplifies the new `dnode_rele_task()` code.
Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
2020-12-11 21:26:02 +03:00
|
|
|
return (B_FALSE);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
|
2018-02-21 03:27:31 +03:00
|
|
|
if (tx->tx_txg <= os->os_spa->spa_claim_max_txg)
|
Improve zfs receive performance with lightweight write
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes. Much of the CPU is spent creating and
destroying dbuf's and arc buf's, one for each `WRITE` record in the send
stream.
This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this
"dirty leaf block" are not instantiated.
Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data). Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.
Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.
Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.
Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
The code is also restructured in a few ways:
Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines. The new field is needed by the lightweight-type dirty record.
To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that the `dnode_move()` doesn't relocate the
dnode_t. To do this we keep a hold on the dnode until it's zio's have
completed. This is already done by the user-accounting code
(`userquota_updates_task()`), this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).
`dn_dirty_txg` was previously zeroed when the dnode was synced. This
was not necessary, since its meaning can be "when was this dnode last
dirtied". This change simplifies the new `dnode_rele_task()` code.
Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
2020-12-11 21:26:02 +03:00
|
|
|
return (B_FALSE);
|
2018-02-21 03:27:31 +03:00
|
|
|
|
2018-02-14 01:54:54 +03:00
|
|
|
/* Allocate the user/group/project used objects if necessary. */
|
2017-03-21 04:36:00 +03:00
|
|
|
if (DMU_USERUSED_DNODE(os)->dn_type == DMU_OT_NONE) {
|
|
|
|
VERIFY0(zap_create_claim(os,
|
|
|
|
DMU_USERUSED_OBJECT,
|
|
|
|
DMU_OT_USERGROUP_USED, DMU_OT_NONE, 0, tx));
|
|
|
|
VERIFY0(zap_create_claim(os,
|
|
|
|
DMU_GROUPUSED_OBJECT,
|
|
|
|
DMU_OT_USERGROUP_USED, DMU_OT_NONE, 0, tx));
|
|
|
|
}
|
|
|
|
|
2018-02-14 01:54:54 +03:00
|
|
|
if (dmu_objset_projectquota_enabled(os) &&
|
|
|
|
DMU_PROJECTUSED_DNODE(os)->dn_type == DMU_OT_NONE) {
|
|
|
|
VERIFY0(zap_create_claim(os, DMU_PROJECTUSED_OBJECT,
|
|
|
|
DMU_OT_USERGROUP_USED, DMU_OT_NONE, 0, tx));
|
|
|
|
}
|
Improve zfs receive performance with lightweight write
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes. Much of the CPU is spent creating and
destroying dbuf's and arc buf's, one for each `WRITE` record in the send
stream.
This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this
"dirty leaf block" are not instantiated.
Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data). Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.
Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.
Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.
Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
The code is also restructured in a few ways:
Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines. The new field is needed by the lightweight-type dirty record.
To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that the `dnode_move()` doesn't relocate the
dnode_t. To do this we keep a hold on the dnode until it's zio's have
completed. This is already done by the user-accounting code
(`userquota_updates_task()`), this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).
`dn_dirty_txg` was previously zeroed when the dnode was synced. This
was not necessary, since its meaning can be "when was this dnode last
dirtied". This change simplifies the new `dnode_rele_task()` code.
Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
2020-12-11 21:26:02 +03:00
|
|
|
return (B_TRUE);
|
|
|
|
}
|
2018-02-14 01:54:54 +03:00
|
|
|
|
Improve zfs receive performance with lightweight write
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes. Much of the CPU is spent creating and
destroying dbuf's and arc buf's, one for each `WRITE` record in the send
stream.
This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this
"dirty leaf block" are not instantiated.
Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data). Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.
Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.
Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.
Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
The code is also restructured in a few ways:
Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines. The new field is needed by the lightweight-type dirty record.
To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that the `dnode_move()` doesn't relocate the
dnode_t. To do this we keep a hold on the dnode until it's zio's have
completed. This is already done by the user-accounting code
(`userquota_updates_task()`), this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).
`dn_dirty_txg` was previously zeroed when the dnode was synced. This
was not necessary, since its meaning can be "when was this dnode last
dirtied". This change simplifies the new `dnode_rele_task()` code.
Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
2020-12-11 21:26:02 +03:00
|
|
|
/*
|
|
|
|
* Dispatch taskq tasks to dp_sync_taskq to update the user accounting, and
|
|
|
|
* also release the holds on the dnodes from dmu_objset_sync_dnodes().
|
|
|
|
* The caller must taskq_wait(dp_sync_taskq).
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
dmu_objset_sync_done(objset_t *os, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
boolean_t need_userquota = dmu_objset_do_userquota_updates_prep(os, tx);
|
|
|
|
|
2021-06-10 19:42:31 +03:00
|
|
|
int num_sublists = multilist_get_num_sublists(&os->os_synced_dnodes);
|
2019-06-25 22:03:38 +03:00
|
|
|
for (int i = 0; i < num_sublists; i++) {
|
2017-03-21 04:36:00 +03:00
|
|
|
userquota_updates_arg_t *uua =
|
|
|
|
kmem_alloc(sizeof (*uua), KM_SLEEP);
|
|
|
|
uua->uua_os = os;
|
|
|
|
uua->uua_sublist_idx = i;
|
|
|
|
uua->uua_tx = tx;
|
Improve zfs receive performance with lightweight write
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes. Much of the CPU is spent creating and
destroying dbuf's and arc buf's, one for each `WRITE` record in the send
stream.
This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this
"dirty leaf block" are not instantiated.
Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data). Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.
Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.
Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.
Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
The code is also restructured in a few ways:
Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines. The new field is needed by the lightweight-type dirty record.
To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that the `dnode_move()` doesn't relocate the
dnode_t. To do this we keep a hold on the dnode until it's zio's have
completed. This is already done by the user-accounting code
(`userquota_updates_task()`), this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).
`dn_dirty_txg` was previously zeroed when the dnode was synced. This
was not necessary, since its meaning can be "when was this dnode last
dirtied". This change simplifies the new `dnode_rele_task()` code.
Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
2020-12-11 21:26:02 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we don't need to update userquotas, use
|
|
|
|
* dnode_rele_task() to call dnode_rele()
|
|
|
|
*/
|
2017-03-21 04:36:00 +03:00
|
|
|
(void) taskq_dispatch(dmu_objset_pool(os)->dp_sync_taskq,
|
Improve zfs receive performance with lightweight write
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes. Much of the CPU is spent creating and
destroying dbuf's and arc buf's, one for each `WRITE` record in the send
stream.
This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this
"dirty leaf block" are not instantiated.
Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data). Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.
Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.
Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.
Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
The code is also restructured in a few ways:
Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines. The new field is needed by the lightweight-type dirty record.
To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that the `dnode_move()` doesn't relocate the
dnode_t. To do this we keep a hold on the dnode until it's zio's have
completed. This is already done by the user-accounting code
(`userquota_updates_task()`), this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).
`dn_dirty_txg` was previously zeroed when the dnode was synced. This
was not necessary, since its meaning can be "when was this dnode last
dirtied". This change simplifies the new `dnode_rele_task()` code.
Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
2020-12-11 21:26:02 +03:00
|
|
|
need_userquota ? userquota_updates_task : dnode_rele_task,
|
|
|
|
uua, 0);
|
2017-03-21 04:36:00 +03:00
|
|
|
/* callback frees uua */
|
|
|
|
}
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
|
|
|
|
Improve zfs receive performance with lightweight write
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes. Much of the CPU is spent creating and
destroying dbuf's and arc buf's, one for each `WRITE` record in the send
stream.
This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this
"dirty leaf block" are not instantiated.
Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data). Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.
Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.
Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.
Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
The code is also restructured in a few ways:
Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines. The new field is needed by the lightweight-type dirty record.
To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that the `dnode_move()` doesn't relocate the
dnode_t. To do this we keep a hold on the dnode until it's zio's have
completed. This is already done by the user-accounting code
(`userquota_updates_task()`), this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).
`dn_dirty_txg` was previously zeroed when the dnode was synced. This
was not necessary, since its meaning can be "when was this dnode last
dirtied". This change simplifies the new `dnode_rele_task()` code.
Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
2020-12-11 21:26:02 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Returns a pointer to data to find uid/gid from
|
|
|
|
*
|
|
|
|
* If a dirty record for transaction group that is syncing can't
|
|
|
|
* be found then NULL is returned. In the NULL case it is assumed
|
|
|
|
* the uid/gid aren't changing.
|
|
|
|
*/
|
|
|
|
static void *
|
|
|
|
dmu_objset_userquota_find_data(dmu_buf_impl_t *db, dmu_tx_t *tx)
|
|
|
|
{
|
2020-02-05 22:07:19 +03:00
|
|
|
dbuf_dirty_record_t *dr;
|
2010-05-29 00:45:14 +04:00
|
|
|
void *data;
|
|
|
|
|
|
|
|
if (db->db_dirtycnt == 0)
|
|
|
|
return (db->db.db_data); /* Nothing is changing */
|
|
|
|
|
2020-02-05 22:07:19 +03:00
|
|
|
dr = dbuf_find_dirty_eq(db, tx->tx_txg);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2010-08-27 01:24:34 +04:00
|
|
|
if (dr == NULL) {
|
2010-05-29 00:45:14 +04:00
|
|
|
data = NULL;
|
2010-08-27 01:24:34 +04:00
|
|
|
} else {
|
Improve zfs receive performance with lightweight write
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes. Much of the CPU is spent creating and
destroying dbuf's and arc buf's, one for each `WRITE` record in the send
stream.
This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this
"dirty leaf block" are not instantiated.
Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data). Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.
Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.
Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.
Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
The code is also restructured in a few ways:
Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines. The new field is needed by the lightweight-type dirty record.
To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that the `dnode_move()` doesn't relocate the
dnode_t. To do this we keep a hold on the dnode until it's zio's have
completed. This is already done by the user-accounting code
(`userquota_updates_task()`), this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).
`dn_dirty_txg` was previously zeroed when the dnode was synced. This
was not necessary, since its meaning can be "when was this dnode last
dirtied". This change simplifies the new `dnode_rele_task()` code.
Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
2020-12-11 21:26:02 +03:00
|
|
|
if (dr->dr_dnode->dn_bonuslen == 0 &&
|
2010-08-27 01:24:34 +04:00
|
|
|
dr->dr_dbuf->db_blkid == DMU_SPILL_BLKID)
|
|
|
|
data = dr->dt.dl.dr_data->b_data;
|
|
|
|
else
|
|
|
|
data = dr->dt.dl.dr_data;
|
|
|
|
}
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
return (data);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
dmu_objset_userquota_get_ids(dnode_t *dn, boolean_t before, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
objset_t *os = dn->dn_objset;
|
|
|
|
void *data = NULL;
|
|
|
|
dmu_buf_impl_t *db = NULL;
|
|
|
|
int flags = dn->dn_id_flags;
|
|
|
|
int error;
|
|
|
|
boolean_t have_spill = B_FALSE;
|
|
|
|
|
|
|
|
if (!dmu_objset_userused_enabled(dn->dn_objset))
|
|
|
|
return;
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
/*
|
|
|
|
* Raw receives introduce a problem with user accounting. Raw
|
|
|
|
* receives cannot update the user accounting info because the
|
|
|
|
* user ids and the sizes are encrypted. To guarantee that we
|
|
|
|
* never end up with bad user accounting, we simply disable it
|
|
|
|
* during raw receives. We also disable this for normal receives
|
|
|
|
* so that an incremental raw receive may be done on top of an
|
|
|
|
* existing non-raw receive.
|
|
|
|
*/
|
|
|
|
if (os->os_encrypted && dmu_objset_is_receiving(os))
|
|
|
|
return;
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (before && (flags & (DN_ID_CHKED_BONUS|DN_ID_OLD_EXIST|
|
|
|
|
DN_ID_CHKED_SPILL)))
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (before && dn->dn_bonuslen != 0)
|
|
|
|
data = DN_BONUS(dn->dn_phys);
|
|
|
|
else if (!before && dn->dn_bonuslen != 0) {
|
2015-09-25 21:15:02 +03:00
|
|
|
if (dn->dn_bonus) {
|
|
|
|
db = dn->dn_bonus;
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_enter(&db->db_mtx);
|
|
|
|
data = dmu_objset_userquota_find_data(db, tx);
|
|
|
|
} else {
|
|
|
|
data = DN_BONUS(dn->dn_phys);
|
|
|
|
}
|
|
|
|
} else if (dn->dn_bonuslen == 0 && dn->dn_bonustype == DMU_OT_SA) {
|
|
|
|
int rf = 0;
|
|
|
|
|
|
|
|
if (RW_WRITE_HELD(&dn->dn_struct_rwlock))
|
|
|
|
rf |= DB_RF_HAVESTRUCT;
|
2010-08-27 01:24:34 +04:00
|
|
|
error = dmu_spill_hold_by_dnode(dn,
|
|
|
|
rf | DB_RF_MUST_SUCCEED,
|
2010-05-29 00:45:14 +04:00
|
|
|
FTAG, (dmu_buf_t **)&db);
|
|
|
|
ASSERT(error == 0);
|
|
|
|
mutex_enter(&db->db_mtx);
|
|
|
|
data = (before) ? db->db.db_data :
|
|
|
|
dmu_objset_userquota_find_data(db, tx);
|
|
|
|
have_spill = B_TRUE;
|
|
|
|
} else {
|
|
|
|
mutex_enter(&dn->dn_mtx);
|
|
|
|
dn->dn_id_flags |= DN_ID_CHKED_BONUS;
|
|
|
|
mutex_exit(&dn->dn_mtx);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Must always call the callback in case the object
|
|
|
|
* type has changed and that type isn't an object type to track
|
|
|
|
*/
|
File incorrectly zeroed when receiving incremental stream that toggles -L
Background:
By increasing the recordsize property above the default of 128KB, a
filesystem may have "large" blocks. By default, a send stream of such a
filesystem does not contain large WRITE records, instead it decreases
objects' block sizes to 128KB and splits the large blocks into 128KB
blocks, allowing the large-block filesystem to be received by a system
that does not support the `large_blocks` feature. A send stream
generated by `zfs send -L` (or `--large-block`) preserves the large
block size on the receiving system, by using large WRITE records.
When receiving an incremental send stream for a filesystem with large
blocks, if the send stream's -L flag was toggled, a bug is encountered
in which the file's contents are incorrectly zeroed out. The contents
of any blocks that were not modified by this send stream will be lost.
"Toggled" means that the previous send used `-L`, but this incremental
does not use `-L` (-L to no-L); or that the previous send did not use
`-L`, but this incremental does use `-L` (no-L to -L).
Changes:
This commit addresses the problem with several changes to the semantics
of zfs send/receive:
1. "-L to no-L" incrementals are rejected. If the previous send used
`-L`, but this incremental does not use `-L`, the `zfs receive` will
fail with this error message:
incremental send stream requires -L (--large-block), to match
previous receive.
2. "no-L to -L" incrementals are handled correctly, preserving the
smaller (128KB) block size of any already-received files that used large
blocks on the sending system but were split by `zfs send` without the
`-L` flag.
3. A new send stream format flag is added, `SWITCH_TO_LARGE_BLOCKS`.
This feature indicates that we can correctly handle "no-L to -L"
incrementals. This flag is currently not set on any send streams. In
the future, we intend for incremental send streams of snapshots that
have large blocks to use `-L` by default, and these streams will also
have the `SWITCH_TO_LARGE_BLOCKS` feature set. This ensures that streams
from the default use of `zfs send` won't encounter the bug mentioned
above, because they can't be received by software with the bug.
Implementation notes:
To facilitate accessing the ZPL's generation number,
`zfs_space_delta_cb()` has been renamed to `zpl_get_file_info()` and
restructured to fill in a struct with ZPL-specific info including owner
and generation.
In the "no-L to -L" case, if this is a compressed send stream (from
`zfs send -cL`), large WRITE records that are being written to small
(128KB) blocksize files need to be decompressed so that they can be
written split up into multiple blocks. The zio pipeline will recompress
each smaller block individually.
A new test case, `send-L_toggle`, is added, which tests the "no-L to -L"
case and verifies that we get an error for the "-L to no-L" case.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #6224
Closes #10383
2020-06-09 20:41:01 +03:00
|
|
|
zfs_file_info_t zfi;
|
|
|
|
error = file_cbs[os->os_phys->os_type](dn->dn_bonustype, data, &zfi);
|
|
|
|
|
|
|
|
if (before) {
|
|
|
|
ASSERT(data);
|
|
|
|
dn->dn_olduid = zfi.zfi_user;
|
|
|
|
dn->dn_oldgid = zfi.zfi_group;
|
|
|
|
dn->dn_oldprojid = zfi.zfi_project;
|
|
|
|
} else if (data) {
|
|
|
|
dn->dn_newuid = zfi.zfi_user;
|
|
|
|
dn->dn_newgid = zfi.zfi_group;
|
|
|
|
dn->dn_newprojid = zfi.zfi_project;
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Preserve existing uid/gid when the callback can't determine
|
|
|
|
* what the new uid/gid are and the callback returned EEXIST.
|
|
|
|
* The EEXIST error tells us to just use the existing uid/gid.
|
|
|
|
* If we don't know what the old values are then just assign
|
|
|
|
* them to 0, since that is a new file being created.
|
|
|
|
*/
|
|
|
|
if (!before && data == NULL && error == EEXIST) {
|
|
|
|
if (flags & DN_ID_OLD_EXIST) {
|
|
|
|
dn->dn_newuid = dn->dn_olduid;
|
|
|
|
dn->dn_newgid = dn->dn_oldgid;
|
2018-03-05 23:56:27 +03:00
|
|
|
dn->dn_newprojid = dn->dn_oldprojid;
|
2010-05-29 00:45:14 +04:00
|
|
|
} else {
|
|
|
|
dn->dn_newuid = 0;
|
|
|
|
dn->dn_newgid = 0;
|
2018-02-14 01:54:54 +03:00
|
|
|
dn->dn_newprojid = ZFS_DEFAULT_PROJID;
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
error = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (db)
|
|
|
|
mutex_exit(&db->db_mtx);
|
|
|
|
|
|
|
|
mutex_enter(&dn->dn_mtx);
|
|
|
|
if (error == 0 && before)
|
|
|
|
dn->dn_id_flags |= DN_ID_OLD_EXIST;
|
|
|
|
if (error == 0 && !before)
|
|
|
|
dn->dn_id_flags |= DN_ID_NEW_EXIST;
|
|
|
|
|
|
|
|
if (have_spill) {
|
|
|
|
dn->dn_id_flags |= DN_ID_CHKED_SPILL;
|
|
|
|
} else {
|
|
|
|
dn->dn_id_flags |= DN_ID_CHKED_BONUS;
|
|
|
|
}
|
|
|
|
mutex_exit(&dn->dn_mtx);
|
2015-09-25 21:15:02 +03:00
|
|
|
if (have_spill)
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_buf_rele((dmu_buf_t *)db, FTAG);
|
|
|
|
}
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
boolean_t
|
|
|
|
dmu_objset_userspace_present(objset_t *os)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
return (os->os_phys->os_flags &
|
2009-07-03 02:44:48 +04:00
|
|
|
OBJSET_FLAG_USERACCOUNTING_COMPLETE);
|
|
|
|
}
|
|
|
|
|
2016-10-04 21:46:10 +03:00
|
|
|
boolean_t
|
|
|
|
dmu_objset_userobjspace_present(objset_t *os)
|
|
|
|
{
|
|
|
|
return (os->os_phys->os_flags &
|
|
|
|
OBJSET_FLAG_USEROBJACCOUNTING_COMPLETE);
|
|
|
|
}
|
|
|
|
|
2018-02-14 01:54:54 +03:00
|
|
|
boolean_t
|
|
|
|
dmu_objset_projectquota_present(objset_t *os)
|
|
|
|
{
|
|
|
|
return (os->os_phys->os_flags &
|
|
|
|
OBJSET_FLAG_PROJECTQUOTA_COMPLETE);
|
|
|
|
}
|
|
|
|
|
2016-10-04 21:46:10 +03:00
|
|
|
static int
|
|
|
|
dmu_objset_space_upgrade(objset_t *os)
|
2009-07-03 02:44:48 +04:00
|
|
|
{
|
|
|
|
uint64_t obj;
|
|
|
|
int err = 0;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We simply need to mark every object dirty, so that it will be
|
|
|
|
* synced out and now accounted. If this is called
|
|
|
|
* concurrently, or if we already did some work before crashing,
|
|
|
|
* that's fine, since we track each object's accounted state
|
|
|
|
* independently.
|
|
|
|
*/
|
|
|
|
|
|
|
|
for (obj = 0; err == 0; err = dmu_object_next(os, &obj, FALSE, 0)) {
|
2009-08-18 22:43:27 +04:00
|
|
|
dmu_tx_t *tx;
|
2009-07-03 02:44:48 +04:00
|
|
|
dmu_buf_t *db;
|
|
|
|
int objerr;
|
|
|
|
|
2016-10-04 21:46:10 +03:00
|
|
|
mutex_enter(&os->os_upgrade_lock);
|
|
|
|
if (os->os_upgrade_exit)
|
|
|
|
err = SET_ERROR(EINTR);
|
|
|
|
mutex_exit(&os->os_upgrade_lock);
|
|
|
|
if (err != 0)
|
|
|
|
return (err);
|
|
|
|
|
2024-05-29 20:49:11 +03:00
|
|
|
if (issig())
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(EINTR));
|
2009-07-03 02:44:48 +04:00
|
|
|
|
|
|
|
objerr = dmu_bonus_hold(os, obj, FTAG, &db);
|
2013-09-04 16:00:57 +04:00
|
|
|
if (objerr != 0)
|
2009-07-03 02:44:48 +04:00
|
|
|
continue;
|
2009-08-18 22:43:27 +04:00
|
|
|
tx = dmu_tx_create(os);
|
2009-07-03 02:44:48 +04:00
|
|
|
dmu_tx_hold_bonus(tx, obj);
|
|
|
|
objerr = dmu_tx_assign(tx, TXG_WAIT);
|
2013-09-04 16:00:57 +04:00
|
|
|
if (objerr != 0) {
|
2017-08-30 22:09:18 +03:00
|
|
|
dmu_buf_rele(db, FTAG);
|
2009-07-03 02:44:48 +04:00
|
|
|
dmu_tx_abort(tx);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
dmu_buf_will_dirty(db, tx);
|
|
|
|
dmu_buf_rele(db, FTAG);
|
|
|
|
dmu_tx_commit(tx);
|
|
|
|
}
|
2016-10-04 21:46:10 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2020-11-16 20:10:29 +03:00
|
|
|
static int
|
|
|
|
dmu_objset_userspace_upgrade_cb(objset_t *os)
|
2016-10-04 21:46:10 +03:00
|
|
|
{
|
|
|
|
int err = 0;
|
|
|
|
|
|
|
|
if (dmu_objset_userspace_present(os))
|
|
|
|
return (0);
|
|
|
|
if (dmu_objset_is_snapshot(os))
|
|
|
|
return (SET_ERROR(EINVAL));
|
|
|
|
if (!dmu_objset_userused_enabled(os))
|
|
|
|
return (SET_ERROR(ENOTSUP));
|
|
|
|
|
|
|
|
err = dmu_objset_space_upgrade(os);
|
|
|
|
if (err)
|
|
|
|
return (err);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
os->os_flags |= OBJSET_FLAG_USERACCOUNTING_COMPLETE;
|
2009-07-03 02:44:48 +04:00
|
|
|
txg_wait_synced(dmu_objset_pool(os), 0);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2020-11-16 20:10:29 +03:00
|
|
|
void
|
|
|
|
dmu_objset_userspace_upgrade(objset_t *os)
|
|
|
|
{
|
|
|
|
dmu_objset_upgrade(os, dmu_objset_userspace_upgrade_cb);
|
|
|
|
}
|
|
|
|
|
2016-10-04 21:46:10 +03:00
|
|
|
static int
|
2018-02-14 01:54:54 +03:00
|
|
|
dmu_objset_id_quota_upgrade_cb(objset_t *os)
|
2016-10-04 21:46:10 +03:00
|
|
|
{
|
|
|
|
int err = 0;
|
|
|
|
|
2018-02-14 01:54:54 +03:00
|
|
|
if (dmu_objset_userobjspace_present(os) &&
|
|
|
|
dmu_objset_projectquota_present(os))
|
2016-10-04 21:46:10 +03:00
|
|
|
return (0);
|
|
|
|
if (dmu_objset_is_snapshot(os))
|
|
|
|
return (SET_ERROR(EINVAL));
|
2020-11-16 20:10:29 +03:00
|
|
|
if (!dmu_objset_userused_enabled(os))
|
2016-10-04 21:46:10 +03:00
|
|
|
return (SET_ERROR(ENOTSUP));
|
2018-02-14 01:54:54 +03:00
|
|
|
if (!dmu_objset_projectquota_enabled(os) &&
|
|
|
|
dmu_objset_userobjspace_present(os))
|
|
|
|
return (SET_ERROR(ENOTSUP));
|
2016-10-04 21:46:10 +03:00
|
|
|
|
2023-02-21 20:36:22 +03:00
|
|
|
if (dmu_objset_userobjused_enabled(os))
|
|
|
|
dmu_objset_ds(os)->ds_feature_activation[
|
|
|
|
SPA_FEATURE_USEROBJ_ACCOUNTING] = (void *)B_TRUE;
|
|
|
|
if (dmu_objset_projectquota_enabled(os))
|
|
|
|
dmu_objset_ds(os)->ds_feature_activation[
|
|
|
|
SPA_FEATURE_PROJECT_QUOTA] = (void *)B_TRUE;
|
|
|
|
|
2016-10-04 21:46:10 +03:00
|
|
|
err = dmu_objset_space_upgrade(os);
|
|
|
|
if (err)
|
|
|
|
return (err);
|
|
|
|
|
2020-11-16 20:10:29 +03:00
|
|
|
os->os_flags |= OBJSET_FLAG_USERACCOUNTING_COMPLETE;
|
|
|
|
if (dmu_objset_userobjused_enabled(os))
|
|
|
|
os->os_flags |= OBJSET_FLAG_USEROBJACCOUNTING_COMPLETE;
|
2018-02-14 01:54:54 +03:00
|
|
|
if (dmu_objset_projectquota_enabled(os))
|
|
|
|
os->os_flags |= OBJSET_FLAG_PROJECTQUOTA_COMPLETE;
|
|
|
|
|
2016-10-04 21:46:10 +03:00
|
|
|
txg_wait_synced(dmu_objset_pool(os), 0);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
2018-02-14 01:54:54 +03:00
|
|
|
dmu_objset_id_quota_upgrade(objset_t *os)
|
2016-10-04 21:46:10 +03:00
|
|
|
{
|
2018-02-14 01:54:54 +03:00
|
|
|
dmu_objset_upgrade(os, dmu_objset_id_quota_upgrade_cb);
|
2016-10-04 21:46:10 +03:00
|
|
|
}
|
|
|
|
|
2016-11-10 00:51:12 +03:00
|
|
|
boolean_t
|
|
|
|
dmu_objset_userobjspace_upgradable(objset_t *os)
|
|
|
|
{
|
|
|
|
return (dmu_objset_type(os) == DMU_OST_ZFS &&
|
|
|
|
!dmu_objset_is_snapshot(os) &&
|
|
|
|
dmu_objset_userobjused_enabled(os) &&
|
2019-02-20 05:41:18 +03:00
|
|
|
!dmu_objset_userobjspace_present(os) &&
|
|
|
|
spa_writeable(dmu_objset_spa(os)));
|
2016-11-10 00:51:12 +03:00
|
|
|
}
|
|
|
|
|
2018-02-14 01:54:54 +03:00
|
|
|
boolean_t
|
|
|
|
dmu_objset_projectquota_upgradable(objset_t *os)
|
|
|
|
{
|
|
|
|
return (dmu_objset_type(os) == DMU_OST_ZFS &&
|
|
|
|
!dmu_objset_is_snapshot(os) &&
|
|
|
|
dmu_objset_projectquota_enabled(os) &&
|
2019-02-20 05:41:18 +03:00
|
|
|
!dmu_objset_projectquota_present(os) &&
|
|
|
|
spa_writeable(dmu_objset_spa(os)));
|
2018-02-14 01:54:54 +03:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
void
|
|
|
|
dmu_objset_space(objset_t *os, uint64_t *refdbytesp, uint64_t *availbytesp,
|
|
|
|
uint64_t *usedobjsp, uint64_t *availobjsp)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
dsl_dataset_space(os->os_dsl_dataset, refdbytesp, availbytesp,
|
2008-11-20 23:01:55 +03:00
|
|
|
usedobjsp, availobjsp);
|
|
|
|
}
|
|
|
|
|
|
|
|
uint64_t
|
|
|
|
dmu_objset_fsid_guid(objset_t *os)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
return (dsl_dataset_fsid_guid(os->os_dsl_dataset));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
dmu_objset_fast_stat(objset_t *os, dmu_objset_stats_t *stat)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
stat->dds_type = os->os_phys->os_type;
|
|
|
|
if (os->os_dsl_dataset)
|
|
|
|
dsl_dataset_fast_stat(os->os_dsl_dataset, stat);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
dmu_objset_stats(objset_t *os, nvlist_t *nv)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(os->os_dsl_dataset ||
|
|
|
|
os->os_phys->os_type == DMU_OST_META);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (os->os_dsl_dataset != NULL)
|
|
|
|
dsl_dataset_stats(os->os_dsl_dataset, nv);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_TYPE,
|
2010-05-29 00:45:14 +04:00
|
|
|
os->os_phys->os_type);
|
2009-07-03 02:44:48 +04:00
|
|
|
dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_USERACCOUNTING,
|
|
|
|
dmu_objset_userspace_present(os));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
int
|
|
|
|
dmu_objset_is_snapshot(objset_t *os)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
if (os->os_dsl_dataset != NULL)
|
2015-04-02 06:44:32 +03:00
|
|
|
return (os->os_dsl_dataset->ds_is_snapshot);
|
2008-11-20 23:01:55 +03:00
|
|
|
else
|
|
|
|
return (B_FALSE);
|
|
|
|
}
|
|
|
|
|
|
|
|
int
|
2020-10-03 03:44:10 +03:00
|
|
|
dmu_snapshot_realname(objset_t *os, const char *name, char *real, int maxlen,
|
2008-11-20 23:01:55 +03:00
|
|
|
boolean_t *conflict)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
dsl_dataset_t *ds = os->os_dsl_dataset;
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t ignored;
|
|
|
|
|
2015-04-01 18:14:34 +03:00
|
|
|
if (dsl_dataset_phys(ds)->ds_snapnames_zapobj == 0)
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(ENOENT));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
return (zap_lookup_norm(ds->ds_dir->dd_pool->dp_meta_objset,
|
2015-04-01 18:14:34 +03:00
|
|
|
dsl_dataset_phys(ds)->ds_snapnames_zapobj, name, 8, 1, &ignored,
|
2017-02-03 01:13:41 +03:00
|
|
|
MT_NORMALIZE, real, maxlen, conflict));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
int
|
|
|
|
dmu_snapshot_list_next(objset_t *os, int namelen, char *name,
|
|
|
|
uint64_t *idp, uint64_t *offp, boolean_t *case_conflict)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
dsl_dataset_t *ds = os->os_dsl_dataset;
|
2008-11-20 23:01:55 +03:00
|
|
|
zap_cursor_t cursor;
|
|
|
|
zap_attribute_t attr;
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
ASSERT(dsl_pool_config_held(dmu_objset_pool(os)));
|
|
|
|
|
2015-04-01 18:14:34 +03:00
|
|
|
if (dsl_dataset_phys(ds)->ds_snapnames_zapobj == 0)
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(ENOENT));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
zap_cursor_init_serialized(&cursor,
|
|
|
|
ds->ds_dir->dd_pool->dp_meta_objset,
|
2015-04-01 18:14:34 +03:00
|
|
|
dsl_dataset_phys(ds)->ds_snapnames_zapobj, *offp);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (zap_cursor_retrieve(&cursor, &attr) != 0) {
|
|
|
|
zap_cursor_fini(&cursor);
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(ENOENT));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
if (strlen(attr.za_name) + 1 > namelen) {
|
|
|
|
zap_cursor_fini(&cursor);
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(ENAMETOOLONG));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2020-06-07 21:42:12 +03:00
|
|
|
(void) strlcpy(name, attr.za_name, namelen);
|
2008-11-20 23:01:55 +03:00
|
|
|
if (idp)
|
|
|
|
*idp = attr.za_first_integer;
|
|
|
|
if (case_conflict)
|
|
|
|
*case_conflict = attr.za_normalization_conflict;
|
|
|
|
zap_cursor_advance(&cursor);
|
|
|
|
*offp = zap_cursor_serialize(&cursor);
|
|
|
|
zap_cursor_fini(&cursor);
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2011-11-11 11:15:53 +04:00
|
|
|
int
|
2013-01-26 02:57:53 +04:00
|
|
|
dmu_snapshot_lookup(objset_t *os, const char *name, uint64_t *value)
|
2011-11-11 11:15:53 +04:00
|
|
|
{
|
2013-11-01 23:26:11 +04:00
|
|
|
return (dsl_dataset_snap_lookup(os->os_dsl_dataset, name, value));
|
2011-11-11 11:15:53 +04:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
int
|
|
|
|
dmu_dir_list_next(objset_t *os, int namelen, char *name,
|
|
|
|
uint64_t *idp, uint64_t *offp)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
dsl_dir_t *dd = os->os_dsl_dataset->ds_dir;
|
2008-11-20 23:01:55 +03:00
|
|
|
zap_cursor_t cursor;
|
|
|
|
zap_attribute_t attr;
|
|
|
|
|
|
|
|
/* there is no next dir on a snapshot! */
|
2010-05-29 00:45:14 +04:00
|
|
|
if (os->os_dsl_dataset->ds_object !=
|
2015-04-01 18:14:34 +03:00
|
|
|
dsl_dir_phys(dd)->dd_head_dataset_obj)
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(ENOENT));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
zap_cursor_init_serialized(&cursor,
|
|
|
|
dd->dd_pool->dp_meta_objset,
|
2015-04-01 18:14:34 +03:00
|
|
|
dsl_dir_phys(dd)->dd_child_dir_zapobj, *offp);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (zap_cursor_retrieve(&cursor, &attr) != 0) {
|
|
|
|
zap_cursor_fini(&cursor);
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(ENOENT));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
if (strlen(attr.za_name) + 1 > namelen) {
|
|
|
|
zap_cursor_fini(&cursor);
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(ENAMETOOLONG));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2020-06-07 21:42:12 +03:00
|
|
|
(void) strlcpy(name, attr.za_name, namelen);
|
2008-11-20 23:01:55 +03:00
|
|
|
if (idp)
|
|
|
|
*idp = attr.za_first_integer;
|
|
|
|
zap_cursor_advance(&cursor);
|
|
|
|
*offp = zap_cursor_serialize(&cursor);
|
|
|
|
zap_cursor_fini(&cursor);
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2015-05-06 19:07:55 +03:00
|
|
|
typedef struct dmu_objset_find_ctx {
|
|
|
|
taskq_t *dc_tq;
|
|
|
|
dsl_pool_t *dc_dp;
|
|
|
|
uint64_t dc_ddobj;
|
2017-01-26 23:46:02 +03:00
|
|
|
char *dc_ddname; /* last component of ddobj's name */
|
2015-05-06 19:07:55 +03:00
|
|
|
int (*dc_func)(dsl_pool_t *, dsl_dataset_t *, void *);
|
|
|
|
void *dc_arg;
|
|
|
|
int dc_flags;
|
|
|
|
kmutex_t *dc_error_lock;
|
|
|
|
int *dc_error;
|
|
|
|
} dmu_objset_find_ctx_t;
|
|
|
|
|
|
|
|
static void
|
|
|
|
dmu_objset_find_dp_impl(dmu_objset_find_ctx_t *dcp)
|
2008-12-03 23:09:06 +03:00
|
|
|
{
|
2015-05-06 19:07:55 +03:00
|
|
|
dsl_pool_t *dp = dcp->dc_dp;
|
2013-09-04 16:00:57 +04:00
|
|
|
dsl_dir_t *dd;
|
|
|
|
dsl_dataset_t *ds;
|
|
|
|
zap_cursor_t zc;
|
|
|
|
zap_attribute_t *attr;
|
|
|
|
uint64_t thisobj;
|
2015-05-06 19:07:55 +03:00
|
|
|
int err = 0;
|
2013-09-04 16:00:57 +04:00
|
|
|
|
2015-05-06 19:07:55 +03:00
|
|
|
/* don't process if there already was an error */
|
|
|
|
if (*dcp->dc_error != 0)
|
|
|
|
goto out;
|
2013-09-04 16:00:57 +04:00
|
|
|
|
2017-01-26 23:46:02 +03:00
|
|
|
/*
|
|
|
|
* Note: passing the name (dc_ddname) here is optional, but it
|
|
|
|
* improves performance because we don't need to call
|
|
|
|
* zap_value_search() to determine the name.
|
|
|
|
*/
|
|
|
|
err = dsl_dir_hold_obj(dp, dcp->dc_ddobj, dcp->dc_ddname, FTAG, &dd);
|
2013-09-04 16:00:57 +04:00
|
|
|
if (err != 0)
|
2015-05-06 19:07:55 +03:00
|
|
|
goto out;
|
2013-09-04 16:00:57 +04:00
|
|
|
|
|
|
|
/* Don't visit hidden ($MOS & $ORIGIN) objsets. */
|
|
|
|
if (dd->dd_myname[0] == '$') {
|
|
|
|
dsl_dir_rele(dd, FTAG);
|
2015-05-06 19:07:55 +03:00
|
|
|
goto out;
|
2013-09-04 16:00:57 +04:00
|
|
|
}
|
|
|
|
|
2015-04-01 18:14:34 +03:00
|
|
|
thisobj = dsl_dir_phys(dd)->dd_head_dataset_obj;
|
2014-11-21 03:09:39 +03:00
|
|
|
attr = kmem_alloc(sizeof (zap_attribute_t), KM_SLEEP);
|
2013-09-04 16:00:57 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Iterate over all children.
|
|
|
|
*/
|
2015-05-06 19:07:55 +03:00
|
|
|
if (dcp->dc_flags & DS_FIND_CHILDREN) {
|
2013-09-04 16:00:57 +04:00
|
|
|
for (zap_cursor_init(&zc, dp->dp_meta_objset,
|
2015-04-01 18:14:34 +03:00
|
|
|
dsl_dir_phys(dd)->dd_child_dir_zapobj);
|
2013-09-04 16:00:57 +04:00
|
|
|
zap_cursor_retrieve(&zc, attr) == 0;
|
|
|
|
(void) zap_cursor_advance(&zc)) {
|
|
|
|
ASSERT3U(attr->za_integer_length, ==,
|
|
|
|
sizeof (uint64_t));
|
|
|
|
ASSERT3U(attr->za_num_integers, ==, 1);
|
|
|
|
|
2017-11-04 23:25:13 +03:00
|
|
|
dmu_objset_find_ctx_t *child_dcp =
|
2017-01-26 23:46:02 +03:00
|
|
|
kmem_alloc(sizeof (*child_dcp), KM_SLEEP);
|
2015-05-06 19:07:55 +03:00
|
|
|
*child_dcp = *dcp;
|
|
|
|
child_dcp->dc_ddobj = attr->za_first_integer;
|
2017-01-26 23:46:02 +03:00
|
|
|
child_dcp->dc_ddname = spa_strdup(attr->za_name);
|
2015-05-06 19:07:55 +03:00
|
|
|
if (dcp->dc_tq != NULL)
|
|
|
|
(void) taskq_dispatch(dcp->dc_tq,
|
|
|
|
dmu_objset_find_dp_cb, child_dcp, TQ_SLEEP);
|
|
|
|
else
|
|
|
|
dmu_objset_find_dp_impl(child_dcp);
|
2013-09-04 16:00:57 +04:00
|
|
|
}
|
|
|
|
zap_cursor_fini(&zc);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Iterate over all snapshots.
|
|
|
|
*/
|
2015-05-06 19:07:55 +03:00
|
|
|
if (dcp->dc_flags & DS_FIND_SNAPSHOTS) {
|
2013-09-04 16:00:57 +04:00
|
|
|
dsl_dataset_t *ds;
|
|
|
|
err = dsl_dataset_hold_obj(dp, thisobj, FTAG, &ds);
|
|
|
|
|
|
|
|
if (err == 0) {
|
2015-04-01 18:14:34 +03:00
|
|
|
uint64_t snapobj;
|
|
|
|
|
|
|
|
snapobj = dsl_dataset_phys(ds)->ds_snapnames_zapobj;
|
2013-09-04 16:00:57 +04:00
|
|
|
dsl_dataset_rele(ds, FTAG);
|
|
|
|
|
|
|
|
for (zap_cursor_init(&zc, dp->dp_meta_objset, snapobj);
|
|
|
|
zap_cursor_retrieve(&zc, attr) == 0;
|
|
|
|
(void) zap_cursor_advance(&zc)) {
|
|
|
|
ASSERT3U(attr->za_integer_length, ==,
|
|
|
|
sizeof (uint64_t));
|
|
|
|
ASSERT3U(attr->za_num_integers, ==, 1);
|
|
|
|
|
|
|
|
err = dsl_dataset_hold_obj(dp,
|
|
|
|
attr->za_first_integer, FTAG, &ds);
|
|
|
|
if (err != 0)
|
|
|
|
break;
|
2015-05-06 19:07:55 +03:00
|
|
|
err = dcp->dc_func(dp, ds, dcp->dc_arg);
|
2013-09-04 16:00:57 +04:00
|
|
|
dsl_dataset_rele(ds, FTAG);
|
|
|
|
if (err != 0)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
zap_cursor_fini(&zc);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
kmem_free(attr, sizeof (zap_attribute_t));
|
|
|
|
|
2017-01-26 23:46:02 +03:00
|
|
|
if (err != 0) {
|
|
|
|
dsl_dir_rele(dd, FTAG);
|
2015-05-06 19:07:55 +03:00
|
|
|
goto out;
|
2017-01-26 23:46:02 +03:00
|
|
|
}
|
2013-09-04 16:00:57 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Apply to self.
|
|
|
|
*/
|
|
|
|
err = dsl_dataset_hold_obj(dp, thisobj, FTAG, &ds);
|
2017-01-26 23:46:02 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Note: we hold the dir while calling dsl_dataset_hold_obj() so
|
|
|
|
* that the dir will remain cached, and we won't have to re-instantiate
|
|
|
|
* it (which could be expensive due to finding its name via
|
|
|
|
* zap_value_search()).
|
|
|
|
*/
|
|
|
|
dsl_dir_rele(dd, FTAG);
|
2013-09-04 16:00:57 +04:00
|
|
|
if (err != 0)
|
2015-05-06 19:07:55 +03:00
|
|
|
goto out;
|
|
|
|
err = dcp->dc_func(dp, ds, dcp->dc_arg);
|
2013-09-04 16:00:57 +04:00
|
|
|
dsl_dataset_rele(ds, FTAG);
|
2015-05-06 19:07:55 +03:00
|
|
|
|
|
|
|
out:
|
|
|
|
if (err != 0) {
|
|
|
|
mutex_enter(dcp->dc_error_lock);
|
|
|
|
/* only keep first error */
|
|
|
|
if (*dcp->dc_error == 0)
|
|
|
|
*dcp->dc_error = err;
|
|
|
|
mutex_exit(dcp->dc_error_lock);
|
|
|
|
}
|
|
|
|
|
2017-01-26 23:46:02 +03:00
|
|
|
if (dcp->dc_ddname != NULL)
|
|
|
|
spa_strfree(dcp->dc_ddname);
|
2015-05-06 19:07:55 +03:00
|
|
|
kmem_free(dcp, sizeof (*dcp));
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
dmu_objset_find_dp_cb(void *arg)
|
|
|
|
{
|
|
|
|
dmu_objset_find_ctx_t *dcp = arg;
|
|
|
|
dsl_pool_t *dp = dcp->dc_dp;
|
|
|
|
|
2015-07-02 18:58:17 +03:00
|
|
|
/*
|
|
|
|
* We need to get a pool_config_lock here, as there are several
|
2019-09-03 03:56:41 +03:00
|
|
|
* assert(pool_config_held) down the stack. Getting a lock via
|
2015-07-02 18:58:17 +03:00
|
|
|
* dsl_pool_config_enter is risky, as it might be stalled by a
|
|
|
|
* pending writer. This would deadlock, as the write lock can
|
|
|
|
* only be granted when our parent thread gives up the lock.
|
|
|
|
* The _prio interface gives us priority over a pending writer.
|
|
|
|
*/
|
|
|
|
dsl_pool_config_enter_prio(dp, FTAG);
|
2015-05-06 19:07:55 +03:00
|
|
|
|
|
|
|
dmu_objset_find_dp_impl(dcp);
|
|
|
|
|
|
|
|
dsl_pool_config_exit(dp, FTAG);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Find objsets under and including ddobj, call func(ds) on each.
|
|
|
|
* The order for the enumeration is completely undefined.
|
|
|
|
* func is called with dsl_pool_config held.
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
dmu_objset_find_dp(dsl_pool_t *dp, uint64_t ddobj,
|
|
|
|
int func(dsl_pool_t *, dsl_dataset_t *, void *), void *arg, int flags)
|
|
|
|
{
|
|
|
|
int error = 0;
|
|
|
|
taskq_t *tq = NULL;
|
|
|
|
int ntasks;
|
|
|
|
dmu_objset_find_ctx_t *dcp;
|
|
|
|
kmutex_t err_lock;
|
|
|
|
|
|
|
|
mutex_init(&err_lock, NULL, MUTEX_DEFAULT, NULL);
|
|
|
|
dcp = kmem_alloc(sizeof (*dcp), KM_SLEEP);
|
|
|
|
dcp->dc_tq = NULL;
|
|
|
|
dcp->dc_dp = dp;
|
|
|
|
dcp->dc_ddobj = ddobj;
|
2017-01-26 23:46:02 +03:00
|
|
|
dcp->dc_ddname = NULL;
|
2015-05-06 19:07:55 +03:00
|
|
|
dcp->dc_func = func;
|
|
|
|
dcp->dc_arg = arg;
|
|
|
|
dcp->dc_flags = flags;
|
|
|
|
dcp->dc_error_lock = &err_lock;
|
|
|
|
dcp->dc_error = &error;
|
|
|
|
|
|
|
|
if ((flags & DS_FIND_SERIALIZE) || dsl_pool_config_held_writer(dp)) {
|
|
|
|
/*
|
|
|
|
* In case a write lock is held we can't make use of
|
|
|
|
* parallelism, as down the stack of the worker threads
|
|
|
|
* the lock is asserted via dsl_pool_config_held.
|
|
|
|
* In case of a read lock this is solved by getting a read
|
|
|
|
* lock in each worker thread, which isn't possible in case
|
|
|
|
* of a writer lock. So we fall back to the synchronous path
|
|
|
|
* here.
|
|
|
|
* In the future it might be possible to get some magic into
|
|
|
|
* dsl_pool_config_held in a way that it returns true for
|
|
|
|
* the worker threads so that a single lock held from this
|
|
|
|
* thread suffices. For now, stay single threaded.
|
|
|
|
*/
|
|
|
|
dmu_objset_find_dp_impl(dcp);
|
2016-01-27 04:54:45 +03:00
|
|
|
mutex_destroy(&err_lock);
|
2015-05-06 19:07:55 +03:00
|
|
|
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
|
|
|
ntasks = dmu_find_threads;
|
|
|
|
if (ntasks == 0)
|
|
|
|
ntasks = vdev_count_leaves(dp->dp_spa) * 4;
|
2015-07-24 20:08:31 +03:00
|
|
|
tq = taskq_create("dmu_objset_find", ntasks, maxclsyspri, ntasks,
|
2015-05-06 19:07:55 +03:00
|
|
|
INT_MAX, 0);
|
|
|
|
if (tq == NULL) {
|
|
|
|
kmem_free(dcp, sizeof (*dcp));
|
2016-01-27 04:54:45 +03:00
|
|
|
mutex_destroy(&err_lock);
|
|
|
|
|
2015-05-06 19:07:55 +03:00
|
|
|
return (SET_ERROR(ENOMEM));
|
|
|
|
}
|
|
|
|
dcp->dc_tq = tq;
|
|
|
|
|
|
|
|
/* dcp will be freed by task */
|
|
|
|
(void) taskq_dispatch(tq, dmu_objset_find_dp_cb, dcp, TQ_SLEEP);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* PORTING: this code relies on the property of taskq_wait to wait
|
|
|
|
* until no more tasks are queued and no more tasks are active. As
|
|
|
|
* we always queue new tasks from within other tasks, task_wait
|
|
|
|
* reliably waits for the full recursion to finish, even though we
|
|
|
|
* enqueue new tasks after taskq_wait has been called.
|
|
|
|
* On platforms other than illumos, taskq_wait may not have this
|
|
|
|
* property.
|
|
|
|
*/
|
|
|
|
taskq_wait(tq);
|
|
|
|
taskq_destroy(tq);
|
|
|
|
mutex_destroy(&err_lock);
|
|
|
|
|
|
|
|
return (error);
|
2008-12-03 23:09:06 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2013-09-04 16:00:57 +04:00
|
|
|
* Find all objsets under name, and for each, call 'func(child_name, arg)'.
|
|
|
|
* The dp_config_rwlock must not be held when this is called, and it
|
|
|
|
* will not be held when the callback is called.
|
|
|
|
* Therefore this function should only be used when the pool is not changing
|
|
|
|
* (e.g. in syncing context), or the callback can deal with the possible races.
|
2008-12-03 23:09:06 +03:00
|
|
|
*/
|
2013-09-04 16:00:57 +04:00
|
|
|
static int
|
|
|
|
dmu_objset_find_impl(spa_t *spa, const char *name,
|
|
|
|
int func(const char *, void *), void *arg, int flags)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
dsl_dir_t *dd;
|
2013-09-04 16:00:57 +04:00
|
|
|
dsl_pool_t *dp = spa_get_dsl(spa);
|
2008-12-03 23:09:06 +03:00
|
|
|
dsl_dataset_t *ds;
|
2008-11-20 23:01:55 +03:00
|
|
|
zap_cursor_t zc;
|
|
|
|
zap_attribute_t *attr;
|
|
|
|
char *child;
|
2008-12-03 23:09:06 +03:00
|
|
|
uint64_t thisobj;
|
|
|
|
int err;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
dsl_pool_config_enter(dp, FTAG);
|
|
|
|
|
|
|
|
err = dsl_dir_hold(dp, name, FTAG, &dd, NULL);
|
|
|
|
if (err != 0) {
|
|
|
|
dsl_pool_config_exit(dp, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
return (err);
|
2013-09-04 16:00:57 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2008-12-03 23:09:06 +03:00
|
|
|
/* Don't visit hidden ($MOS & $ORIGIN) objsets. */
|
|
|
|
if (dd->dd_myname[0] == '$') {
|
2013-09-04 16:00:57 +04:00
|
|
|
dsl_dir_rele(dd, FTAG);
|
|
|
|
dsl_pool_config_exit(dp, FTAG);
|
2008-12-03 23:09:06 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2015-04-01 18:14:34 +03:00
|
|
|
thisobj = dsl_dir_phys(dd)->dd_head_dataset_obj;
|
2014-11-21 03:09:39 +03:00
|
|
|
attr = kmem_alloc(sizeof (zap_attribute_t), KM_SLEEP);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Iterate over all children.
|
|
|
|
*/
|
|
|
|
if (flags & DS_FIND_CHILDREN) {
|
2008-12-03 23:09:06 +03:00
|
|
|
for (zap_cursor_init(&zc, dp->dp_meta_objset,
|
2015-04-01 18:14:34 +03:00
|
|
|
dsl_dir_phys(dd)->dd_child_dir_zapobj);
|
2008-11-20 23:01:55 +03:00
|
|
|
zap_cursor_retrieve(&zc, attr) == 0;
|
|
|
|
(void) zap_cursor_advance(&zc)) {
|
2013-09-04 16:00:57 +04:00
|
|
|
ASSERT3U(attr->za_integer_length, ==,
|
|
|
|
sizeof (uint64_t));
|
|
|
|
ASSERT3U(attr->za_num_integers, ==, 1);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
child = kmem_asprintf("%s/%s", name, attr->za_name);
|
2013-09-04 16:00:57 +04:00
|
|
|
dsl_pool_config_exit(dp, FTAG);
|
|
|
|
err = dmu_objset_find_impl(spa, child,
|
|
|
|
func, arg, flags);
|
|
|
|
dsl_pool_config_enter(dp, FTAG);
|
2019-10-10 19:47:06 +03:00
|
|
|
kmem_strfree(child);
|
2013-09-04 16:00:57 +04:00
|
|
|
if (err != 0)
|
2008-11-20 23:01:55 +03:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
zap_cursor_fini(&zc);
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
if (err != 0) {
|
|
|
|
dsl_dir_rele(dd, FTAG);
|
|
|
|
dsl_pool_config_exit(dp, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
kmem_free(attr, sizeof (zap_attribute_t));
|
|
|
|
return (err);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Iterate over all snapshots.
|
|
|
|
*/
|
2008-12-03 23:09:06 +03:00
|
|
|
if (flags & DS_FIND_SNAPSHOTS) {
|
|
|
|
err = dsl_dataset_hold_obj(dp, thisobj, FTAG, &ds);
|
|
|
|
|
|
|
|
if (err == 0) {
|
2015-04-01 18:14:34 +03:00
|
|
|
uint64_t snapobj;
|
|
|
|
|
|
|
|
snapobj = dsl_dataset_phys(ds)->ds_snapnames_zapobj;
|
2008-12-03 23:09:06 +03:00
|
|
|
dsl_dataset_rele(ds, FTAG);
|
|
|
|
|
|
|
|
for (zap_cursor_init(&zc, dp->dp_meta_objset, snapobj);
|
|
|
|
zap_cursor_retrieve(&zc, attr) == 0;
|
|
|
|
(void) zap_cursor_advance(&zc)) {
|
2013-09-04 16:00:57 +04:00
|
|
|
ASSERT3U(attr->za_integer_length, ==,
|
2008-12-03 23:09:06 +03:00
|
|
|
sizeof (uint64_t));
|
2013-09-04 16:00:57 +04:00
|
|
|
ASSERT3U(attr->za_num_integers, ==, 1);
|
2008-12-03 23:09:06 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
child = kmem_asprintf("%s@%s",
|
|
|
|
name, attr->za_name);
|
2013-09-04 16:00:57 +04:00
|
|
|
dsl_pool_config_exit(dp, FTAG);
|
|
|
|
err = func(child, arg);
|
|
|
|
dsl_pool_config_enter(dp, FTAG);
|
2019-10-10 19:47:06 +03:00
|
|
|
kmem_strfree(child);
|
2013-09-04 16:00:57 +04:00
|
|
|
if (err != 0)
|
2008-12-03 23:09:06 +03:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
zap_cursor_fini(&zc);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
dsl_dir_rele(dd, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
kmem_free(attr, sizeof (zap_attribute_t));
|
2013-09-04 16:00:57 +04:00
|
|
|
dsl_pool_config_exit(dp, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
if (err != 0)
|
2008-11-20 23:01:55 +03:00
|
|
|
return (err);
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
/* Apply to self. */
|
|
|
|
return (func(name, arg));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
/*
|
|
|
|
* See comment above dmu_objset_find_impl().
|
|
|
|
*/
|
2009-02-18 23:51:31 +03:00
|
|
|
int
|
2019-09-25 19:20:30 +03:00
|
|
|
dmu_objset_find(const char *name, int func(const char *, void *), void *arg,
|
2013-09-04 16:00:57 +04:00
|
|
|
int flags)
|
2009-02-18 23:51:31 +03:00
|
|
|
{
|
2013-09-04 16:00:57 +04:00
|
|
|
spa_t *spa;
|
|
|
|
int error;
|
2009-02-18 23:51:31 +03:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
error = spa_open(name, &spa, FTAG);
|
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
error = dmu_objset_find_impl(spa, name, func, arg, flags);
|
|
|
|
spa_close(spa, FTAG);
|
|
|
|
return (error);
|
2009-02-18 23:51:31 +03:00
|
|
|
}
|
|
|
|
|
2017-11-08 22:12:59 +03:00
|
|
|
boolean_t
|
|
|
|
dmu_objset_incompatible_encryption_version(objset_t *os)
|
|
|
|
{
|
|
|
|
return (dsl_dir_incompatible_encryption_version(
|
|
|
|
os->os_dsl_dataset->ds_dir));
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
void
|
|
|
|
dmu_objset_set_user(objset_t *os, void *user_ptr)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(MUTEX_HELD(&os->os_user_ptr_lock));
|
|
|
|
os->os_user_ptr = user_ptr;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
void *
|
|
|
|
dmu_objset_get_user(objset_t *os)
|
|
|
|
{
|
2010-05-29 00:45:14 +04:00
|
|
|
ASSERT(MUTEX_HELD(&os->os_user_ptr_lock));
|
|
|
|
return (os->os_user_ptr);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2010-08-26 22:49:16 +04:00
|
|
|
|
2013-09-04 16:00:57 +04:00
|
|
|
/*
|
|
|
|
* Determine name of filesystem, given name of snapshot.
|
2016-06-16 00:28:36 +03:00
|
|
|
* buf must be at least ZFS_MAX_DATASET_NAME_LEN bytes
|
2013-09-04 16:00:57 +04:00
|
|
|
*/
|
|
|
|
int
|
|
|
|
dmu_fsname(const char *snapname, char *buf)
|
|
|
|
{
|
|
|
|
char *atp = strchr(snapname, '@');
|
|
|
|
if (atp == NULL)
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(EINVAL));
|
2016-06-16 00:28:36 +03:00
|
|
|
if (atp - snapname >= ZFS_MAX_DATASET_NAME_LEN)
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(ENAMETOOLONG));
|
2013-09-04 16:00:57 +04:00
|
|
|
(void) strlcpy(buf, snapname, atp - snapname + 1);
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
OpenZFS 7793 - ztest fails assertion in dmu_tx_willuse_space
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Background information: This assertion about tx_space_* verifies that we
are not dirtying more stuff than we thought we would. We “need” to know
how much we will dirty so that we can check if we should fail this
transaction with ENOSPC/EDQUOT, in dmu_tx_assign(). While the
transaction is open (i.e. between dmu_tx_assign() and dmu_tx_commit() —
typically less than a millisecond), we call dbuf_dirty() on the exact
blocks that will be modified. Once this happens, the temporary
accounting in tx_space_* is unnecessary, because we know exactly what
blocks are newly dirtied; we call dnode_willuse_space() to track this
more exact accounting.
The fundamental problem causing this bug is that dmu_tx_hold_*() relies
on the current state in the DMU (e.g. dn_nlevels) to predict how much
will be dirtied by this transaction, but this state can change before we
actually perform the transaction (i.e. call dbuf_dirty()).
This bug will be fixed by removing the assertion that the tx_space_*
accounting is perfectly accurate (i.e. we never dirty more than was
predicted by dmu_tx_hold_*()). By removing the requirement that this
accounting be perfectly accurate, we can also vastly simplify it, e.g.
removing most of the logic in dmu_tx_count_*().
The new tx space accounting will be very approximate, and may be more or
less than what is actually dirtied. It will still be used to determine
if this transaction will put us over quota. Transactions that are marked
by dmu_tx_mark_netfree() will be excepted from this check. We won’t make
an attempt to determine how much space will be freed by the transaction
— this was rarely accurate enough to determine if a transaction should
be permitted when we are over quota, which is why dmu_tx_mark_netfree()
was introduced in 2014.
We also won’t attempt to give “credit” when overwriting existing blocks,
if those blocks may be freed. This allows us to remove the
do_free_accounting logic in dbuf_dirty(), and associated routines. This
logic attempted to predict what will be on disk when this txg syncs, to
know if the overwritten block will be freed (i.e. exists, and has no
snapshots).
OpenZFS-issue: https://www.illumos.org/issues/7793
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3704e0a
Upstream bugs: DLPX-32883a
Closes #5804
Porting notes:
- DNODE_SIZE replaced with DNODE_MIN_SIZE in dmu_tx_count_dnode(),
Using the default dnode size would be slightly better.
- DEBUG_DMU_TX wrappers and configure option removed.
- Resolved _by_dnode() conflicts these changes have not yet been
applied to OpenZFS.
2017-03-07 20:51:59 +03:00
|
|
|
/*
|
dmu_tx_wait() hang likely due to cv_signal() in dsl_pool_dirty_delta()
Even though the bug's writeup (Github issue #9136) is very detailed,
we still don't know exactly how we got to that state, thus I wasn't
able to reproduce the bug. That said, we can make an educated guess
combining the information on filled issue with the code.
From the fact that `dp_dirty_total` was 0 (which is less than
`zfs_dirty_data_max`) we know that there was one thread that set it to
0 and then signaled one of the waiters of `dp_spaceavail_cv` [see
`dsl_pool_dirty_delta()` which is also the only place that
`dp_dirty_total` is changed]. Thus, the only logical explaination
then for the bug being hit is that the waiter that just got awaken
didn't go through `dsl_pool_dirty_data()`. Given that this function
is only called by `dsl_pool_dirty_space()` or `dsl_pool_undirty_space()`
I can only think of two possible ways of the above scenario happening:
[1] The waiter didn't call into any of the two functions - which I
find highly unlikely (i.e. why wait on `dp_spaceavail_cv` to begin
with?).
[2] The waiter did call in one of the above function but it passed 0 as
the space/delta to be dirtied (or undirtied) and then the callee
returned immediately (e.g both `dsl_pool_dirty_space()` and
`dsl_pool_undirty_space()` return immediately when space is 0).
In any case and no matter how we got there, the easy fix would be to
just broadcast to all waiters whenever `dp_dirty_total` hits 0. That
said and given that we've never hit this before, it would make sense
to think more on why the above situation occured.
Attempting to mimic what Prakash was doing in the issue filed, I
created a dataset with `sync=always` and started doing contiguous
writes in a file within that dataset. I observed with DTrace that even
though we update the pool's dirty data accounting when we would dirty
stuff, the accounting wouldn't be decremented incrementally as we were
done with the ZIOs of those writes (the reason being that
`dbuf_write_physdone()` isn't be called as we go through the override
code paths, and thus `dsl_pool_undirty_space()` is never called). As a
result we'd have to wait until we get to `dsl_pool_sync()` where we
zero out all dirty data accounting for the pool and the current TXG's
metadata.
In addition, as Matt noted and I later verified, the same issue would
arise when using dedup.
In both cases (sync & dedup) we shouldn't have to wait until
`dsl_pool_sync()` zeros out the accounting data. According to the
comment in that part of the code, the reasons why we do the zeroing,
have nothing to do with what we observe:
````
/*
* We have written all of the accounted dirty data, so our
* dp_space_towrite should now be zero. However, some seldom-used
* code paths do not adhere to this (e.g. dbuf_undirty(), also
* rounding error in dbuf_write_physdone).
* Shore up the accounting of any dirtied space now.
*/
dsl_pool_undirty_space(dp, dp->dp_dirty_pertxg[txg & TXG_MASK], txg);
````
Ideally what we want to do is to undirty in the accounting exactly what
we dirty (I use the word ideally as we can still have rounding errors).
This would make the behavior of the system more clear and predictable.
Another interesting issue that I observed with DTrace was that we
wouldn't update any of the pool's dirty data accounting whenever we
would dirty and/or undirty MOS data. In addition, every time we would
change the size of a dbuf through `dbuf_new_size()` we wouldn't update
the accounted space dirtied in the appropriate dirty record, so when
ZIOs are done we would undirty less that we dirtied from the pool's
accounting point of view.
For the first two issues observed (sync & dedup) this patch ensures
that we still update the pool's accounting when we undirty data,
regardless of the write being physical or not.
For changes in the MOS, we first ensure to zero out the pool's dirty
data accounting in `dsl_pool_sync()` after we synced the MOS. Then we
can go ahead and enable the update of the pool's dirty data accounting
wheneve we change MOS data.
Another fix is that we now update the accounting explicitly for
counting errors in `dbuf_write_done()`.
Finally, `dbuf_new_size()` updates the accounted space of the
appropriate dirty record correctly now.
The problem is that we still don't know how the bug came up in the
issue filled. That said the issues fixed seem to be very relevant, so
instead of going with the broadcasting solution right away,
I decided to leave this patch as is.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
External-issue: DLPX-47285
Closes #9137
2019-08-16 02:53:53 +03:00
|
|
|
* Call when we think we're going to write/free space in open context
|
|
|
|
* to track the amount of dirty data in the open txg, which is also the
|
|
|
|
* amount of memory that can not be evicted until this txg syncs.
|
|
|
|
*
|
|
|
|
* Note that there are two conditions where this can be called from
|
|
|
|
* syncing context:
|
|
|
|
*
|
|
|
|
* [1] When we just created the dataset, in which case we go on with
|
|
|
|
* updating any accounting of dirty data as usual.
|
|
|
|
* [2] When we are dirtying MOS data, in which case we only update the
|
|
|
|
* pool's accounting of dirty data.
|
OpenZFS 7793 - ztest fails assertion in dmu_tx_willuse_space
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Background information: This assertion about tx_space_* verifies that we
are not dirtying more stuff than we thought we would. We “need” to know
how much we will dirty so that we can check if we should fail this
transaction with ENOSPC/EDQUOT, in dmu_tx_assign(). While the
transaction is open (i.e. between dmu_tx_assign() and dmu_tx_commit() —
typically less than a millisecond), we call dbuf_dirty() on the exact
blocks that will be modified. Once this happens, the temporary
accounting in tx_space_* is unnecessary, because we know exactly what
blocks are newly dirtied; we call dnode_willuse_space() to track this
more exact accounting.
The fundamental problem causing this bug is that dmu_tx_hold_*() relies
on the current state in the DMU (e.g. dn_nlevels) to predict how much
will be dirtied by this transaction, but this state can change before we
actually perform the transaction (i.e. call dbuf_dirty()).
This bug will be fixed by removing the assertion that the tx_space_*
accounting is perfectly accurate (i.e. we never dirty more than was
predicted by dmu_tx_hold_*()). By removing the requirement that this
accounting be perfectly accurate, we can also vastly simplify it, e.g.
removing most of the logic in dmu_tx_count_*().
The new tx space accounting will be very approximate, and may be more or
less than what is actually dirtied. It will still be used to determine
if this transaction will put us over quota. Transactions that are marked
by dmu_tx_mark_netfree() will be excepted from this check. We won’t make
an attempt to determine how much space will be freed by the transaction
— this was rarely accurate enough to determine if a transaction should
be permitted when we are over quota, which is why dmu_tx_mark_netfree()
was introduced in 2014.
We also won’t attempt to give “credit” when overwriting existing blocks,
if those blocks may be freed. This allows us to remove the
do_free_accounting logic in dbuf_dirty(), and associated routines. This
logic attempted to predict what will be on disk when this txg syncs, to
know if the overwritten block will be freed (i.e. exists, and has no
snapshots).
OpenZFS-issue: https://www.illumos.org/issues/7793
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3704e0a
Upstream bugs: DLPX-32883a
Closes #5804
Porting notes:
- DNODE_SIZE replaced with DNODE_MIN_SIZE in dmu_tx_count_dnode(),
Using the default dnode size would be slightly better.
- DEBUG_DMU_TX wrappers and configure option removed.
- Resolved _by_dnode() conflicts these changes have not yet been
applied to OpenZFS.
2017-03-07 20:51:59 +03:00
|
|
|
*/
|
|
|
|
void
|
|
|
|
dmu_objset_willuse_space(objset_t *os, int64_t space, dmu_tx_t *tx)
|
|
|
|
{
|
|
|
|
dsl_dataset_t *ds = os->os_dsl_dataset;
|
|
|
|
int64_t aspace = spa_get_worst_case_asize(os->os_spa, space);
|
|
|
|
|
|
|
|
if (ds != NULL) {
|
|
|
|
dsl_dir_willuse_space(ds->ds_dir, aspace, tx);
|
|
|
|
}
|
dmu_tx_wait() hang likely due to cv_signal() in dsl_pool_dirty_delta()
Even though the bug's writeup (Github issue #9136) is very detailed,
we still don't know exactly how we got to that state, thus I wasn't
able to reproduce the bug. That said, we can make an educated guess
combining the information on filled issue with the code.
From the fact that `dp_dirty_total` was 0 (which is less than
`zfs_dirty_data_max`) we know that there was one thread that set it to
0 and then signaled one of the waiters of `dp_spaceavail_cv` [see
`dsl_pool_dirty_delta()` which is also the only place that
`dp_dirty_total` is changed]. Thus, the only logical explaination
then for the bug being hit is that the waiter that just got awaken
didn't go through `dsl_pool_dirty_data()`. Given that this function
is only called by `dsl_pool_dirty_space()` or `dsl_pool_undirty_space()`
I can only think of two possible ways of the above scenario happening:
[1] The waiter didn't call into any of the two functions - which I
find highly unlikely (i.e. why wait on `dp_spaceavail_cv` to begin
with?).
[2] The waiter did call in one of the above function but it passed 0 as
the space/delta to be dirtied (or undirtied) and then the callee
returned immediately (e.g both `dsl_pool_dirty_space()` and
`dsl_pool_undirty_space()` return immediately when space is 0).
In any case and no matter how we got there, the easy fix would be to
just broadcast to all waiters whenever `dp_dirty_total` hits 0. That
said and given that we've never hit this before, it would make sense
to think more on why the above situation occured.
Attempting to mimic what Prakash was doing in the issue filed, I
created a dataset with `sync=always` and started doing contiguous
writes in a file within that dataset. I observed with DTrace that even
though we update the pool's dirty data accounting when we would dirty
stuff, the accounting wouldn't be decremented incrementally as we were
done with the ZIOs of those writes (the reason being that
`dbuf_write_physdone()` isn't be called as we go through the override
code paths, and thus `dsl_pool_undirty_space()` is never called). As a
result we'd have to wait until we get to `dsl_pool_sync()` where we
zero out all dirty data accounting for the pool and the current TXG's
metadata.
In addition, as Matt noted and I later verified, the same issue would
arise when using dedup.
In both cases (sync & dedup) we shouldn't have to wait until
`dsl_pool_sync()` zeros out the accounting data. According to the
comment in that part of the code, the reasons why we do the zeroing,
have nothing to do with what we observe:
````
/*
* We have written all of the accounted dirty data, so our
* dp_space_towrite should now be zero. However, some seldom-used
* code paths do not adhere to this (e.g. dbuf_undirty(), also
* rounding error in dbuf_write_physdone).
* Shore up the accounting of any dirtied space now.
*/
dsl_pool_undirty_space(dp, dp->dp_dirty_pertxg[txg & TXG_MASK], txg);
````
Ideally what we want to do is to undirty in the accounting exactly what
we dirty (I use the word ideally as we can still have rounding errors).
This would make the behavior of the system more clear and predictable.
Another interesting issue that I observed with DTrace was that we
wouldn't update any of the pool's dirty data accounting whenever we
would dirty and/or undirty MOS data. In addition, every time we would
change the size of a dbuf through `dbuf_new_size()` we wouldn't update
the accounted space dirtied in the appropriate dirty record, so when
ZIOs are done we would undirty less that we dirtied from the pool's
accounting point of view.
For the first two issues observed (sync & dedup) this patch ensures
that we still update the pool's accounting when we undirty data,
regardless of the write being physical or not.
For changes in the MOS, we first ensure to zero out the pool's dirty
data accounting in `dsl_pool_sync()` after we synced the MOS. Then we
can go ahead and enable the update of the pool's dirty data accounting
wheneve we change MOS data.
Another fix is that we now update the accounting explicitly for
counting errors in `dbuf_write_done()`.
Finally, `dbuf_new_size()` updates the accounted space of the
appropriate dirty record correctly now.
The problem is that we still don't know how the bug came up in the
issue filled. That said the issues fixed seem to be very relevant, so
instead of going with the broadcasting solution right away,
I decided to leave this patch as is.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
External-issue: DLPX-47285
Closes #9137
2019-08-16 02:53:53 +03:00
|
|
|
|
|
|
|
dsl_pool_dirty_space(dmu_tx_pool(tx), space, tx);
|
OpenZFS 7793 - ztest fails assertion in dmu_tx_willuse_space
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Background information: This assertion about tx_space_* verifies that we
are not dirtying more stuff than we thought we would. We “need” to know
how much we will dirty so that we can check if we should fail this
transaction with ENOSPC/EDQUOT, in dmu_tx_assign(). While the
transaction is open (i.e. between dmu_tx_assign() and dmu_tx_commit() —
typically less than a millisecond), we call dbuf_dirty() on the exact
blocks that will be modified. Once this happens, the temporary
accounting in tx_space_* is unnecessary, because we know exactly what
blocks are newly dirtied; we call dnode_willuse_space() to track this
more exact accounting.
The fundamental problem causing this bug is that dmu_tx_hold_*() relies
on the current state in the DMU (e.g. dn_nlevels) to predict how much
will be dirtied by this transaction, but this state can change before we
actually perform the transaction (i.e. call dbuf_dirty()).
This bug will be fixed by removing the assertion that the tx_space_*
accounting is perfectly accurate (i.e. we never dirty more than was
predicted by dmu_tx_hold_*()). By removing the requirement that this
accounting be perfectly accurate, we can also vastly simplify it, e.g.
removing most of the logic in dmu_tx_count_*().
The new tx space accounting will be very approximate, and may be more or
less than what is actually dirtied. It will still be used to determine
if this transaction will put us over quota. Transactions that are marked
by dmu_tx_mark_netfree() will be excepted from this check. We won’t make
an attempt to determine how much space will be freed by the transaction
— this was rarely accurate enough to determine if a transaction should
be permitted when we are over quota, which is why dmu_tx_mark_netfree()
was introduced in 2014.
We also won’t attempt to give “credit” when overwriting existing blocks,
if those blocks may be freed. This allows us to remove the
do_free_accounting logic in dbuf_dirty(), and associated routines. This
logic attempted to predict what will be on disk when this txg syncs, to
know if the overwritten block will be freed (i.e. exists, and has no
snapshots).
OpenZFS-issue: https://www.illumos.org/issues/7793
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3704e0a
Upstream bugs: DLPX-32883a
Closes #5804
Porting notes:
- DNODE_SIZE replaced with DNODE_MIN_SIZE in dmu_tx_count_dnode(),
Using the default dnode size would be slightly better.
- DEBUG_DMU_TX wrappers and configure option removed.
- Resolved _by_dnode() conflicts these changes have not yet been
applied to OpenZFS.
2017-03-07 20:51:59 +03:00
|
|
|
}
|
|
|
|
|
2018-02-16 04:53:18 +03:00
|
|
|
#if defined(_KERNEL)
|
2012-04-05 00:46:55 +04:00
|
|
|
EXPORT_SYMBOL(dmu_objset_zil);
|
2010-08-26 22:49:16 +04:00
|
|
|
EXPORT_SYMBOL(dmu_objset_pool);
|
2012-04-05 00:46:55 +04:00
|
|
|
EXPORT_SYMBOL(dmu_objset_ds);
|
|
|
|
EXPORT_SYMBOL(dmu_objset_type);
|
2010-08-26 22:49:16 +04:00
|
|
|
EXPORT_SYMBOL(dmu_objset_name);
|
|
|
|
EXPORT_SYMBOL(dmu_objset_hold);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
EXPORT_SYMBOL(dmu_objset_hold_flags);
|
2010-08-26 22:49:16 +04:00
|
|
|
EXPORT_SYMBOL(dmu_objset_own);
|
|
|
|
EXPORT_SYMBOL(dmu_objset_rele);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
EXPORT_SYMBOL(dmu_objset_rele_flags);
|
2010-08-26 22:49:16 +04:00
|
|
|
EXPORT_SYMBOL(dmu_objset_disown);
|
|
|
|
EXPORT_SYMBOL(dmu_objset_from_ds);
|
|
|
|
EXPORT_SYMBOL(dmu_objset_create);
|
|
|
|
EXPORT_SYMBOL(dmu_objset_clone);
|
|
|
|
EXPORT_SYMBOL(dmu_objset_stats);
|
|
|
|
EXPORT_SYMBOL(dmu_objset_fast_stat);
|
2011-01-22 01:35:41 +03:00
|
|
|
EXPORT_SYMBOL(dmu_objset_spa);
|
2010-08-26 22:49:16 +04:00
|
|
|
EXPORT_SYMBOL(dmu_objset_space);
|
|
|
|
EXPORT_SYMBOL(dmu_objset_fsid_guid);
|
|
|
|
EXPORT_SYMBOL(dmu_objset_find);
|
|
|
|
EXPORT_SYMBOL(dmu_objset_byteswap);
|
|
|
|
EXPORT_SYMBOL(dmu_objset_evict_dbufs);
|
|
|
|
EXPORT_SYMBOL(dmu_objset_snap_cmtime);
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 04:25:34 +03:00
|
|
|
EXPORT_SYMBOL(dmu_objset_dnodesize);
|
2010-08-26 22:49:16 +04:00
|
|
|
|
|
|
|
EXPORT_SYMBOL(dmu_objset_sync);
|
|
|
|
EXPORT_SYMBOL(dmu_objset_is_dirty);
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
EXPORT_SYMBOL(dmu_objset_create_impl_dnstats);
|
2010-08-26 22:49:16 +04:00
|
|
|
EXPORT_SYMBOL(dmu_objset_create_impl);
|
|
|
|
EXPORT_SYMBOL(dmu_objset_open_impl);
|
|
|
|
EXPORT_SYMBOL(dmu_objset_evict);
|
|
|
|
EXPORT_SYMBOL(dmu_objset_register_type);
|
Improve zfs receive performance with lightweight write
The performance of `zfs receive` can be bottlenecked on the CPU consumed
by the `receive_writer` thread, especially when receiving streams with
small compressed block sizes. Much of the CPU is spent creating and
destroying dbuf's and arc buf's, one for each `WRITE` record in the send
stream.
This commit introduces the concept of "lightweight writes", which allows
`zfs receive` to write to the DMU by providing an ABD, and instantiating
only a new type of `dbuf_dirty_record_t`. The dbuf and arc buf for this
"dirty leaf block" are not instantiated.
Because there is no dbuf with the dirty data, this mechanism doesn't
support reading from "lightweight-dirty" blocks (they would see the
on-disk state rather than the dirty data). Since the dedup-receive code
has been removed, `zfs receive` is write-only, so this works fine.
Because there are no arc bufs for the received data, the received data
is no longer cached in the ARC.
Testing a receive of a stream with average compressed block size of 4KB,
this commit improves performance by 50%, while also reducing CPU usage
by 50% of a CPU. On a per-block basis, CPU consumed by receive_writer()
and dbuf_evict() is now 1/7th (14%) of what it was.
Baseline: 450MB/s, CPU in receive_writer() 40% + dbuf_evict() 35%
New: 670MB/s, CPU in receive_writer() 17% + dbuf_evict() 0%
The code is also restructured in a few ways:
Added a `dr_dnode` field to the dbuf_dirty_record_t. This simplifies
some existing code that no longer needs `DB_DNODE_ENTER()` and related
routines. The new field is needed by the lightweight-type dirty record.
To ensure that the `dr_dnode` field remains valid until the dirty record
is freed, we have to ensure that the `dnode_move()` doesn't relocate the
dnode_t. To do this we keep a hold on the dnode until it's zio's have
completed. This is already done by the user-accounting code
(`userquota_updates_task()`), this commit extends that so that it always
keeps the dnode hold until zio completion (see `dnode_rele_task()`).
`dn_dirty_txg` was previously zeroed when the dnode was synced. This
was not necessary, since its meaning can be "when was this dnode last
dirtied". This change simplifies the new `dnode_rele_task()` code.
Removed some dead code related to `DRR_WRITE_BYREF` (dedup receive).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #11105
2020-12-11 21:26:02 +03:00
|
|
|
EXPORT_SYMBOL(dmu_objset_sync_done);
|
2010-08-26 22:49:16 +04:00
|
|
|
EXPORT_SYMBOL(dmu_objset_userquota_get_ids);
|
|
|
|
EXPORT_SYMBOL(dmu_objset_userused_enabled);
|
|
|
|
EXPORT_SYMBOL(dmu_objset_userspace_upgrade);
|
|
|
|
EXPORT_SYMBOL(dmu_objset_userspace_present);
|
2016-10-04 21:46:10 +03:00
|
|
|
EXPORT_SYMBOL(dmu_objset_userobjused_enabled);
|
2016-11-10 00:51:12 +03:00
|
|
|
EXPORT_SYMBOL(dmu_objset_userobjspace_upgradable);
|
2016-10-04 21:46:10 +03:00
|
|
|
EXPORT_SYMBOL(dmu_objset_userobjspace_present);
|
2018-02-14 01:54:54 +03:00
|
|
|
EXPORT_SYMBOL(dmu_objset_projectquota_enabled);
|
|
|
|
EXPORT_SYMBOL(dmu_objset_projectquota_present);
|
|
|
|
EXPORT_SYMBOL(dmu_objset_projectquota_upgradable);
|
|
|
|
EXPORT_SYMBOL(dmu_objset_id_quota_upgrade);
|
2010-08-26 22:49:16 +04:00
|
|
|
#endif
|