2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* CDDL HEADER START
|
|
|
|
*
|
|
|
|
* The contents of this file are subject to the terms of the
|
|
|
|
* Common Development and Distribution License (the "License").
|
|
|
|
* You may not use this file except in compliance with the License.
|
|
|
|
*
|
|
|
|
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
|
2022-07-12 00:16:13 +03:00
|
|
|
* or https://opensource.org/licenses/CDDL-1.0.
|
2008-11-20 23:01:55 +03:00
|
|
|
* See the License for the specific language governing permissions
|
|
|
|
* and limitations under the License.
|
|
|
|
*
|
|
|
|
* When distributing Covered Code, include this CDDL HEADER in each
|
|
|
|
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
|
|
|
|
* If applicable, add the following below this CDDL HEADER, with the
|
|
|
|
* fields enclosed by brackets "[]" replaced with your own identifying
|
|
|
|
* information: Portions Copyright [yyyy] [name of copyright owner]
|
|
|
|
*
|
|
|
|
* CDDL HEADER END
|
|
|
|
*/
|
|
|
|
/*
|
2010-05-29 00:45:14 +04:00
|
|
|
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
|
2018-08-20 19:52:37 +03:00
|
|
|
* Copyright (c) 2012, 2018 by Delphix. All rights reserved.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/* Portions Copyright 2010 Robert Milkowski */
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
#include <sys/types.h>
|
|
|
|
#include <sys/param.h>
|
|
|
|
#include <sys/sysmacros.h>
|
|
|
|
#include <sys/kmem.h>
|
|
|
|
#include <sys/pathname.h>
|
|
|
|
#include <sys/vnode.h>
|
|
|
|
#include <sys/vfs.h>
|
|
|
|
#include <sys/mntent.h>
|
|
|
|
#include <sys/cmn_err.h>
|
|
|
|
#include <sys/zfs_znode.h>
|
2011-02-08 22:16:06 +03:00
|
|
|
#include <sys/zfs_vnops.h>
|
2008-11-20 23:01:55 +03:00
|
|
|
#include <sys/zfs_dir.h>
|
|
|
|
#include <sys/zil.h>
|
|
|
|
#include <sys/fs/zfs.h>
|
|
|
|
#include <sys/dmu.h>
|
|
|
|
#include <sys/dsl_prop.h>
|
|
|
|
#include <sys/dsl_dataset.h>
|
|
|
|
#include <sys/dsl_deleg.h>
|
|
|
|
#include <sys/spa.h>
|
|
|
|
#include <sys/zap.h>
|
2010-05-29 00:45:14 +04:00
|
|
|
#include <sys/sa.h>
|
2013-01-14 21:31:53 +04:00
|
|
|
#include <sys/sa_impl.h>
|
2008-11-20 23:01:55 +03:00
|
|
|
#include <sys/policy.h>
|
|
|
|
#include <sys/atomic.h>
|
|
|
|
#include <sys/zfs_ioctl.h>
|
2011-11-11 11:15:53 +04:00
|
|
|
#include <sys/zfs_ctldir.h>
|
2008-11-20 23:01:55 +03:00
|
|
|
#include <sys/zfs_fuid.h>
|
2019-12-11 23:12:08 +03:00
|
|
|
#include <sys/zfs_quota.h>
|
2008-11-20 23:01:55 +03:00
|
|
|
#include <sys/sunddi.h>
|
|
|
|
#include <sys/dmu_objset.h>
|
2020-04-01 20:02:06 +03:00
|
|
|
#include <sys/dsl_dir.h>
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
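
The filtering step described above (blocks listed in the redaction bookmark are left out of the send stream) can be pictured with a minimal sketch. This is not the OpenZFS implementation: the `sketch_redact_range_t` type, the sorted-range layout, and both function names are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one redaction list entry: a run of block ids. */
typedef struct sketch_redact_range {
	uint64_t start;	/* first redacted block id */
	uint64_t count;	/* number of consecutive redacted blocks */
} sketch_redact_range_t;

/*
 * Return 1 if blkid falls inside any range. The ranges are assumed to be
 * sorted by start and non-overlapping, so a binary search is enough.
 */
static int
sketch_is_redacted(const sketch_redact_range_t *r, size_t n, uint64_t blkid)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (blkid < r[mid].start)
			hi = mid;
		else if (blkid >= r[mid].start + r[mid].count)
			lo = mid + 1;
		else
			return (1);
	}
	return (0);
}

/*
 * Copy the block ids that are NOT redacted from blocks[] into out[] and
 * return how many were kept. out[] must have room for nblocks entries.
 */
size_t
sketch_filter_blocks(const sketch_redact_range_t *r, size_t nranges,
    const uint64_t *blocks, size_t nblocks, uint64_t *out)
{
	size_t kept = 0;

	assert(blocks != NULL && out != NULL);
	for (size_t i = 0; i < nblocks; i++) {
		if (!sketch_is_redacted(r, nranges, blocks[i]))
			out[kept++] = blocks[i];
	}
	return (kept);
}
```

With ranges {2, 2} and {10, 1}, candidate blocks 0, 1, 2, 3, 4, 10, 11 reduce to 0, 1, 4, 11. A sorted range list is only one plausible layout; it was chosen here because it keeps the lookup logarithmic.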
2019-06-19 19:48:13 +03:00
|
|
|
#include <sys/objlist.h>
|
2011-02-08 22:16:06 +03:00
|
|
|
#include <sys/zpl.h>
|
2019-01-11 02:28:44 +03:00
|
|
|
#include <linux/vfs_compat.h>
|
Adding Direct IO Support
Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads.
O_DIRECT support in ZFS will always ensure there is coherency between
buffered and O_DIRECT IO requests. This ensures that all IO requests,
whether buffered or direct, will see the same file contents at all
times. Just as in other filesystems, O_DIRECT does not imply O_SYNC. While
data is written directly to VDEV disks, metadata will not be synced
until the associated TXG is synced.
For both O_DIRECT read and write requests, the offset and request size
must, at a minimum, be PAGE_SIZE aligned. If they are not, EINVAL is
returned unless the direct property is set to always (see below).
For O_DIRECT writes:
The request must also be block aligned (recordsize), or the write
request will take the normal (buffered) write path. If the request is
block aligned and a cached copy of the buffer is in the ARC, the cached
copy is discarded from the ARC, forcing all further reads to retrieve
the data from disk.
For O_DIRECT reads:
The only alignment restriction is PAGE_SIZE alignment. If the requested
data is already buffered (in the ARC), it is simply copied from the ARC
into the user buffer.
For both O_DIRECT writes and reads, the O_DIRECT flag is ignored when
the file contents are mmap'ed. In this case, all requests that are at
least PAGE_SIZE aligned fall back to the buffered paths. A request that
is not PAGE_SIZE aligned still returns EINVAL, regardless of whether the
file's contents are mmap'ed.
Since O_DIRECT writes go through the normal ZIO pipeline, the
following operations are supported just as with normal buffered writes:
Checksum
Compression
Encryption
Erasure Coding
There is one caveat for the data integrity of O_DIRECT writes that
differs between the operating systems supported by ZFS.
FreeBSD - FreeBSD is able to place user pages under write protection so
any data in the user buffers and written directly down to the
VDEV disks is guaranteed to not change. There is no concern
with data integrity and O_DIRECT writes.
Linux - Linux is not able to place anonymous user pages under write
protection. Because of this, if the user manipulates the page
contents while the write operation is occurring, data integrity
cannot be guaranteed. However, the module parameter
`zfs_vdev_direct_write_verify` controls whether a checksum verify
is run on O_DIRECT writes to a top-level VDEV before the contents
of the I/O buffer are committed to disk. In the event of a checksum
verification failure the write will return EIO. The number of
O_DIRECT write checksum verification errors can be observed with
`zpool status -d`, which lists all verification errors that have
occurred on a top-level VDEV. Along with `zpool status`, a ZED
event is issued as `dio_verify` when a checksum verification error
occurs.
ZVOLs and dedup are not currently supported with Direct I/O.
A new dataset property `direct` has been added with the following 3
allowable values:
disabled - Accepts O_DIRECT flag, but silently ignores it and treats
the request as a buffered IO request.
standard - Follows the alignment restrictions outlined above for
write/read IO requests when the O_DIRECT flag is used.
always - Treats every write/read IO request as though it passed
O_DIRECT and will do O_DIRECT if the alignment restrictions
are met; otherwise the request is redirected through the ARC.
This property will not allow a request to fail.
There is also a module parameter zfs_dio_enabled that can be used to
force all reads and writes through the ARC. Setting this module
parameter to 0 mimics setting the direct dataset property to disabled.
Reviewed-by: Brian Behlendorf <behlendorf@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Co-authored-by: Mark Maybee <mark.maybee@delphix.com>
Co-authored-by: Matt Macy <mmacy@FreeBSD.org>
Co-authored-by: Brian Behlendorf <behlendorf@llnl.gov>
Closes #10018
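
The alignment rules and the three `direct` property values described above can be summarized as a small decision function. The sketch below is a user-space illustration only, not the code added by this change: every `sketch_*` name is hypothetical, and the order in which the checks are applied (in particular, that `always` turns a misaligned request into a buffered one even for mmap'ed files) is an assumption read out of the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirrors of the three `direct` property values. */
typedef enum {
	SKETCH_DIRECT_DISABLED,
	SKETCH_DIRECT_STANDARD,
	SKETCH_DIRECT_ALWAYS
} sketch_direct_t;

/* Possible outcomes for a single read or write request. */
typedef enum {
	SKETCH_PATH_BUFFERED,	/* goes through the ARC */
	SKETCH_PATH_DIRECT,	/* bypasses the ARC */
	SKETCH_PATH_EINVAL	/* request is rejected */
} sketch_path_t;

/* True if both off and len are multiples of align (a power of two). */
static bool
sketch_aligned(uint64_t off, uint64_t len, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (((off | len) & (align - 1)) == 0);
}

sketch_path_t
sketch_dio_path(sketch_direct_t prop, bool o_direct, bool is_write,
    bool mmapped, uint64_t off, uint64_t len, uint64_t page_size,
    uint64_t recordsize)
{
	/* disabled: the O_DIRECT flag is accepted but silently ignored. */
	if (prop == SKETCH_DIRECT_DISABLED)
		return (SKETCH_PATH_BUFFERED);

	/* standard: only requests that passed O_DIRECT are considered. */
	if (prop == SKETCH_DIRECT_STANDARD && !o_direct)
		return (SKETCH_PATH_BUFFERED);

	/* PAGE_SIZE alignment is the minimum for both reads and writes. */
	if (!sketch_aligned(off, len, page_size)) {
		/* always never fails a request; it falls back instead. */
		return (prop == SKETCH_DIRECT_ALWAYS ?
		    SKETCH_PATH_BUFFERED : SKETCH_PATH_EINVAL);
	}

	/* mmap'ed files fall back to the buffered paths. */
	if (mmapped)
		return (SKETCH_PATH_BUFFERED);

	/* Writes must additionally be recordsize (block) aligned. */
	if (is_write && !sketch_aligned(off, len, recordsize))
		return (SKETCH_PATH_BUFFERED);

	return (SKETCH_PATH_DIRECT);
}
```

For example, with a 4096-byte page and a 131072-byte recordsize under `standard`, a 4096-byte O_DIRECT write at offset 4096 takes the buffered path because it is not block aligned, while the same request issued as a read goes direct.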
2024-09-14 23:47:59 +03:00
|
|
|
#include <linux/fs.h>
|
2010-05-29 00:45:14 +04:00
|
|
|
#include "zfs_comutil.h"
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-03-09 03:56:09 +03:00
|
|
|
enum {
|
|
|
|
TOKEN_RO,
|
|
|
|
TOKEN_RW,
|
|
|
|
TOKEN_SETUID,
|
|
|
|
TOKEN_NOSETUID,
|
|
|
|
TOKEN_EXEC,
|
|
|
|
TOKEN_NOEXEC,
|
|
|
|
TOKEN_DEVICES,
|
|
|
|
TOKEN_NODEVICES,
|
|
|
|
TOKEN_DIRXATTR,
|
|
|
|
TOKEN_SAXATTR,
|
|
|
|
TOKEN_XATTR,
|
|
|
|
TOKEN_NOXATTR,
|
|
|
|
TOKEN_ATIME,
|
|
|
|
TOKEN_NOATIME,
|
|
|
|
TOKEN_RELATIME,
|
|
|
|
TOKEN_NORELATIME,
|
|
|
|
TOKEN_NBMAND,
|
|
|
|
TOKEN_NONBMAND,
|
|
|
|
TOKEN_MNTPOINT,
|
|
|
|
TOKEN_LAST,
|
|
|
|
};
|
|
|
|
|
|
|
|
static const match_table_t zpl_tokens = {
|
|
|
|
{ TOKEN_RO, MNTOPT_RO },
|
|
|
|
{ TOKEN_RW, MNTOPT_RW },
|
|
|
|
{ TOKEN_SETUID, MNTOPT_SETUID },
|
|
|
|
{ TOKEN_NOSETUID, MNTOPT_NOSETUID },
|
|
|
|
{ TOKEN_EXEC, MNTOPT_EXEC },
|
|
|
|
{ TOKEN_NOEXEC, MNTOPT_NOEXEC },
|
|
|
|
{ TOKEN_DEVICES, MNTOPT_DEVICES },
|
|
|
|
{ TOKEN_NODEVICES, MNTOPT_NODEVICES },
|
|
|
|
{ TOKEN_DIRXATTR, MNTOPT_DIRXATTR },
|
|
|
|
{ TOKEN_SAXATTR, MNTOPT_SAXATTR },
|
|
|
|
{ TOKEN_XATTR, MNTOPT_XATTR },
|
|
|
|
{ TOKEN_NOXATTR, MNTOPT_NOXATTR },
|
|
|
|
{ TOKEN_ATIME, MNTOPT_ATIME },
|
|
|
|
{ TOKEN_NOATIME, MNTOPT_NOATIME },
|
|
|
|
{ TOKEN_RELATIME, MNTOPT_RELATIME },
|
|
|
|
{ TOKEN_NORELATIME, MNTOPT_NORELATIME },
|
|
|
|
{ TOKEN_NBMAND, MNTOPT_NBMAND },
|
|
|
|
{ TOKEN_NONBMAND, MNTOPT_NONBMAND },
|
|
|
|
{ TOKEN_MNTPOINT, MNTOPT_MNTPOINT "=%s" },
|
|
|
|
{ TOKEN_LAST, NULL },
|
|
|
|
};
|
|
|
|
|
|
|
|
static void
|
|
|
|
zfsvfs_vfs_free(vfs_t *vfsp)
|
|
|
|
{
|
|
|
|
if (vfsp != NULL) {
|
|
|
|
if (vfsp->vfs_mntpoint != NULL)
|
2019-10-10 19:47:06 +03:00
|
|
|
kmem_strfree(vfsp->vfs_mntpoint);
|
2017-03-09 03:56:09 +03:00
|
|
|
|
|
|
|
kmem_free(vfsp, sizeof (vfs_t));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static int
|
|
|
|
zfsvfs_parse_option(char *option, int token, substring_t *args, vfs_t *vfsp)
|
|
|
|
{
|
|
|
|
switch (token) {
|
|
|
|
case TOKEN_RO:
|
|
|
|
vfsp->vfs_readonly = B_TRUE;
|
|
|
|
vfsp->vfs_do_readonly = B_TRUE;
|
|
|
|
break;
|
|
|
|
case TOKEN_RW:
|
|
|
|
vfsp->vfs_readonly = B_FALSE;
|
|
|
|
vfsp->vfs_do_readonly = B_TRUE;
|
|
|
|
break;
|
|
|
|
case TOKEN_SETUID:
|
|
|
|
vfsp->vfs_setuid = B_TRUE;
|
|
|
|
vfsp->vfs_do_setuid = B_TRUE;
|
|
|
|
break;
|
|
|
|
case TOKEN_NOSETUID:
|
|
|
|
vfsp->vfs_setuid = B_FALSE;
|
|
|
|
vfsp->vfs_do_setuid = B_TRUE;
|
|
|
|
break;
|
|
|
|
case TOKEN_EXEC:
|
|
|
|
vfsp->vfs_exec = B_TRUE;
|
|
|
|
vfsp->vfs_do_exec = B_TRUE;
|
|
|
|
break;
|
|
|
|
case TOKEN_NOEXEC:
|
|
|
|
vfsp->vfs_exec = B_FALSE;
|
|
|
|
vfsp->vfs_do_exec = B_TRUE;
|
|
|
|
break;
|
|
|
|
case TOKEN_DEVICES:
|
|
|
|
vfsp->vfs_devices = B_TRUE;
|
|
|
|
vfsp->vfs_do_devices = B_TRUE;
|
|
|
|
break;
|
|
|
|
case TOKEN_NODEVICES:
|
|
|
|
vfsp->vfs_devices = B_FALSE;
|
|
|
|
vfsp->vfs_do_devices = B_TRUE;
|
|
|
|
break;
|
|
|
|
case TOKEN_DIRXATTR:
|
|
|
|
vfsp->vfs_xattr = ZFS_XATTR_DIR;
|
|
|
|
vfsp->vfs_do_xattr = B_TRUE;
|
|
|
|
break;
|
|
|
|
case TOKEN_SAXATTR:
|
|
|
|
vfsp->vfs_xattr = ZFS_XATTR_SA;
|
|
|
|
vfsp->vfs_do_xattr = B_TRUE;
|
|
|
|
break;
|
|
|
|
case TOKEN_XATTR:
|
|
|
|
vfsp->vfs_xattr = ZFS_XATTR_DIR;
|
|
|
|
vfsp->vfs_do_xattr = B_TRUE;
|
|
|
|
break;
|
|
|
|
case TOKEN_NOXATTR:
|
|
|
|
vfsp->vfs_xattr = ZFS_XATTR_OFF;
|
|
|
|
vfsp->vfs_do_xattr = B_TRUE;
|
|
|
|
break;
|
|
|
|
case TOKEN_ATIME:
|
|
|
|
vfsp->vfs_atime = B_TRUE;
|
|
|
|
vfsp->vfs_do_atime = B_TRUE;
|
|
|
|
break;
|
|
|
|
case TOKEN_NOATIME:
|
|
|
|
vfsp->vfs_atime = B_FALSE;
|
|
|
|
vfsp->vfs_do_atime = B_TRUE;
|
|
|
|
break;
|
|
|
|
case TOKEN_RELATIME:
|
|
|
|
vfsp->vfs_relatime = B_TRUE;
|
|
|
|
vfsp->vfs_do_relatime = B_TRUE;
|
|
|
|
break;
|
|
|
|
case TOKEN_NORELATIME:
|
|
|
|
vfsp->vfs_relatime = B_FALSE;
|
|
|
|
vfsp->vfs_do_relatime = B_TRUE;
|
|
|
|
break;
|
|
|
|
case TOKEN_NBMAND:
|
|
|
|
vfsp->vfs_nbmand = B_TRUE;
|
|
|
|
vfsp->vfs_do_nbmand = B_TRUE;
|
|
|
|
break;
|
|
|
|
case TOKEN_NONBMAND:
|
|
|
|
vfsp->vfs_nbmand = B_FALSE;
|
|
|
|
vfsp->vfs_do_nbmand = B_TRUE;
|
|
|
|
break;
|
|
|
|
case TOKEN_MNTPOINT:
|
|
|
|
if (vfsp->vfs_mntpoint != NULL)
|
|
|
|
kmem_strfree(vfsp->vfs_mntpoint);
|
|
|
|
vfsp->vfs_mntpoint = match_strdup(&args[0]);
|
|
|
|
if (vfsp->vfs_mntpoint == NULL)
|
|
|
|
return (SET_ERROR(ENOMEM));
|
|
|
|
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Parse the raw mntopts and return a vfs_t describing the options.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
zfsvfs_parse_options(char *mntopts, vfs_t **vfsp)
|
|
|
|
{
|
|
|
|
vfs_t *tmp_vfsp;
|
|
|
|
int error;
|
|
|
|
|
|
|
|
tmp_vfsp = kmem_zalloc(sizeof (vfs_t), KM_SLEEP);
|
|
|
|
|
|
|
|
if (mntopts != NULL) {
|
|
|
|
substring_t args[MAX_OPT_ARGS];
|
|
|
|
char *tmp_mntopts, *p, *t;
|
|
|
|
int token;
|
|
|
|
|
2019-10-10 19:47:06 +03:00
|
|
|
tmp_mntopts = t = kmem_strdup(mntopts);
|
2017-03-09 03:56:09 +03:00
|
|
|
if (tmp_mntopts == NULL) {
|
|
|
|
zfsvfs_vfs_free(tmp_vfsp);
|
|
|
|
return (SET_ERROR(ENOMEM));
|
|
|
|
}
|
|
|
|
|
|
|
|
while ((p = strsep(&t, ",")) != NULL) {
|
|
|
|
if (!*p)
|
|
|
|
continue;
|
|
|
|
|
|
|
|
args[0].to = args[0].from = NULL;
|
|
|
|
token = match_token(p, zpl_tokens, args);
|
|
|
|
error = zfsvfs_parse_option(p, token, args, tmp_vfsp);
|
|
|
|
if (error) {
|
2019-10-10 19:47:06 +03:00
|
|
|
kmem_strfree(tmp_mntopts);
|
2017-03-09 03:56:09 +03:00
|
|
|
zfsvfs_vfs_free(tmp_vfsp);
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2019-10-10 19:47:06 +03:00
|
|
|
kmem_strfree(tmp_mntopts);
|
2017-03-09 03:56:09 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
*vfsp = tmp_vfsp;
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
boolean_t
|
|
|
|
zfs_is_readonly(zfsvfs_t *zfsvfs)
|
|
|
|
{
|
2019-01-11 02:28:44 +03:00
|
|
|
return (!!(zfsvfs->z_sb->s_flags & SB_RDONLY));
|
2017-03-09 03:56:09 +03:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
int
|
2011-03-15 22:03:42 +03:00
|
|
|
zfs_sync(struct super_block *sb, int wait, cred_t *cr)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2022-02-16 04:38:43 +03:00
|
|
|
(void) cr;
|
2017-03-08 03:21:37 +03:00
|
|
|
zfsvfs_t *zfsvfs = sb->s_fs_info;
|
2011-03-15 22:03:42 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Semantically, the only requirement is that the sync be initiated.
|
|
|
|
* The DMU syncs out txgs frequently, so there's nothing to do.
|
|
|
|
*/
|
|
|
|
if (!wait)
|
|
|
|
return (0);
|
|
|
|
|
2017-03-08 03:21:37 +03:00
|
|
|
if (zfsvfs != NULL) {
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* Sync a specific filesystem.
|
|
|
|
*/
|
2009-07-03 02:44:48 +04:00
|
|
|
dsl_pool_t *dp;
|
2022-09-16 23:36:47 +03:00
|
|
|
int error;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2022-09-16 23:36:47 +03:00
|
|
|
if ((error = zfs_enter(zfsvfs, FTAG)) != 0)
|
|
|
|
return (error);
|
2017-03-08 03:21:37 +03:00
|
|
|
dp = dmu_objset_pool(zfsvfs->z_os);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If the system is shutting down, then skip any
|
|
|
|
* filesystems which may exist on a suspended pool.
|
|
|
|
*/
|
2011-03-15 22:03:42 +03:00
|
|
|
if (spa_suspended(dp->dp_spa)) {
|
2022-09-16 23:36:47 +03:00
|
|
|
zfs_exit(zfsvfs, FTAG);
|
2009-07-03 02:44:48 +04:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2017-03-08 03:21:37 +03:00
|
|
|
if (zfsvfs->z_log != NULL)
|
|
|
|
zil_commit(zfsvfs->z_log, 0);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2022-09-16 23:36:47 +03:00
|
|
|
zfs_exit(zfsvfs, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
|
|
|
/*
|
|
|
|
* Sync all ZFS filesystems. This is what happens when you
|
2020-10-30 18:55:59 +03:00
|
|
|
* run sync(1). Unlike other filesystems, ZFS honors the
|
2008-11-20 23:01:55 +03:00
|
|
|
* request by waiting for all pools to commit all dirty data.
|
|
|
|
*/
|
|
|
|
spa_sync_allpools();
|
|
|
|
}
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
atime_changed_cb(void *arg, uint64_t newval)
|
|
|
|
{
|
Fix `zfs set atime|relatime=off|on` behavior on inherited datasets
`zfs set atime|relatime=off|on` doesn't disable or enable the property
on read for datasets whose property was inherited from the parent until
the dataset is unmounted and mounted again.
(The properties start to work properly once the dataset has been
unmounted and mounted again. The difference arises because the regular
mount process, e.g. via zpool import, uses mount options based on
properties read from the on-disk layout for each dataset, whereas
`zfs set atime|relatime=off|on` just remounts the specified dataset.)
--
# zpool create p1 <device>
# zfs create p1/f1
# zfs set atime=off p1
# echo test > /p1/f1/test
# sync
# zfs list
NAME USED AVAIL REFER MOUNTPOINT
p1 176K 18.9G 25.5K /p1
p1/f1 26K 18.9G 26K /p1/f1
# zfs get atime
NAME PROPERTY VALUE SOURCE
p1 atime off local
p1/f1 atime off inherited from p1
# stat /p1/f1/test | grep Access | tail -1
Access: 2019-04-26 23:32:33.741205192 +0900
# cat /p1/f1/test
test
# stat /p1/f1/test | grep Access | tail -1
Access: 2019-04-26 23:32:50.173231861 +0900
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ changed by read(2)
--
The problem is that zfsvfs::z_atime, which was probably intended to keep
the in-core atime state, only gets updated by the callback for "atime"
property changes, atime_changed_cb(), and is never used for anything else.
All file reads and atime updates now go through a common path,
zpl_iter_read_common() -> file_accessed(), and whether to update atime
via ->dirty_inode() is determined by atime_needs_update(), so
atime_needs_update() needs to return false once atime is turned off.
It currently continues to return true after `zfs set atime=off`.
Fix atime_changed_cb() by setting or clearing SB_NOATIME in the VFS super
block depending on the new atime value, so that atime_needs_update() works
as expected after a property change.
The same problem applies to "relatime", except that a self-contained
relatime test is needed. This is because relatime_need_update() is based
on the mount option flag MNT_RELATIME, which is not set for datasets
that inherit the "relatime" property via `zfs set relatime=...`; hence
ZFS needs its own relatime test, zfs_relatime_need_update().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8674
Closes #8675
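
As a rough picture of what a self-contained relatime test has to decide, the sketch below follows the usual Linux relatime rule: update atime when it is not newer than mtime or ctime, or when it is at least a day old. It is an assumption that zfs_relatime_need_update() follows the same rule. The `sketch_times` struct, the function name, and the use of whole seconds are hypothetical simplifications.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed relatime window: atime is refreshed at least once per day. */
#define	SKETCH_RELATIME_INTERVAL	(24 * 60 * 60)

/* Hypothetical stand-in for the three inode timestamps, in seconds. */
struct sketch_times {
	int64_t atime;
	int64_t mtime;
	int64_t ctime;
};

/*
 * Decide whether a read at time `now` should update atime when relatime
 * is in effect. Returns true if an update is needed.
 */
bool
sketch_relatime_need_update(const struct sketch_times *t, int64_t now)
{
	assert(t != NULL);

	/* The file was modified at or after the last recorded access. */
	if (t->mtime >= t->atime)
		return (true);

	/* The inode was changed at or after the last recorded access. */
	if (t->ctime >= t->atime)
		return (true);

	/* The recorded access time is at least one interval old. */
	if (now - t->atime >= SKETCH_RELATIME_INTERVAL)
		return (true);

	/* atime is recent enough and newer than mtime/ctime: skip. */
	return (false);
}
```

A file with atime 300 and mtime/ctime 200 is left alone when read at time 400, but gets a new atime when read at 300 + 86400 or later.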
2019-05-07 20:06:30 +03:00
|
|
|
zfsvfs_t *zfsvfs = arg;
|
|
|
|
struct super_block *sb = zfsvfs->z_sb;
|
|
|
|
|
|
|
|
if (sb == NULL)
|
|
|
|
return;
|
|
|
|
/*
|
|
|
|
* Update SB_NOATIME bit in VFS super block. Since atime update is
|
|
|
|
* determined by atime_needs_update(), atime_needs_update() needs to
|
|
|
|
* return false if atime is turned off, and not unconditionally return
|
|
|
|
* false if atime is turned on.
|
|
|
|
*/
|
|
|
|
if (newval)
|
|
|
|
sb->s_flags &= ~SB_NOATIME;
|
|
|
|
else
|
|
|
|
sb->s_flags |= SB_NOATIME;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2014-01-18 23:00:53 +04:00
|
|
|
static void
|
|
|
|
relatime_changed_cb(void *arg, uint64_t newval)
|
|
|
|
{
|
2017-03-08 03:21:37 +03:00
|
|
|
((zfsvfs_t *)arg)->z_relatime = newval;
|
2014-01-18 23:00:53 +04:00
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
xattr_changed_cb(void *arg, uint64_t newval)
|
|
|
|
{
|
2017-03-08 03:21:37 +03:00
|
|
|
zfsvfs_t *zfsvfs = arg;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2011-10-25 03:55:20 +04:00
|
|
|
if (newval == ZFS_XATTR_OFF) {
|
2017-03-08 03:21:37 +03:00
|
|
|
zfsvfs->z_flags &= ~ZSB_XATTR;
|
2011-10-25 03:55:20 +04:00
|
|
|
} else {
|
2017-03-08 03:21:37 +03:00
|
|
|
zfsvfs->z_flags |= ZSB_XATTR;
|
2011-10-25 03:55:20 +04:00
|
|
|
|
|
|
|
if (newval == ZFS_XATTR_SA)
|
2017-03-08 03:21:37 +03:00
|
|
|
zfsvfs->z_xattr_sa = B_TRUE;
|
2011-10-25 03:55:20 +04:00
|
|
|
else
|
2017-03-08 03:21:37 +03:00
|
|
|
zfsvfs->z_xattr_sa = B_FALSE;
|
2011-10-25 03:55:20 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2013-10-28 20:22:15 +04:00
|
|
|
static void
|
|
|
|
acltype_changed_cb(void *arg, uint64_t newval)
|
|
|
|
{
|
2017-03-08 03:21:37 +03:00
|
|
|
zfsvfs_t *zfsvfs = arg;
|
2013-10-28 20:22:15 +04:00
|
|
|
|
|
|
|
switch (newval) {
|
2020-10-14 07:25:48 +03:00
|
|
|
case ZFS_ACLTYPE_NFSV4:
|
2013-10-28 20:22:15 +04:00
|
|
|
case ZFS_ACLTYPE_OFF:
|
2017-03-08 03:21:37 +03:00
|
|
|
zfsvfs->z_acl_type = ZFS_ACLTYPE_OFF;
|
2019-01-11 02:28:44 +03:00
|
|
|
zfsvfs->z_sb->s_flags &= ~SB_POSIXACL;
|
2013-10-28 20:22:15 +04:00
|
|
|
break;
|
2020-09-16 22:26:06 +03:00
|
|
|
case ZFS_ACLTYPE_POSIX:
|
2013-11-03 03:40:26 +04:00
|
|
|
#ifdef CONFIG_FS_POSIX_ACL
|
2020-09-16 22:26:06 +03:00
|
|
|
zfsvfs->z_acl_type = ZFS_ACLTYPE_POSIX;
|
2019-01-11 02:28:44 +03:00
|
|
|
zfsvfs->z_sb->s_flags |= SB_POSIXACL;
|
2013-11-03 03:40:26 +04:00
|
|
|
#else
|
2017-03-08 03:21:37 +03:00
|
|
|
zfsvfs->z_acl_type = ZFS_ACLTYPE_OFF;
|
2019-01-11 02:28:44 +03:00
|
|
|
zfsvfs->z_sb->s_flags &= ~SB_POSIXACL;
|
2013-11-03 03:40:26 +04:00
|
|
|
#endif /* CONFIG_FS_POSIX_ACL */
|
2013-10-28 20:22:15 +04:00
|
|
|
break;
|
|
|
|
default:
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
blksz_changed_cb(void *arg, uint64_t newval)
|
|
|
|
{
|
2017-03-08 03:21:37 +03:00
|
|
|
zfsvfs_t *zfsvfs = arg;
|
|
|
|
ASSERT3U(newval, <=, spa_maxblocksize(dmu_objset_spa(zfsvfs->z_os)));
|
2014-11-03 23:15:08 +03:00
|
|
|
ASSERT3U(newval, >=, SPA_MINBLOCKSIZE);
|
|
|
|
ASSERT(ISP2(newval));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-03-08 03:21:37 +03:00
|
|
|
zfsvfs->z_max_blksz = newval;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
readonly_changed_cb(void *arg, uint64_t newval)
|
|
|
|
{
|
2017-03-08 03:21:37 +03:00
|
|
|
zfsvfs_t *zfsvfs = arg;
|
|
|
|
struct super_block *sb = zfsvfs->z_sb;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2011-05-19 22:44:07 +04:00
|
|
|
if (sb == NULL)
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (newval)
|
2019-01-11 02:28:44 +03:00
|
|
|
sb->s_flags |= SB_RDONLY;
|
2011-05-19 22:44:07 +04:00
|
|
|
else
|
2019-01-11 02:28:44 +03:00
|
|
|
sb->s_flags &= ~SB_RDONLY;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
devices_changed_cb(void *arg, uint64_t newval)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
setuid_changed_cb(void *arg, uint64_t newval)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
exec_changed_cb(void *arg, uint64_t newval)
|
|
|
|
{
|
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
nbmand_changed_cb(void *arg, uint64_t newval)
|
|
|
|
{
|
2017-03-08 03:21:37 +03:00
|
|
|
zfsvfs_t *zfsvfs = arg;
|
|
|
|
struct super_block *sb = zfsvfs->z_sb;
|
2011-02-08 22:16:06 +03:00
|
|
|
|
2011-05-19 22:44:07 +04:00
|
|
|
if (sb == NULL)
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (newval == TRUE)
|
2019-01-11 02:28:44 +03:00
|
|
|
sb->s_flags |= SB_MANDLOCK;
|
2011-05-19 22:44:07 +04:00
|
|
|
else
|
2019-01-11 02:28:44 +03:00
|
|
|
sb->s_flags &= ~SB_MANDLOCK;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
static void
|
|
|
|
snapdir_changed_cb(void *arg, uint64_t newval)
|
|
|
|
{
|
2017-03-08 03:21:37 +03:00
|
|
|
((zfsvfs_t *)arg)->z_show_ctldir = newval;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2019-12-05 03:35:18 +03:00
|
|
|
static void
|
|
|
|
acl_mode_changed_cb(void *arg, uint64_t newval)
|
|
|
|
{
|
|
|
|
zfsvfs_t *zfsvfs = arg;
|
|
|
|
|
|
|
|
zfsvfs->z_acl_mode = newval;
|
|
|
|
}
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
static void
|
|
|
|
acl_inherit_changed_cb(void *arg, uint64_t newval)
|
|
|
|
{
|
2017-03-08 03:21:37 +03:00
|
|
|
((zfsvfs_t *)arg)->z_acl_inherit = newval;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2017-03-09 03:56:09 +03:00
|
|
|
static int
|
|
|
|
zfs_register_callbacks(vfs_t *vfsp)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
struct dsl_dataset *ds = NULL;
|
2017-03-09 03:56:09 +03:00
|
|
|
objset_t *os = NULL;
|
|
|
|
zfsvfs_t *zfsvfs = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
int error = 0;
|
|
|
|
|
2017-03-09 03:56:09 +03:00
|
|
|
ASSERT(vfsp);
|
|
|
|
zfsvfs = vfsp->vfs_data;
|
2017-03-08 03:21:37 +03:00
|
|
|
ASSERT(zfsvfs);
|
2017-03-09 03:56:09 +03:00
|
|
|
os = zfsvfs->z_os;
|
2015-09-01 02:46:01 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* The act of registering our callbacks will destroy any mount
|
|
|
|
* options we may have. In order to enable temporary overrides
|
|
|
|
* of mount options, we stash away the current values and
|
|
|
|
* restore them after we register the callbacks.
|
|
|
|
*/
|
2017-03-08 03:21:37 +03:00
|
|
|
if (zfs_is_readonly(zfsvfs) || !spa_writeable(dmu_objset_spa(os))) {
|
2017-03-09 03:56:09 +03:00
|
|
|
vfsp->vfs_do_readonly = B_TRUE;
|
|
|
|
vfsp->vfs_readonly = B_TRUE;
|
2015-09-01 02:46:01 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Register property callbacks.
|
|
|
|
*
|
|
|
|
* It would probably be fine to just check for i/o error from
|
|
|
|
* the first prop_register(), but I guess I like to go
|
|
|
|
* overboard...
|
|
|
|
*/
|
|
|
|
ds = dmu_objset_ds(os);
|
2013-09-04 16:00:57 +04:00
|
|
|
dsl_pool_config_enter(dmu_objset_pool(os), FTAG);
|
2011-02-08 22:16:06 +03:00
|
|
|
error = dsl_prop_register(ds,
|
2017-03-08 03:21:37 +03:00
|
|
|
zfs_prop_to_name(ZFS_PROP_ATIME), atime_changed_cb, zfsvfs);
|
2014-02-11 17:34:17 +04:00
|
|
|
error = error ? error : dsl_prop_register(ds,
|
2017-03-08 03:21:37 +03:00
|
|
|
zfs_prop_to_name(ZFS_PROP_RELATIME), relatime_changed_cb, zfsvfs);
|
2008-11-20 23:01:55 +03:00
|
|
|
error = error ? error : dsl_prop_register(ds,
|
2017-03-08 03:21:37 +03:00
|
|
|
zfs_prop_to_name(ZFS_PROP_XATTR), xattr_changed_cb, zfsvfs);
|
2008-11-20 23:01:55 +03:00
|
|
|
error = error ? error : dsl_prop_register(ds,
|
2017-03-08 03:21:37 +03:00
|
|
|
zfs_prop_to_name(ZFS_PROP_RECORDSIZE), blksz_changed_cb, zfsvfs);
|
2008-11-20 23:01:55 +03:00
|
|
|
error = error ? error : dsl_prop_register(ds,
|
2017-03-08 03:21:37 +03:00
|
|
|
zfs_prop_to_name(ZFS_PROP_READONLY), readonly_changed_cb, zfsvfs);
|
2008-11-20 23:01:55 +03:00
|
|
|
error = error ? error : dsl_prop_register(ds,
|
2017-03-08 03:21:37 +03:00
|
|
|
zfs_prop_to_name(ZFS_PROP_DEVICES), devices_changed_cb, zfsvfs);
|
2008-11-20 23:01:55 +03:00
|
|
|
error = error ? error : dsl_prop_register(ds,
|
2017-03-08 03:21:37 +03:00
|
|
|
zfs_prop_to_name(ZFS_PROP_SETUID), setuid_changed_cb, zfsvfs);
|
2008-11-20 23:01:55 +03:00
|
|
|
error = error ? error : dsl_prop_register(ds,
|
2017-03-08 03:21:37 +03:00
|
|
|
zfs_prop_to_name(ZFS_PROP_EXEC), exec_changed_cb, zfsvfs);
|
2008-11-20 23:01:55 +03:00
|
|
|
error = error ? error : dsl_prop_register(ds,
|
2017-03-08 03:21:37 +03:00
|
|
|
zfs_prop_to_name(ZFS_PROP_SNAPDIR), snapdir_changed_cb, zfsvfs);
|
2008-11-20 23:01:55 +03:00
|
|
|
error = error ? error : dsl_prop_register(ds,
|
2017-03-08 03:21:37 +03:00
|
|
|
zfs_prop_to_name(ZFS_PROP_ACLTYPE), acltype_changed_cb, zfsvfs);
|
2019-12-05 03:35:18 +03:00
|
|
|
error = error ? error : dsl_prop_register(ds,
|
|
|
|
zfs_prop_to_name(ZFS_PROP_ACLMODE), acl_mode_changed_cb, zfsvfs);
|
2013-10-28 20:22:15 +04:00
|
|
|
error = error ? error : dsl_prop_register(ds,
|
2017-03-08 03:21:37 +03:00
|
|
|
zfs_prop_to_name(ZFS_PROP_ACLINHERIT), acl_inherit_changed_cb,
|
|
|
|
zfsvfs);
|
2011-05-19 22:44:07 +04:00
|
|
|
error = error ? error : dsl_prop_register(ds,
|
2017-03-08 03:21:37 +03:00
|
|
|
zfs_prop_to_name(ZFS_PROP_NBMAND), nbmand_changed_cb, zfsvfs);
|
2013-09-04 16:00:57 +04:00
|
|
|
dsl_pool_config_exit(dmu_objset_pool(os), FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
if (error)
|
|
|
|
goto unregister;
|
|
|
|
|
2015-09-01 02:46:01 +03:00
|
|
|
/*
|
|
|
|
* Invoke our callbacks to restore temporary mount options.
|
|
|
|
*/
|
2017-03-09 03:56:09 +03:00
|
|
|
if (vfsp->vfs_do_readonly)
|
|
|
|
readonly_changed_cb(zfsvfs, vfsp->vfs_readonly);
|
|
|
|
if (vfsp->vfs_do_setuid)
|
|
|
|
setuid_changed_cb(zfsvfs, vfsp->vfs_setuid);
|
|
|
|
if (vfsp->vfs_do_exec)
|
|
|
|
exec_changed_cb(zfsvfs, vfsp->vfs_exec);
|
|
|
|
if (vfsp->vfs_do_devices)
|
|
|
|
devices_changed_cb(zfsvfs, vfsp->vfs_devices);
|
|
|
|
if (vfsp->vfs_do_xattr)
|
|
|
|
xattr_changed_cb(zfsvfs, vfsp->vfs_xattr);
|
|
|
|
if (vfsp->vfs_do_atime)
|
|
|
|
atime_changed_cb(zfsvfs, vfsp->vfs_atime);
|
|
|
|
if (vfsp->vfs_do_relatime)
|
|
|
|
relatime_changed_cb(zfsvfs, vfsp->vfs_relatime);
|
|
|
|
if (vfsp->vfs_do_nbmand)
|
|
|
|
nbmand_changed_cb(zfsvfs, vfsp->vfs_nbmand);
|
2013-07-17 20:15:46 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
return (0);
|
|
|
|
|
|
|
|
unregister:
|
2017-03-08 03:21:37 +03:00
|
|
|
dsl_prop_unregister_all(ds, zfsvfs);
|
2011-02-08 22:16:06 +03:00
|
|
|
return (error);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2019-10-11 01:59:34 +03:00
|
|
|
/*
|
|
|
|
* Takes a dataset, a property, a value and that value's setpoint as
|
|
|
|
* found in the ZAP. Checks if the property has been changed in the vfs.
|
|
|
|
* If so, val and setpoint will be overwritten with updated content.
|
|
|
|
* Otherwise, they are left unchanged.
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
zfs_get_temporary_prop(dsl_dataset_t *ds, zfs_prop_t zfs_prop, uint64_t *val,
|
|
|
|
char *setpoint)
|
|
|
|
{
|
|
|
|
int error;
|
|
|
|
zfsvfs_t *zfvp;
|
|
|
|
vfs_t *vfsp;
|
|
|
|
objset_t *os;
|
|
|
|
uint64_t tmp = *val;
|
|
|
|
|
|
|
|
error = dmu_objset_from_ds(ds, &os);
|
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
if (dmu_objset_type(os) != DMU_OST_ZFS)
|
|
|
|
return (EINVAL);
|
|
|
|
|
|
|
|
mutex_enter(&os->os_user_ptr_lock);
|
|
|
|
zfvp = dmu_objset_get_user(os);
|
|
|
|
mutex_exit(&os->os_user_ptr_lock);
|
|
|
|
if (zfvp == NULL)
|
|
|
|
return (ESRCH);
|
|
|
|
|
|
|
|
vfsp = zfvp->z_vfs;
|
|
|
|
|
|
|
|
switch (zfs_prop) {
|
|
|
|
case ZFS_PROP_ATIME:
|
|
|
|
if (vfsp->vfs_do_atime)
|
|
|
|
tmp = vfsp->vfs_atime;
|
|
|
|
break;
|
|
|
|
case ZFS_PROP_RELATIME:
|
|
|
|
if (vfsp->vfs_do_relatime)
|
|
|
|
tmp = vfsp->vfs_relatime;
|
|
|
|
break;
|
|
|
|
case ZFS_PROP_DEVICES:
|
|
|
|
if (vfsp->vfs_do_devices)
|
|
|
|
tmp = vfsp->vfs_devices;
|
|
|
|
break;
|
|
|
|
case ZFS_PROP_EXEC:
|
|
|
|
if (vfsp->vfs_do_exec)
|
|
|
|
tmp = vfsp->vfs_exec;
|
|
|
|
break;
|
|
|
|
case ZFS_PROP_SETUID:
|
|
|
|
if (vfsp->vfs_do_setuid)
|
|
|
|
tmp = vfsp->vfs_setuid;
|
|
|
|
break;
|
|
|
|
case ZFS_PROP_READONLY:
|
|
|
|
if (vfsp->vfs_do_readonly)
|
|
|
|
tmp = vfsp->vfs_readonly;
|
|
|
|
break;
|
|
|
|
case ZFS_PROP_XATTR:
|
|
|
|
if (vfsp->vfs_do_xattr)
|
|
|
|
tmp = vfsp->vfs_xattr;
|
|
|
|
break;
|
|
|
|
case ZFS_PROP_NBMAND:
|
|
|
|
if (vfsp->vfs_do_nbmand)
|
|
|
|
tmp = vfsp->vfs_nbmand;
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
return (ENOENT);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (tmp != *val) {
|
2022-12-23 08:00:38 +03:00
|
|
|
if (setpoint)
|
|
|
|
(void) strcpy(setpoint, "temporary");
|
2019-10-11 01:59:34 +03:00
|
|
|
*val = tmp;
|
|
|
|
}
|
|
|
|
return (0);
|
|
|
|
}
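/*
 * Illustrative sketch, not part of this file: the override pattern used by
 * zfs_get_temporary_prop() above, reduced to one property. The names
 * example_vfs_t and example_apply_temporary() are hypothetical; the real
 * code reads the vfs_do_* and vfs_* fields of vfs_t.
 */
```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the per-mount override state kept in vfs_t. */
typedef struct example_vfs {
	int		ev_do_atime;	/* nonzero if a mount option overrides atime */
	uint64_t	ev_atime;	/* the overriding value */
} example_vfs_t;

/*
 * Replace *val with the mount-time override when one is set and differs.
 * setpoint may be NULL. Returns 1 when *val was overwritten, 0 otherwise.
 */
static int
example_apply_temporary(const example_vfs_t *vfs, uint64_t *val,
    char *setpoint, size_t setpoint_len)
{
	uint64_t tmp;

	assert(vfs != NULL && val != NULL);
	tmp = *val;

	if (vfs->ev_do_atime)
		tmp = vfs->ev_atime;

	/* Nothing changed: leave val and setpoint untouched. */
	if (tmp == *val)
		return (0);

	if (setpoint != NULL && setpoint_len > 0) {
		strncpy(setpoint, "temporary", setpoint_len - 1);
		setpoint[setpoint_len - 1] = '\0';
	}
	*val = tmp;
	return (1);
}
```
/*
 * The sketch takes an explicit buffer length, which is an assumption of the
 * example; the function above relies on the caller's setpoint buffer being
 * large enough for "temporary".
 */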
|
|
|
|
|
2016-05-11 06:49:02 +03:00
|
|
|
/*
|
|
|
|
* Associate this zfsvfs with the given objset, which must be owned.
|
|
|
|
* This will cache a bunch of on-disk state from the objset in the
|
|
|
|
* zfsvfs.
|
|
|
|
*/
|
|
|
|
static int
|
|
|
|
zfsvfs_init(zfsvfs_t *zfsvfs, objset_t *os)
|
2009-07-03 02:44:48 +04:00
|
|
|
{
|
2016-05-11 06:49:02 +03:00
|
|
|
int error;
|
|
|
|
uint64_t val;
|
2015-09-01 02:46:01 +03:00
|
|
|
|
2017-03-08 03:21:37 +03:00
|
|
|
zfsvfs->z_max_blksz = SPA_OLD_MAXBLOCKSIZE;
|
|
|
|
zfsvfs->z_show_ctldir = ZFS_SNAPDIR_VISIBLE;
|
|
|
|
zfsvfs->z_os = os;
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2017-03-08 03:21:37 +03:00
|
|
|
error = zfs_get_zplprop(os, ZFS_PROP_VERSION, &zfsvfs->z_version);
|
2016-05-11 06:49:02 +03:00
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
if (zfsvfs->z_version >
|
|
|
|
zfs_zpl_version_map(spa_version(dmu_objset_spa(os)))) {
|
|
|
|
(void) printk("Can't mount a version %lld file system "
|
|
|
|
"on a version %lld pool.\nPool must be upgraded to mount "
|
2019-05-09 02:43:55 +03:00
|
|
|
"this file system.\n", (u_longlong_t)zfsvfs->z_version,
|
2016-05-11 06:49:02 +03:00
|
|
|
(u_longlong_t)spa_version(dmu_objset_spa(os)));
|
|
|
|
return (SET_ERROR(ENOTSUP));
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
2016-05-11 06:49:02 +03:00
|
|
|
error = zfs_get_zplprop(os, ZFS_PROP_NORMALIZE, &val);
|
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
zfsvfs->z_norm = (int)val;
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2016-05-11 06:49:02 +03:00
|
|
|
error = zfs_get_zplprop(os, ZFS_PROP_UTF8ONLY, &val);
|
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
zfsvfs->z_utf8 = (val != 0);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2016-05-11 06:49:02 +03:00
|
|
|
error = zfs_get_zplprop(os, ZFS_PROP_CASE, &val);
|
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
zfsvfs->z_case = (uint_t)val;
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2016-05-11 06:49:02 +03:00
|
|
|
if ((error = zfs_get_zplprop(os, ZFS_PROP_ACLTYPE, &val)) != 0)
|
|
|
|
return (error);
|
|
|
|
zfsvfs->z_acl_type = (uint_t)val;
|
2013-10-28 20:22:15 +04:00
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
/*
|
|
|
|
* Fold case on file systems that are always or sometimes case
|
|
|
|
* insensitive.
|
|
|
|
*/
|
2017-03-08 03:21:37 +03:00
|
|
|
if (zfsvfs->z_case == ZFS_CASE_INSENSITIVE ||
|
|
|
|
zfsvfs->z_case == ZFS_CASE_MIXED)
|
|
|
|
zfsvfs->z_norm |= U8_TEXTPREP_TOUPPER;
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2017-03-08 03:21:37 +03:00
|
|
|
zfsvfs->z_use_fuids = USE_FUIDS(zfsvfs->z_version, zfsvfs->z_os);
|
|
|
|
zfsvfs->z_use_sa = USE_SA(zfsvfs->z_version, zfsvfs->z_os);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2016-05-11 06:49:02 +03:00
|
|
|
uint64_t sa_obj = 0;
|
2017-03-08 03:21:37 +03:00
|
|
|
if (zfsvfs->z_use_sa) {
|
2010-05-29 00:45:14 +04:00
|
|
|
/* should either have both of these objects or none */
|
|
|
|
error = zap_lookup(os, MASTER_NODE_OBJ, ZFS_SA_ATTRS, 8, 1,
|
|
|
|
&sa_obj);
|
2016-05-11 06:49:02 +03:00
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
2011-10-25 03:55:20 +04:00
|
|
|
|
2016-05-11 06:49:02 +03:00
|
|
|
error = zfs_get_zplprop(os, ZFS_PROP_XATTR, &val);
|
|
|
|
if ((error == 0) && (val == ZFS_XATTR_SA))
|
2017-03-08 03:21:37 +03:00
|
|
|
zfsvfs->z_xattr_sa = B_TRUE;
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
error = zap_lookup(os, MASTER_NODE_OBJ, ZFS_ROOT_OBJ, 8, 1,
|
2017-03-08 03:21:37 +03:00
|
|
|
&zfsvfs->z_root);
|
2016-05-11 06:49:02 +03:00
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
2017-03-08 03:21:37 +03:00
|
|
|
ASSERT(zfsvfs->z_root != 0);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
|
|
|
error = zap_lookup(os, MASTER_NODE_OBJ, ZFS_UNLINKED_SET, 8, 1,
|
2017-03-08 03:21:37 +03:00
|
|
|
&zfsvfs->z_unlinkedobj);
|
2016-05-11 06:49:02 +03:00
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
|
|
|
error = zap_lookup(os, MASTER_NODE_OBJ,
|
|
|
|
zfs_userquota_prop_prefixes[ZFS_PROP_USERQUOTA],
|
2017-03-08 03:21:37 +03:00
|
|
|
8, 1, &zfsvfs->z_userquota_obj);
|
2016-05-11 06:49:02 +03:00
|
|
|
if (error == ENOENT)
|
|
|
|
zfsvfs->z_userquota_obj = 0;
|
|
|
|
else if (error != 0)
|
|
|
|
return (error);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
|
|
|
error = zap_lookup(os, MASTER_NODE_OBJ,
|
|
|
|
zfs_userquota_prop_prefixes[ZFS_PROP_GROUPQUOTA],
|
2017-03-08 03:21:37 +03:00
|
|
|
8, 1, &zfsvfs->z_groupquota_obj);
|
2016-05-11 06:49:02 +03:00
|
|
|
if (error == ENOENT)
|
|
|
|
zfsvfs->z_groupquota_obj = 0;
|
|
|
|
else if (error != 0)
|
|
|
|
return (error);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2018-02-14 01:54:54 +03:00
|
|
|
error = zap_lookup(os, MASTER_NODE_OBJ,
|
|
|
|
zfs_userquota_prop_prefixes[ZFS_PROP_PROJECTQUOTA],
|
|
|
|
8, 1, &zfsvfs->z_projectquota_obj);
|
|
|
|
if (error == ENOENT)
|
|
|
|
zfsvfs->z_projectquota_obj = 0;
|
|
|
|
else if (error != 0)
|
|
|
|
return (error);
|
|
|
|
|
2016-10-04 21:46:10 +03:00
|
|
|
error = zap_lookup(os, MASTER_NODE_OBJ,
|
|
|
|
zfs_userquota_prop_prefixes[ZFS_PROP_USEROBJQUOTA],
|
2017-03-08 03:21:37 +03:00
|
|
|
8, 1, &zfsvfs->z_userobjquota_obj);
|
2016-05-11 06:49:02 +03:00
|
|
|
if (error == ENOENT)
|
|
|
|
zfsvfs->z_userobjquota_obj = 0;
|
|
|
|
else if (error != 0)
|
|
|
|
return (error);
|
2016-10-04 21:46:10 +03:00
|
|
|
|
|
|
|
error = zap_lookup(os, MASTER_NODE_OBJ,
|
|
|
|
zfs_userquota_prop_prefixes[ZFS_PROP_GROUPOBJQUOTA],
|
2017-03-08 03:21:37 +03:00
|
|
|
8, 1, &zfsvfs->z_groupobjquota_obj);
|
2016-05-11 06:49:02 +03:00
|
|
|
if (error == ENOENT)
|
|
|
|
zfsvfs->z_groupobjquota_obj = 0;
|
|
|
|
else if (error != 0)
|
|
|
|
return (error);
|
2016-10-04 21:46:10 +03:00
|
|
|
|
2018-02-14 01:54:54 +03:00
|
|
|
error = zap_lookup(os, MASTER_NODE_OBJ,
|
|
|
|
zfs_userquota_prop_prefixes[ZFS_PROP_PROJECTOBJQUOTA],
|
|
|
|
8, 1, &zfsvfs->z_projectobjquota_obj);
|
|
|
|
if (error == ENOENT)
|
|
|
|
zfsvfs->z_projectobjquota_obj = 0;
|
|
|
|
else if (error != 0)
|
|
|
|
return (error);
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
error = zap_lookup(os, MASTER_NODE_OBJ, ZFS_FUID_TABLES, 8, 1,
|
2017-03-08 03:21:37 +03:00
|
|
|
&zfsvfs->z_fuid_obj);
|
2016-05-11 06:49:02 +03:00
|
|
|
if (error == ENOENT)
|
|
|
|
zfsvfs->z_fuid_obj = 0;
|
|
|
|
else if (error != 0)
|
|
|
|
return (error);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
|
|
|
error = zap_lookup(os, MASTER_NODE_OBJ, ZFS_SHARES_DIR, 8, 1,
|
2017-03-08 03:21:37 +03:00
|
|
|
&zfsvfs->z_shares_dir);
|
2016-05-11 06:49:02 +03:00
|
|
|
if (error == ENOENT)
|
|
|
|
zfsvfs->z_shares_dir = 0;
|
|
|
|
else if (error != 0)
|
|
|
|
return (error);
|
|
|
|
|
2019-03-15 04:14:36 +03:00
|
|
|
error = sa_setup(os, sa_obj, zfs_attr_table, ZPL_END,
|
|
|
|
&zfsvfs->z_attr_table);
|
|
|
|
if (error != 0)
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
if (zfsvfs->z_version >= ZPL_VERSION_SA)
|
|
|
|
sa_register_update_callback(os, zfs_sa_upgrade);
|
|
|
|
|
2016-05-11 06:49:02 +03:00
|
|
|
return (0);
|
|
|
|
}
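/*
 * Illustrative sketch, not part of this file: zfsvfs_init() above treats
 * several master-node entries (quota objects, FUID tables, shares dir) as
 * optional, mapping ENOENT to object number 0 and passing any other error
 * up. The table type and both helpers below are hypothetical stand-ins for
 * the ZAP and zap_lookup().
 */
```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical key/value table standing in for the master node ZAP. */
typedef struct example_entry {
	const char	*ee_name;
	uint64_t	ee_value;
} example_entry_t;

/* Returns 0 and fills *out on a hit, ENOENT when the name is absent. */
static int
example_lookup(const example_entry_t *tbl, size_t n, const char *name,
    uint64_t *out)
{
	assert(out != NULL);
	for (size_t i = 0; i < n; i++) {
		if (strcmp(tbl[i].ee_name, name) == 0) {
			*out = tbl[i].ee_value;
			return (0);
		}
	}
	return (ENOENT);
}

/*
 * Optional lookup: a missing entry is not an error, it just means
 * "object number 0". Any other error is passed back to the caller.
 */
static int
example_lookup_optional(const example_entry_t *tbl, size_t n,
    const char *name, uint64_t *out)
{
	int error = example_lookup(tbl, n, name, out);

	if (error == ENOENT) {
		*out = 0;
		return (0);
	}
	return (error);
}
```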
|
|
|
|
|
|
|
|
int
|
2018-02-21 03:27:31 +03:00
|
|
|
zfsvfs_create(const char *osname, boolean_t readonly, zfsvfs_t **zfvp)
|
2016-05-11 06:49:02 +03:00
|
|
|
{
|
|
|
|
objset_t *os;
|
|
|
|
zfsvfs_t *zfsvfs;
|
|
|
|
int error;
|
2018-02-21 03:27:31 +03:00
|
|
|
boolean_t ro = (readonly || (strchr(osname, '@') != NULL));
|
2016-05-11 06:49:02 +03:00
|
|
|
|
|
|
|
zfsvfs = kmem_zalloc(sizeof (zfsvfs_t), KM_SLEEP);
|
|
|
|
|
2018-02-21 03:27:31 +03:00
|
|
|
error = dmu_objset_own(osname, DMU_OST_ZFS, ro, B_TRUE, zfsvfs, &os);
|
2018-02-08 19:16:23 +03:00
|
|
|
if (error != 0) {
|
2016-05-11 06:49:02 +03:00
|
|
|
kmem_free(zfsvfs, sizeof (zfsvfs_t));
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
2018-02-08 19:16:23 +03:00
|
|
|
error = zfsvfs_create_impl(zfvp, zfsvfs, os);
|
2022-09-20 03:30:58 +03:00
|
|
|
|
2018-02-08 19:16:23 +03:00
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
2019-03-15 04:14:36 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Note: zfsvfs is assumed to be malloc'd, and will be freed by this function
|
|
|
|
* on a failure. Do not pass in a statically allocated zfsvfs.
|
|
|
|
*/
|
2018-02-08 19:16:23 +03:00
|
|
|
int
|
|
|
|
zfsvfs_create_impl(zfsvfs_t **zfvp, zfsvfs_t *zfsvfs, objset_t *os)
|
|
|
|
{
|
|
|
|
int error;
|
|
|
|
|
2016-05-11 06:49:02 +03:00
|
|
|
zfsvfs->z_vfs = NULL;
|
|
|
|
zfsvfs->z_sb = NULL;
|
|
|
|
zfsvfs->z_parent = zfsvfs;
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2017-03-08 03:21:37 +03:00
|
|
|
mutex_init(&zfsvfs->z_znodes_lock, NULL, MUTEX_DEFAULT, NULL);
|
|
|
|
mutex_init(&zfsvfs->z_lock, NULL, MUTEX_DEFAULT, NULL);
|
|
|
|
list_create(&zfsvfs->z_all_znodes, sizeof (znode_t),
|
2009-07-03 02:44:48 +04:00
|
|
|
offsetof(znode_t, z_link_node));
|
2020-11-05 01:23:48 +03:00
|
|
|
ZFS_TEARDOWN_INIT(zfsvfs);
|
2017-03-08 03:21:37 +03:00
|
|
|
rw_init(&zfsvfs->z_teardown_inactive_lock, NULL, RW_DEFAULT, NULL);
|
|
|
|
rw_init(&zfsvfs->z_fuid_lock, NULL, RW_DEFAULT, NULL);
|
2015-03-16 04:21:21 +03:00
|
|
|
|
2016-05-11 06:49:02 +03:00
|
|
|
int size = MIN(1 << (highbit64(zfs_object_mutex_size) - 1),
|
|
|
|
ZFS_OBJ_MTX_MAX);
|
2017-03-08 03:21:37 +03:00
|
|
|
zfsvfs->z_hold_size = size;
|
|
|
|
zfsvfs->z_hold_trees = vmem_zalloc(sizeof (avl_tree_t) * size,
|
|
|
|
KM_SLEEP);
|
|
|
|
zfsvfs->z_hold_locks = vmem_zalloc(sizeof (kmutex_t) * size, KM_SLEEP);
|
2016-05-11 06:49:02 +03:00
|
|
|
for (int i = 0; i != size; i++) {
|
2017-03-08 03:21:37 +03:00
|
|
|
avl_create(&zfsvfs->z_hold_trees[i], zfs_znode_hold_compare,
|
2015-12-23 00:47:38 +03:00
|
|
|
sizeof (znode_hold_t), offsetof(znode_hold_t, zh_node));
|
2017-03-08 03:21:37 +03:00
|
|
|
mutex_init(&zfsvfs->z_hold_locks[i], NULL, MUTEX_DEFAULT, NULL);
|
2015-12-23 00:47:38 +03:00
|
|
|
}
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2016-05-11 06:49:02 +03:00
|
|
|
error = zfsvfs_init(zfsvfs, os);
|
|
|
|
if (error != 0) {
|
2022-09-20 03:30:58 +03:00
|
|
|
dmu_objset_disown(os, B_TRUE, zfsvfs);
|
2016-05-11 06:49:02 +03:00
|
|
|
*zfvp = NULL;
|
2019-03-15 04:14:36 +03:00
|
|
|
zfsvfs_free(zfsvfs);
|
2016-05-11 06:49:02 +03:00
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
2019-02-12 21:41:15 +03:00
|
|
|
zfsvfs->z_drain_task = TASKQID_INVALID;
|
|
|
|
zfsvfs->z_draining = B_FALSE;
|
|
|
|
zfsvfs->z_drain_cancel = B_TRUE;
|
|
|
|
|
2017-03-08 03:21:37 +03:00
|
|
|
*zfvp = zfsvfs;
|
2009-07-03 02:44:48 +04:00
|
|
|
return (0);
|
|
|
|
}
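/*
 * Illustrative sketch, not part of this file: how the hold-array size in
 * zfsvfs_create_impl() above is derived. The tunable is rounded down to a
 * power of two and then capped. example_highbit64() is a hypothetical
 * user-space stand-in and assumes highbit64() returns the 1-based index of
 * the highest set bit (0 when no bit is set).
 */
```c
#include <assert.h>
#include <stdint.h>

/* 1-based index of the highest set bit, or 0 when no bit is set. */
static int
example_highbit64(uint64_t v)
{
	int h = 0;

	while (v != 0) {
		h++;
		v >>= 1;
	}
	return (h);
}

/* Round "requested" down to a power of two and cap it at "max". */
static uint64_t
example_hold_size(uint64_t requested, uint64_t max)
{
	uint64_t pow2;

	/* A zero request would shift by -1, so reject it up front. */
	assert(requested != 0);
	pow2 = 1ULL << (example_highbit64(requested) - 1);
	return (pow2 < max ? pow2 : max);
}
```
/*
 * A power-of-two size lets a bucket be picked with a mask rather than a
 * modulo; that motivation is an assumption of this note, not stated above.
 */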
|
|
|
|
|
2016-05-11 06:49:02 +03:00
|
|
|
static int
|
2017-03-09 01:56:19 +03:00
|
|
|
zfsvfs_setup(zfsvfs_t *zfsvfs, boolean_t mounting)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
int error;
|
2017-11-08 22:12:59 +03:00
|
|
|
boolean_t readonly = zfs_is_readonly(zfsvfs);
|
|
|
|
|
2017-03-09 03:56:09 +03:00
|
|
|
error = zfs_register_callbacks(zfsvfs->z_vfs);
|
2008-11-20 23:01:55 +03:00
|
|
|
if (error)
|
|
|
|
return (error);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If we are not mounting (i.e., online recv), then we don't
|
|
|
|
* have to worry about replaying the log as we blocked all
|
|
|
|
* operations out since we closed the ZIL.
|
|
|
|
*/
|
|
|
|
if (mounting) {
|
2019-02-12 21:41:15 +03:00
|
|
|
ASSERT3P(zfsvfs->z_kstat.dk_kstats, ==, NULL);
|
2022-07-21 03:14:06 +03:00
|
|
|
error = dataset_kstats_create(&zfsvfs->z_kstat, zfsvfs->z_os);
|
|
|
|
if (error)
|
|
|
|
return (error);
|
|
|
|
zfsvfs->z_log = zil_open(zfsvfs->z_os, zfs_get_data,
|
|
|
|
&zfsvfs->z_kstat.dk_zil_sums);
|
2019-02-12 21:41:15 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* During replay we remove the read only flag to
|
|
|
|
* allow replays to succeed.
|
|
|
|
*/
|
2019-02-12 21:41:15 +03:00
|
|
|
if (readonly != 0) {
|
2017-03-08 03:21:37 +03:00
|
|
|
readonly_changed_cb(zfsvfs, B_FALSE);
|
2019-02-12 21:41:15 +03:00
|
|
|
} else {
|
|
|
|
zap_stats_t zs;
|
|
|
|
if (zap_get_stats(zfsvfs->z_os, zfsvfs->z_unlinkedobj,
|
|
|
|
&zs) == 0) {
|
|
|
|
dataset_kstats_update_nunlinks_kstat(
|
|
|
|
&zfsvfs->z_kstat, zs.zs_num_entries);
|
2020-06-04 01:18:07 +03:00
|
|
|
dprintf_ds(zfsvfs->z_os->os_dsl_dataset,
|
|
|
|
"num_entries in unlinked set: %llu",
|
|
|
|
zs.zs_num_entries);
|
2019-02-12 21:41:15 +03:00
|
|
|
}
|
2017-03-08 03:21:37 +03:00
|
|
|
zfs_unlinked_drain(zfsvfs);
|
2020-04-01 20:02:06 +03:00
|
|
|
dsl_dir_t *dd = zfsvfs->z_os->os_dsl_dataset->ds_dir;
|
|
|
|
dd->dd_activity_cancelled = B_FALSE;
|
2019-02-12 21:41:15 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
|
|
|
* Parse and replay the intent log.
|
|
|
|
*
|
|
|
|
* Because of ziltest, this must be done after
|
|
|
|
* zfs_unlinked_drain(). (Further note: ziltest
|
|
|
|
* doesn't use readonly mounts, where
|
|
|
|
* zfs_unlinked_drain() isn't called.) This is because
|
|
|
|
* ziltest causes spa_sync() to think it's committed,
|
|
|
|
* but actually it is not, so the intent log contains
|
|
|
|
* many txg's worth of changes.
|
|
|
|
*
|
|
|
|
* In particular, if object N is in the unlinked set in
|
|
|
|
* the last txg to actually sync, then it could be
|
|
|
|
* actually freed in a later txg and then reallocated
|
|
|
|
* in a yet later txg. This would write a "create
|
|
|
|
* object N" record to the intent log. Normally, this
|
|
|
|
* would be fine because the spa_sync() would have
|
|
|
|
* written out the fact that object N is free, before
|
|
|
|
* we could write the "create object N" intent log
|
|
|
|
* record.
|
|
|
|
*
|
|
|
|
* But when we are in ziltest mode, we advance the "open
|
|
|
|
* txg" without actually spa_sync()-ing the changes to
|
|
|
|
* disk. So we would see that object N is still
|
|
|
|
* allocated and in the unlinked set, and there is an
|
|
|
|
* intent log record saying to allocate it.
|
|
|
|
*/
|
2017-03-08 03:21:37 +03:00
|
|
|
if (spa_writeable(dmu_objset_spa(zfsvfs->z_os))) {
|
2010-08-27 01:24:34 +04:00
|
|
|
if (zil_replay_disable) {
|
2017-03-08 03:21:37 +03:00
|
|
|
zil_destroy(zfsvfs->z_log, B_FALSE);
|
2010-08-27 01:24:34 +04:00
|
|
|
} else {
|
2017-03-08 03:21:37 +03:00
|
|
|
zfsvfs->z_replay = B_TRUE;
|
|
|
|
zil_replay(zfsvfs->z_os, zfsvfs,
|
2010-08-27 01:24:34 +04:00
|
|
|
zfs_replay_vector);
|
2017-03-08 03:21:37 +03:00
|
|
|
zfsvfs->z_replay = B_FALSE;
|
2010-08-27 01:24:34 +04:00
|
|
|
}
|
2009-01-16 00:59:39 +03:00
|
|
|
}
|
2011-05-19 22:44:07 +04:00
|
|
|
|
|
|
|
/* restore readonly bit */
|
|
|
|
if (readonly != 0)
|
2017-03-08 03:21:37 +03:00
|
|
|
readonly_changed_cb(zfsvfs, B_TRUE);
|
2022-07-21 03:14:06 +03:00
|
|
|
} else {
|
|
|
|
ASSERT3P(zfsvfs->z_kstat.dk_kstats, !=, NULL);
|
|
|
|
zfsvfs->z_log = zil_open(zfsvfs->z_os, zfs_get_data,
|
|
|
|
&zfsvfs->z_kstat.dk_zil_sums);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2017-01-14 02:29:32 +03:00
|
|
|
/*
|
2017-03-08 03:21:37 +03:00
|
|
|
* Set the objset user_ptr to track its zfsvfs.
|
2017-01-14 02:29:32 +03:00
|
|
|
*/
|
2017-03-08 03:21:37 +03:00
|
|
|
mutex_enter(&zfsvfs->z_os->os_user_ptr_lock);
|
|
|
|
dmu_objset_set_user(zfsvfs->z_os, zfsvfs);
|
|
|
|
mutex_exit(&zfsvfs->z_os->os_user_ptr_lock);
|
2017-01-14 02:29:32 +03:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
void
|
2017-03-09 01:56:19 +03:00
|
|
|
zfsvfs_free(zfsvfs_t *zfsvfs)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2017-03-08 03:21:37 +03:00
|
|
|
int i, size = zfsvfs->z_hold_size;
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2017-03-08 03:21:37 +03:00
|
|
|
zfs_fuid_destroy(zfsvfs);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2017-03-08 03:21:37 +03:00
|
|
|
mutex_destroy(&zfsvfs->z_znodes_lock);
|
|
|
|
mutex_destroy(&zfsvfs->z_lock);
|
|
|
|
list_destroy(&zfsvfs->z_all_znodes);
|
2020-11-05 01:23:48 +03:00
|
|
|
ZFS_TEARDOWN_DESTROY(zfsvfs);
|
2017-03-08 03:21:37 +03:00
|
|
|
rw_destroy(&zfsvfs->z_teardown_inactive_lock);
|
|
|
|
rw_destroy(&zfsvfs->z_fuid_lock);
|
2015-12-23 00:47:38 +03:00
|
|
|
for (i = 0; i != size; i++) {
|
2017-03-08 03:21:37 +03:00
|
|
|
avl_destroy(&zfsvfs->z_hold_trees[i]);
|
|
|
|
mutex_destroy(&zfsvfs->z_hold_locks[i]);
|
2015-12-23 00:47:38 +03:00
|
|
|
}
|
2017-03-08 03:21:37 +03:00
|
|
|
vmem_free(zfsvfs->z_hold_trees, sizeof (avl_tree_t) * size);
|
|
|
|
vmem_free(zfsvfs->z_hold_locks, sizeof (kmutex_t) * size);
|
2017-03-09 03:56:09 +03:00
|
|
|
zfsvfs_vfs_free(zfsvfs->z_vfs);
|
2018-08-20 19:52:37 +03:00
|
|
|
dataset_kstats_destroy(&zfsvfs->z_kstat);
|
2017-03-08 03:21:37 +03:00
|
|
|
kmem_free(zfsvfs, sizeof (zfsvfs_t));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
static void
|
2017-03-08 03:21:37 +03:00
|
|
|
zfs_set_fuid_feature(zfsvfs_t *zfsvfs)
|
2009-07-03 02:44:48 +04:00
|
|
|
{
|
2017-03-08 03:21:37 +03:00
|
|
|
zfsvfs->z_use_fuids = USE_FUIDS(zfsvfs->z_version, zfsvfs->z_os);
|
|
|
|
zfsvfs->z_use_sa = USE_SA(zfsvfs->z_version, zfsvfs->z_os);
|
2009-07-03 02:44:48 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2020-06-15 21:30:37 +03:00
|
|
|
static void
|
2017-03-08 03:21:37 +03:00
|
|
|
zfs_unregister_callbacks(zfsvfs_t *zfsvfs)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2017-03-08 03:21:37 +03:00
|
|
|
objset_t *os = zfsvfs->z_os;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2015-11-05 02:00:58 +03:00
|
|
|
if (!dmu_objset_is_snapshot(os))
|
2017-03-08 03:21:37 +03:00
|
|
|
dsl_prop_unregister_all(dmu_objset_ds(os), zfsvfs);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2011-01-07 03:47:31 +03:00
|
|
|
#ifdef HAVE_MLSLABEL
|
2010-05-29 00:45:14 +04:00
|
|
|
/*
|
2013-06-11 21:12:34 +04:00
|
|
|
* Check that the hex label string is appropriate for the dataset being
|
|
|
|
* mounted into the global_zone proper.
|
2010-05-29 00:45:14 +04:00
|
|
|
*
|
2013-06-11 21:12:34 +04:00
|
|
|
* Return an error if the hex label string is not default or
|
|
|
|
* admin_low/admin_high. For admin_low labels, the corresponding
|
|
|
|
* dataset must be readonly.
|
2010-05-29 00:45:14 +04:00
|
|
|
*/
|
|
|
|
int
|
|
|
|
zfs_check_global_label(const char *dsname, const char *hexsl)
|
|
|
|
{
|
|
|
|
if (strcasecmp(hexsl, ZFS_MLSLABEL_DEFAULT) == 0)
|
|
|
|
return (0);
|
|
|
|
if (strcasecmp(hexsl, ADMIN_HIGH) == 0)
|
|
|
|
return (0);
|
|
|
|
if (strcasecmp(hexsl, ADMIN_LOW) == 0) {
|
|
|
|
/* must be readonly */
|
|
|
|
uint64_t rdonly;
|
|
|
|
|
|
|
|
if (dsl_prop_get_integer(dsname,
|
|
|
|
zfs_prop_to_name(ZFS_PROP_READONLY), &rdonly, NULL))
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(EACCES));
|
2020-02-27 03:09:17 +03:00
|
|
|
return (rdonly ? 0 : SET_ERROR(EACCES));
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(EACCES));
|
2010-05-29 00:45:14 +04:00
|
|
|
}
|
2011-01-07 03:47:31 +03:00
|
|
|
#endif /* HAVE_MLSLABEL */
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2018-02-14 01:54:54 +03:00
|
|
|
static int
|
|
|
|
zfs_statfs_project(zfsvfs_t *zfsvfs, znode_t *zp, struct kstatfs *statp,
|
|
|
|
uint32_t bshift)
|
|
|
|
{
|
|
|
|
char buf[20 + DMU_OBJACCT_PREFIX_LEN];
|
|
|
|
uint64_t offset = DMU_OBJACCT_PREFIX_LEN;
|
|
|
|
uint64_t quota;
|
|
|
|
uint64_t used;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
strlcpy(buf, DMU_OBJACCT_PREFIX, DMU_OBJACCT_PREFIX_LEN + 1);
|
2019-12-11 23:12:08 +03:00
|
|
|
err = zfs_id_to_fuidstr(zfsvfs, NULL, zp->z_projid, buf + offset,
|
2020-06-07 21:42:12 +03:00
|
|
|
sizeof (buf) - offset, B_FALSE);
|
2018-02-14 01:54:54 +03:00
|
|
|
if (err)
|
|
|
|
return (err);
|
|
|
|
|
|
|
|
if (zfsvfs->z_projectquota_obj == 0)
|
|
|
|
goto objs;
|
|
|
|
|
|
|
|
err = zap_lookup(zfsvfs->z_os, zfsvfs->z_projectquota_obj,
|
|
|
|
buf + offset, 8, 1, &quota);
|
|
|
|
if (err == ENOENT)
|
|
|
|
goto objs;
|
|
|
|
else if (err)
|
|
|
|
return (err);
|
|
|
|
|
|
|
|
err = zap_lookup(zfsvfs->z_os, DMU_PROJECTUSED_OBJECT,
|
|
|
|
buf + offset, 8, 1, &used);
|
|
|
|
if (unlikely(err == ENOENT)) {
|
|
|
|
uint32_t blksize;
|
|
|
|
u_longlong_t nblocks;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Quota accounting is async, so this race is possible.
|
|
|
|
* There is at least one object with the given project ID.
|
|
|
|
*/
|
|
|
|
sa_object_size(zp->z_sa_hdl, &blksize, &nblocks);
|
|
|
|
if (unlikely(zp->z_blksz == 0))
|
|
|
|
blksize = zfsvfs->z_max_blksz;
|
|
|
|
|
|
|
|
used = blksize * nblocks;
|
|
|
|
} else if (err) {
|
|
|
|
return (err);
|
|
|
|
}
|
|
|
|
|
|
|
|
statp->f_blocks = quota >> bshift;
|
|
|
|
statp->f_bfree = (quota > used) ? ((quota - used) >> bshift) : 0;
|
|
|
|
statp->f_bavail = statp->f_bfree;
|
|
|
|
|
|
|
|
objs:
|
|
|
|
if (zfsvfs->z_projectobjquota_obj == 0)
|
|
|
|
return (0);
|
|
|
|
|
|
|
|
err = zap_lookup(zfsvfs->z_os, zfsvfs->z_projectobjquota_obj,
|
|
|
|
buf + offset, 8, 1, &quota);
|
|
|
|
if (err == ENOENT)
|
|
|
|
return (0);
|
|
|
|
else if (err)
|
|
|
|
return (err);
|
|
|
|
|
|
|
|
err = zap_lookup(zfsvfs->z_os, DMU_PROJECTUSED_OBJECT,
|
|
|
|
buf, 8, 1, &used);
|
|
|
|
if (unlikely(err == ENOENT)) {
|
|
|
|
/*
|
|
|
|
* Quota accounting is async, so this race is possible.
|
|
|
|
* There is at least one object with the given project ID.
|
|
|
|
*/
|
|
|
|
used = 1;
|
|
|
|
} else if (err) {
|
|
|
|
return (err);
|
|
|
|
}
|
|
|
|
|
|
|
|
statp->f_files = quota;
|
|
|
|
statp->f_ffree = (quota > used) ? (quota - used) : 0;
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
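/*
 * Illustrative sketch, not part of this file: the block arithmetic that
 * zfs_statfs_project() above applies to a project quota. Byte counts are
 * converted to blocks by shifting, and free space is clamped at zero once
 * usage exceeds the quota. The type and function names are hypothetical.
 */
```c
#include <assert.h>
#include <stdint.h>

typedef struct example_space {
	uint64_t	es_blocks;	/* total blocks under the quota */
	uint64_t	es_bfree;	/* blocks still free under the quota */
} example_space_t;

static example_space_t
example_quota_to_blocks(uint64_t quota, uint64_t used, uint32_t bshift)
{
	example_space_t es;

	assert(bshift < 64);
	es.es_blocks = quota >> bshift;
	/* Clamp at zero when usage already exceeds the quota. */
	es.es_bfree = (quota > used) ? ((quota - used) >> bshift) : 0;
	return (es);
}
```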
|
|
|
|
|
2010-12-17 22:18:08 +03:00
|
|
|
int
|
linux: add basic fallocate(mode=0/2) compatibility
Implement semi-compatible functionality for mode=0 (preallocation)
and mode=FALLOC_FL_KEEP_SIZE (preallocation beyond EOF) for ZPL.
Since ZFS does COW and snapshots, preallocating blocks for a file
cannot guarantee that writes to the file will not run out of space.
Even if the first overwrite was guaranteed, it would not handle any
later overwrite of blocks due to COW, so strict compliance is futile.
Instead, make a best-effort check that at least enough free space is
currently available in the pool (with a bit of margin), then create
a sparse file of the requested size and continue on with life.
This does not handle all cases (e.g. several fallocate() calls before
writing into the files when the filesystem is nearly full), which
would require a more complex mechanism to be implemented, probably
based on a modified version of dmu_prealloc(), but is usable as-is.
A new module option zfs_fallocate_reserve_percent is used to control
the reserve margin for any single fallocate call. By default, this
is 110% of the requested preallocation size, so an additional 10% of
available space is reserved for overhead to allow the application a
good chance of finishing the write when the fallocate() succeeds.
If the heuristics of this basic fallocate implementation are not
desirable, the old non-functional behavior of returning EOPNOTSUPP
for calls can be restored by setting zfs_fallocate_reserve_percent=0.
The parameter of zfs_statvfs() is changed to take an inode instead
of a dentry, since no dentry is available in zfs_fallocate_common().
A few tests from @behlendorf cover basic fallocate functionality.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andreas Dilger <adilger@dilger.ca>
Issue #326
Closes #10408
2020-06-18 21:22:11 +03:00
|
|
|
zfs_statvfs(struct inode *ip, struct kstatfs *statp)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2020-06-18 21:22:11 +03:00
|
|
|
zfsvfs_t *zfsvfs = ITOZSB(ip);
|
2008-11-20 23:01:55 +03:00
|
|
|
uint64_t refdbytes, availbytes, usedobjs, availobjs;
|
2018-02-14 01:54:54 +03:00
|
|
|
int err = 0;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2022-09-16 23:36:47 +03:00
|
|
|
if ((err = zfs_enter(zfsvfs, FTAG)) != 0)
|
|
|
|
return (err);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-03-08 03:21:37 +03:00
|
|
|
dmu_objset_space(zfsvfs->z_os,
|
2008-11-20 23:01:55 +03:00
|
|
|
&refdbytes, &availbytes, &usedobjs, &availobjs);
|
|
|
|
|
2018-09-25 03:11:25 +03:00
|
|
|
uint64_t fsid = dmu_objset_fsid_guid(zfsvfs->z_os);
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2011-02-24 00:57:50 +03:00
|
|
|
* The underlying storage pool actually uses multiple block
|
|
|
|
* sizes. Under Solaris frsize (fragment size) is reported as
|
|
|
|
* the smallest block size we support, and bsize (block size)
|
|
|
|
* as the filesystem's maximum block size. Unfortunately,
|
|
|
|
* under Linux the fragment size and block size are often used
|
|
|
|
* interchangeably. Thus we are forced to report both of them
|
|
|
|
* as the filesystem's maximum block size.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2017-03-08 03:21:37 +03:00
|
|
|
statp->f_frsize = zfsvfs->z_max_blksz;
|
|
|
|
statp->f_bsize = zfsvfs->z_max_blksz;
|
2018-09-25 03:11:25 +03:00
|
|
|
uint32_t bshift = fls(statp->f_bsize) - 1;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2011-02-08 22:16:06 +03:00
|
|
|
* The following report "total" blocks of various kinds in
|
|
|
|
* the file system, but reported in terms of f_bsize - the
|
|
|
|
* "preferred" size.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
|
2019-09-03 03:56:41 +03:00
|
|
|
/* Round up so we never have a filesystem using 0 blocks. */
|
2019-01-13 21:06:13 +03:00
|
|
|
refdbytes = P2ROUNDUP(refdbytes, statp->f_bsize);
|
2011-02-08 22:16:06 +03:00
|
|
|
statp->f_blocks = (refdbytes + availbytes) >> bshift;
|
|
|
|
statp->f_bfree = availbytes >> bshift;
|
2008-11-20 23:01:55 +03:00
|
|
|
statp->f_bavail = statp->f_bfree; /* no root reservation */
|
|
|
|
|
|
|
|
/*
|
|
|
|
* statvfs() should really be called statufs(), because it assumes
|
|
|
|
* static metadata. ZFS doesn't preallocate files, so the best
|
|
|
|
* we can do is report the max that could possibly fit in f_files,
|
|
|
|
* and that minus the number actually used in f_ffree.
|
2018-09-25 03:11:25 +03:00
|
|
|
* For f_ffree, report the smaller of the number of objects available
|
2008-11-20 23:01:55 +03:00
|
|
|
* and the number of blocks (each object will take at least a block).
|
|
|
|
*/
|
2011-09-16 13:22:00 +04:00
|
|
|
statp->f_ffree = MIN(availobjs, availbytes >> DNODE_SHIFT);
|
2008-11-20 23:01:55 +03:00
|
|
|
statp->f_files = statp->f_ffree + usedobjs;
|
2012-08-24 16:38:55 +04:00
|
|
|
statp->f_fsid.val[0] = (uint32_t)fsid;
|
|
|
|
statp->f_fsid.val[1] = (uint32_t)(fsid >> 32);
|
2011-02-08 22:16:06 +03:00
|
|
|
statp->f_type = ZFS_SUPER_MAGIC;
|
2016-06-16 00:28:36 +03:00
|
|
|
statp->f_namelen = MAXNAMELEN - 1;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2011-02-08 22:16:06 +03:00
|
|
|
* We have all of 40 characters to stuff a string here.
|
2008-11-20 23:01:55 +03:00
|
|
|
* Is there anything useful we could/should provide?
|
|
|
|
*/
|
2022-02-25 16:26:54 +03:00
|
|
|
memset(statp->f_spare, 0, sizeof (statp->f_spare));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2018-02-14 01:54:54 +03:00
|
|
|
if (dmu_objset_projectquota_enabled(zfsvfs->z_os) &&
|
|
|
|
dmu_objset_projectquota_present(zfsvfs->z_os)) {
|
2020-06-18 21:22:11 +03:00
|
|
|
znode_t *zp = ITOZ(ip);
|
2018-02-14 01:54:54 +03:00
|
|
|
|
|
|
|
if (zp->z_pflags & ZFS_PROJINHERIT && zp->z_projid &&
|
|
|
|
zpl_is_valid_projid(zp->z_projid))
|
|
|
|
err = zfs_statfs_project(zfsvfs, zp, statp, bshift);
|
|
|
|
}
|
|
|
|
|
2022-09-16 23:36:47 +03:00
|
|
|
zfs_exit(zfsvfs, FTAG);
|
2018-02-14 01:54:54 +03:00
|
|
|
return (err);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
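/*
 * Illustrative sketch, not part of this file: the f_blocks arithmetic from
 * zfs_statvfs() above. The block shift is derived from the block size, the
 * referenced bytes are rounded up to a whole block so a non-empty dataset
 * never reports 0 blocks, and the sum is shifted down. example_fls() and
 * example_p2roundup() are hypothetical user-space stand-ins for the
 * kernel's fls() and the P2ROUNDUP() macro.
 */
```c
#include <assert.h>
#include <stdint.h>

/* 1-based index of the highest set bit, or 0 when no bit is set. */
static uint32_t
example_fls(uint32_t v)
{
	uint32_t n = 0;

	while (v != 0) {
		n++;
		v >>= 1;
	}
	return (n);
}

/* Round x up to a multiple of align, which must be a power of two. */
static uint64_t
example_p2roundup(uint64_t x, uint64_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return ((x + align - 1) & ~(align - 1));
}

/* Total block count reported for a dataset, in units of bsize. */
static uint64_t
example_total_blocks(uint64_t refdbytes, uint64_t availbytes, uint32_t bsize)
{
	uint32_t bshift = example_fls(bsize) - 1;

	refdbytes = example_p2roundup(refdbytes, bsize);
	return ((refdbytes + availbytes) >> bshift);
}
```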
|
|
|
|
|
2020-06-15 21:30:37 +03:00
|
|
|
static int
|
2017-03-08 03:21:37 +03:00
|
|
|
zfs_root(zfsvfs_t *zfsvfs, struct inode **ipp)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
znode_t *rootzp;
|
|
|
|
int error;
|
|
|
|
|
2022-09-16 23:36:47 +03:00
|
|
|
if ((error = zfs_enter(zfsvfs, FTAG)) != 0)
|
|
|
|
return (error);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-03-08 03:21:37 +03:00
|
|
|
error = zfs_zget(zfsvfs, zfsvfs->z_root, &rootzp);
|
2008-11-20 23:01:55 +03:00
|
|
|
if (error == 0)
|
2011-02-08 22:16:06 +03:00
|
|
|
*ipp = ZTOI(rootzp);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2022-09-16 23:36:47 +03:00
|
|
|
zfs_exit(zfsvfs, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
2015-03-18 01:07:47 +03:00
|
|
|
/*
|
|
|
|
* The ARC has requested that the filesystem drop entries from the dentry
|
|
|
|
* and inode caches. This can occur when the ARC needs to free metadata
|
|
|
|
* blocks but can't because they are all pinned by entries in these caches.
|
|
|
|
*/
|
2023-12-16 09:39:07 +03:00
|
|
|
#if defined(HAVE_SUPER_BLOCK_S_SHRINK)
|
|
|
|
#define S_SHRINK(sb) (&(sb)->s_shrink)
|
|
|
|
#elif defined(HAVE_SUPER_BLOCK_S_SHRINK_PTR)
|
|
|
|
#define S_SHRINK(sb) ((sb)->s_shrink)
|
|
|
|
#endif
|
|
|
|
|
2011-12-23 00:20:43 +04:00
|
|
|
int
|
2017-03-09 01:56:19 +03:00
|
|
|
zfs_prune(struct super_block *sb, unsigned long nr_to_scan, int *objects)
|
2011-12-23 00:20:43 +04:00
|
|
|
{
|
2017-03-08 03:21:37 +03:00
|
|
|
zfsvfs_t *zfsvfs = sb->s_fs_info;
|
2015-03-18 01:07:47 +03:00
|
|
|
int error = 0;
|
2023-12-16 09:39:07 +03:00
|
|
|
struct shrinker *shrinker = S_SHRINK(sb);
|
2011-12-23 00:20:43 +04:00
|
|
|
struct shrink_control sc = {
|
|
|
|
.nr_to_scan = nr_to_scan,
|
|
|
|
.gfp_mask = GFP_KERNEL,
|
|
|
|
};
|
|
|
|
|
2022-09-16 23:36:47 +03:00
|
|
|
if ((error = zfs_enter(zfsvfs, FTAG)) != 0)
|
|
|
|
return (error);
|
2015-03-18 01:07:47 +03:00
|
|
|
|
2024-08-04 03:34:18 +03:00
|
|
|
#ifdef SHRINKER_NUMA_AWARE
|
2023-12-16 09:39:07 +03:00
|
|
|
if (shrinker->flags & SHRINKER_NUMA_AWARE) {
|
2024-08-09 01:33:36 +03:00
|
|
|
long tc = 1;
|
|
|
|
for_each_online_node(sc.nid) {
|
|
|
|
long c = shrinker->count_objects(shrinker, &sc);
|
|
|
|
if (c == 0 || c == SHRINK_EMPTY)
|
|
|
|
continue;
|
|
|
|
tc += c;
|
|
|
|
}
|
2015-06-14 19:19:40 +03:00
|
|
|
*objects = 0;
|
2016-12-12 21:46:26 +03:00
|
|
|
for_each_online_node(sc.nid) {
|
2024-08-09 01:33:36 +03:00
|
|
|
long c = shrinker->count_objects(shrinker, &sc);
|
|
|
|
if (c == 0 || c == SHRINK_EMPTY)
|
|
|
|
continue;
|
|
|
|
if (c > tc)
|
|
|
|
tc = c;
|
|
|
|
sc.nr_to_scan = mult_frac(nr_to_scan, c, tc) + 1;
|
2015-06-14 19:19:40 +03:00
|
|
|
*objects += (*shrinker->scan_objects)(shrinker, &sc);
|
2016-12-12 21:46:26 +03:00
|
|
|
}
|
2015-06-14 19:19:40 +03:00
|
|
|
} else {
|
|
|
|
*objects = (*shrinker->scan_objects)(shrinker, &sc);
|
|
|
|
}
|
2024-08-04 03:11:03 +03:00
|
|
|
#else
|
2014-12-18 19:08:47 +03:00
|
|
|
*objects = (*shrinker->scan_objects)(shrinker, &sc);
|
2016-06-16 18:19:32 +03:00
|
|
|
#endif
|
|
|
|
|
2022-09-16 23:36:47 +03:00
|
|
|
zfs_exit(zfsvfs, FTAG);
|
2011-12-23 00:20:43 +04:00
|
|
|
|
2017-03-08 03:21:37 +03:00
|
|
|
dprintf_ds(zfsvfs->z_os->os_dsl_dataset,
|
2015-03-18 01:07:47 +03:00
|
|
|
"pruning, nr_to_scan=%lu objects=%d error=%d\n",
|
|
|
|
nr_to_scan, *objects, error);
|
|
|
|
|
|
|
|
return (error);
|
2011-12-23 00:20:43 +04:00
|
|
|
}
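The NUMA-aware branch of zfs_prune() divides nr_to_scan among the online nodes in proportion to each node's object count, adding 1 so that no populated node is skipped. The arithmetic can be sketched in userspace as follows. This is only an illustration under stated assumptions: mult_frac_ul() is a simplified stand-in for the kernel's mult_frac() macro, split_scan_budget() is a hypothetical helper that does not exist in ZFS, non-positive counts stand in for the "0 or SHRINK_EMPTY" case, and the counts are fixed, so the second-pass "c > tc" clamp is not needed.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Stand-in for the kernel's mult_frac(x, n, d): computes x * n / d while
 * keeping the intermediate products small.
 */
static unsigned long
mult_frac_ul(unsigned long x, unsigned long n, unsigned long d)
{
	assert(d != 0);
	unsigned long q = x / d;
	unsigned long r = x % d;
	return (q * n + (r * n) / d);
}

/*
 * Hypothetical helper: given per-node object counts, compute each node's
 * share of the scan budget in the same shape as the loop in zfs_prune().
 */
static void
split_scan_budget(unsigned long nr_to_scan, const long *counts,
    size_t nnodes, unsigned long *shares)
{
	long tc = 1;	/* starts at 1 so the divisor is never zero */

	/* First pass: total up the objects on all populated nodes. */
	for (size_t i = 0; i < nnodes; i++) {
		if (counts[i] > 0)
			tc += counts[i];
	}

	/* Second pass: hand each populated node its proportional share. */
	for (size_t i = 0; i < nnodes; i++) {
		if (counts[i] <= 0) {
			shares[i] = 0;	/* empty nodes are skipped */
			continue;
		}
		shares[i] = mult_frac_ul(nr_to_scan,
		    (unsigned long)counts[i], (unsigned long)tc) + 1;
	}
}
```

With counts of 300, 100 and 0 and a budget of 1000, the populated nodes receive roughly three quarters and one quarter of the budget, and the empty node receives nothing.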
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2017-03-08 03:21:37 +03:00
|
|
|
* Tear down the zfsvfs_t.
|
2008-11-20 23:01:55 +03:00
|
|
|
*
|
2017-01-27 22:46:39 +03:00
|
|
|
* Note, if 'unmounting' is FALSE, we return with the 'z_teardown_lock'
|
2008-11-20 23:01:55 +03:00
|
|
|
* and 'z_teardown_inactive_lock' held.
|
|
|
|
*/
|
2017-03-09 01:56:19 +03:00
|
|
|
static int
|
2017-03-08 03:21:37 +03:00
|
|
|
zfsvfs_teardown(zfsvfs_t *zfsvfs, boolean_t unmounting)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
znode_t *zp;
|
|
|
|
|
2019-02-12 21:41:15 +03:00
|
|
|
zfs_unlinked_drain_stop_wait(zfsvfs);
|
|
|
|
|
2014-01-08 22:25:42 +04:00
|
|
|
/*
|
|
|
|
* If someone has not already unmounted this file system,
|
2019-12-11 22:53:57 +03:00
|
|
|
* drain the zrele_taskq to ensure all active references to the
|
2017-03-08 03:21:37 +03:00
|
|
|
* zfsvfs_t have been handled; only then can it be safely destroyed.
|
2014-01-08 22:25:42 +04:00
|
|
|
*/
|
2017-03-08 03:21:37 +03:00
|
|
|
if (zfsvfs->z_os) {
|
2015-05-02 08:47:06 +03:00
|
|
|
/*
|
|
|
|
* If we're unmounting we have to wait for the list to
|
|
|
|
* drain completely.
|
|
|
|
*
|
|
|
|
* If we're not unmounting there's no guarantee the list
|
|
|
|
* will drain completely, but iputs run from the taskq
|
|
|
|
* may add the parents of dir-based xattrs to the taskq
|
|
|
|
* so we want to wait for these.
|
|
|
|
*
|
2023-09-19 02:53:33 +03:00
|
|
|
* We can safely check z_all_znodes for being empty because the
|
|
|
|
* VFS has already blocked operations which add to it.
|
2015-05-02 08:47:06 +03:00
|
|
|
*/
|
|
|
|
int round = 0;
|
2023-09-19 02:53:33 +03:00
|
|
|
while (!list_is_empty(&zfsvfs->z_all_znodes)) {
|
2019-12-11 22:53:57 +03:00
|
|
|
taskq_wait_outstanding(dsl_pool_zrele_taskq(
|
2017-03-08 03:21:37 +03:00
|
|
|
dmu_objset_pool(zfsvfs->z_os)), 0);
|
2015-05-02 08:47:06 +03:00
|
|
|
if (++round > 1 && !unmounting)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
2014-01-08 22:25:42 +04:00
|
|
|
|
2020-11-05 01:23:48 +03:00
|
|
|
ZFS_TEARDOWN_ENTER_WRITE(zfsvfs, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
if (!unmounting) {
|
|
|
|
/*
|
2011-02-05 03:54:34 +03:00
|
|
|
* We purge the parent filesystem's super block as the
|
|
|
|
* parent filesystem and all of its snapshots have their
|
|
|
|
* inode's super block set to the parent's filesystem's
|
|
|
|
* super block. Note, 'z_parent' is self referential
|
|
|
|
* for non-snapshots.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2017-03-08 03:21:37 +03:00
|
|
|
shrink_dcache_sb(zfsvfs->z_parent->z_sb);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Close the zil. NB: Can't close the zil while zfs_inactive
|
|
|
|
* threads are blocked as zil_close can call zfs_inactive.
|
|
|
|
*/
|
2017-03-08 03:21:37 +03:00
|
|
|
if (zfsvfs->z_log) {
|
|
|
|
zil_close(zfsvfs->z_log);
|
|
|
|
zfsvfs->z_log = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2017-03-08 03:21:37 +03:00
|
|
|
rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_WRITER);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* If we are not unmounting (i.e., online recv) and someone already
|
|
|
|
* unmounted this file system while we were doing the switcheroo,
|
|
|
|
* or a reopen of z_os failed then just bail out now.
|
|
|
|
*/
|
2017-03-08 03:21:37 +03:00
|
|
|
if (!unmounting && (zfsvfs->z_unmounted || zfsvfs->z_os == NULL)) {
|
|
|
|
rw_exit(&zfsvfs->z_teardown_inactive_lock);
|
2020-11-05 01:23:48 +03:00
|
|
|
ZFS_TEARDOWN_EXIT(zfsvfs, FTAG);
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(EIO));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2013-01-16 04:41:09 +04:00
|
|
|
* At this point there are no VFS ops active, and any new VFS ops
|
|
|
|
* will fail with EIO since we have z_teardown_lock for writer (only
|
|
|
|
* relevant for forced unmount).
|
2008-11-20 23:01:55 +03:00
|
|
|
*
|
2019-08-27 19:55:51 +03:00
|
|
|
* Release all holds on dbufs. We also grab an extra reference to all
|
|
|
|
* the remaining inodes so that the kernel does not attempt to free
|
|
|
|
* any inodes of a suspended fs. This can cause deadlocks since the
|
|
|
|
* zfs_resume_fs() process may involve starting threads, which might
|
|
|
|
* attempt to free unreferenced inodes to free up memory for the new
|
|
|
|
* thread.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2015-05-02 08:47:06 +03:00
|
|
|
if (!unmounting) {
|
2017-03-08 03:21:37 +03:00
|
|
|
mutex_enter(&zfsvfs->z_znodes_lock);
|
|
|
|
for (zp = list_head(&zfsvfs->z_all_znodes); zp != NULL;
|
|
|
|
zp = list_next(&zfsvfs->z_all_znodes, zp)) {
|
2015-05-02 08:47:06 +03:00
|
|
|
if (zp->z_sa_hdl)
|
|
|
|
zfs_znode_dmu_fini(zp);
|
2019-08-27 19:55:51 +03:00
|
|
|
if (igrab(ZTOI(zp)) != NULL)
|
|
|
|
zp->z_suspended = B_TRUE;
|
|
|
|
|
2015-05-02 08:47:06 +03:00
|
|
|
}
|
2017-03-08 03:21:37 +03:00
|
|
|
mutex_exit(&zfsvfs->z_znodes_lock);
|
2013-01-16 04:41:09 +04:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2013-01-16 04:41:09 +04:00
|
|
|
* If we are unmounting, set the unmounted flag and let new VFS ops
|
2008-11-20 23:01:55 +03:00
|
|
|
* unblock. zfs_inactive will have the unmounted behavior, and all
|
2013-01-16 04:41:09 +04:00
|
|
|
* other VFS ops will fail with EIO.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
if (unmounting) {
|
2017-03-08 03:21:37 +03:00
|
|
|
zfsvfs->z_unmounted = B_TRUE;
|
|
|
|
rw_exit(&zfsvfs->z_teardown_inactive_lock);
|
2020-11-05 01:23:48 +03:00
|
|
|
ZFS_TEARDOWN_EXIT(zfsvfs, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* z_os will be NULL if there was an error in attempting to reopen
|
2017-03-08 03:21:37 +03:00
|
|
|
* zfsvfs, so just return as the properties had already been
|
2008-11-20 23:01:55 +03:00
|
|
|
* unregistered and cached data had been evicted before.
|
|
|
|
*/
|
2017-03-08 03:21:37 +03:00
|
|
|
if (zfsvfs->z_os == NULL)
|
2008-11-20 23:01:55 +03:00
|
|
|
return (0);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Unregister properties.
|
|
|
|
*/
|
2017-03-08 03:21:37 +03:00
|
|
|
zfs_unregister_callbacks(zfsvfs);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
Always wait for txg sync when umounting dataset
Currently, when unmounting a filesystem, ZFS will only wait for
a txg sync if the dataset is dirty and not readonly. However, this
can be problematic in cases where a dataset is remounted readonly
immediately before being unmounted, which often happens when the
system is being shut down. Since encrypted datasets require that
all I/O is completed before the dataset is disowned, this issue
causes problems when write I/Os leak into the txgs after the
dataset is disowned, which can happen when sync=disabled.
While looking into fixes for this issue, it was discovered that
dsl_dataset_is_dirty() does not return B_TRUE when the dataset has
been removed from the txg dirty datasets list, but has not actually
been processed yet. Furthermore, the implementation is completely
different from dmu_objset_is_dirty(), adding to the confusion.
Rather than relying on this function, this patch forces the umount
code path (and the remount readonly code path) to always perform a
txg sync on read-write datasets and removes the function altogether.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7753
Closes #7795
2018-08-20 23:42:17 +03:00
|
|
|
* Evict cached data. We must write out any dirty data before
|
|
|
|
* disowning the dataset.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2019-08-15 17:27:13 +03:00
|
|
|
objset_t *os = zfsvfs->z_os;
|
|
|
|
boolean_t os_dirty = B_FALSE;
|
|
|
|
for (int t = 0; t < TXG_SIZE; t++) {
|
|
|
|
if (dmu_objset_is_dirty(os, t)) {
|
|
|
|
os_dirty = B_TRUE;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (!zfs_is_readonly(zfsvfs) && os_dirty) {
|
2017-03-08 03:21:37 +03:00
|
|
|
txg_wait_synced(dmu_objset_pool(zfsvfs->z_os), 0);
|
2019-08-15 17:27:13 +03:00
|
|
|
}
|
2017-03-08 03:21:37 +03:00
|
|
|
dmu_objset_evict_dbufs(zfsvfs->z_os);
|
2020-04-01 20:02:06 +03:00
|
|
|
dsl_dir_t *dd = os->os_dsl_dataset->ds_dir;
|
|
|
|
dsl_dir_cancel_waiters(dd);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
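The drain loop near the top of zfsvfs_teardown() waits until the znode list is empty when unmounting, but gives up after two passes otherwise. Its control flow can be sketched in isolation as below. This is a simulation under assumptions, not ZFS code: struct drain_sim, sim_wait_outstanding() and drain() are hypothetical names, and a simple counter stands in for z_all_znodes and the zrele taskq.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the znode list plus the zrele taskq. */
struct drain_sim {
	int pending;	/* entries still on the list */
	int per_wait;	/* entries retired by one wait */
	int waits;	/* number of waits performed so far */
};

/* Simulates one taskq wait: retires up to per_wait entries. */
static void
sim_wait_outstanding(struct drain_sim *s)
{
	assert(s->per_wait > 0);
	s->pending -= (s->pending < s->per_wait) ? s->pending : s->per_wait;
	s->waits++;
}

/*
 * Same loop shape as in zfsvfs_teardown(): unbounded when unmounting,
 * at most two passes otherwise.
 */
static void
drain(struct drain_sim *s, bool unmounting)
{
	int round = 0;

	while (s->pending != 0) {
		sim_wait_outstanding(s);
		if (++round > 1 && !unmounting)
			break;
	}
}
```

In the non-unmounting case the loop can exit with entries still pending, which matches the comment's statement that there is no guarantee the list drains completely.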
|
|
|
|
|
2019-11-12 19:59:06 +03:00
|
|
|
#if defined(HAVE_SUPER_SETUP_BDI_NAME)
|
2011-11-08 04:39:03 +04:00
|
|
|
atomic_long_t zfs_bdi_seq = ATOMIC_LONG_INIT(0);
|
2015-02-28 03:09:52 +03:00
|
|
|
#endif
|
2011-08-02 05:24:40 +04:00
|
|
|
|
2010-12-17 22:18:08 +03:00
|
|
|
int
|
2017-03-09 03:56:09 +03:00
|
|
|
zfs_domount(struct super_block *sb, zfs_mnt_t *zm, int silent)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2017-03-09 03:56:09 +03:00
|
|
|
const char *osname = zm->mnt_osname;
|
2020-12-11 02:23:26 +03:00
|
|
|
struct inode *root_inode = NULL;
|
2011-02-08 22:16:06 +03:00
|
|
|
uint64_t recordsize;
|
2017-03-09 03:56:09 +03:00
|
|
|
int error = 0;
|
2018-02-21 03:27:31 +03:00
|
|
|
zfsvfs_t *zfsvfs = NULL;
|
|
|
|
vfs_t *vfs = NULL;
|
2021-02-21 19:19:43 +03:00
|
|
|
int canwrite;
|
|
|
|
int dataset_visible_zone;
|
2017-03-09 03:56:09 +03:00
|
|
|
|
|
|
|
ASSERT(zm);
|
|
|
|
ASSERT(osname);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2021-02-21 19:19:43 +03:00
|
|
|
dataset_visible_zone = zone_dataset_visible(osname, &canwrite);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Refuse to mount a filesystem if we are in a namespace and the
|
|
|
|
* dataset is not visible or writable in that namespace.
|
|
|
|
*/
|
|
|
|
if (!INGLOBALZONE(curproc) &&
|
|
|
|
(!dataset_visible_zone || !canwrite)) {
|
|
|
|
return (SET_ERROR(EPERM));
|
|
|
|
}
|
|
|
|
|
2018-02-21 03:27:31 +03:00
|
|
|
error = zfsvfs_parse_options(zm->mnt_data, &vfs);
|
2011-02-08 22:16:06 +03:00
|
|
|
if (error)
|
|
|
|
return (error);
|
|
|
|
|
2021-02-21 19:19:43 +03:00
|
|
|
/*
|
|
|
|
* If a non-writable filesystem is being mounted without the
|
|
|
|
* read-only flag, pretend it was set, as done for snapshots.
|
|
|
|
*/
|
|
|
|
if (!canwrite)
|
2023-11-08 00:24:16 +03:00
|
|
|
vfs->vfs_readonly = B_TRUE;
|
2021-02-21 19:19:43 +03:00
|
|
|
|
2018-02-21 03:27:31 +03:00
|
|
|
error = zfsvfs_create(osname, vfs->vfs_readonly, &zfsvfs);
|
|
|
|
if (error) {
|
|
|
|
zfsvfs_vfs_free(vfs);
|
2017-03-09 03:56:09 +03:00
|
|
|
goto out;
|
2018-02-21 03:27:31 +03:00
|
|
|
}
|
2017-03-09 03:56:09 +03:00
|
|
|
|
2011-02-08 22:16:06 +03:00
|
|
|
if ((error = dsl_prop_get_integer(osname, "recordsize",
|
2018-02-21 03:27:31 +03:00
|
|
|
&recordsize, NULL))) {
|
|
|
|
zfsvfs_vfs_free(vfs);
|
2011-02-08 22:16:06 +03:00
|
|
|
goto out;
|
2018-02-21 03:27:31 +03:00
|
|
|
}
|
2011-02-08 22:16:06 +03:00
|
|
|
|
2018-02-21 03:27:31 +03:00
|
|
|
vfs->vfs_data = zfsvfs;
|
|
|
|
zfsvfs->z_vfs = vfs;
|
2017-03-08 03:21:37 +03:00
|
|
|
zfsvfs->z_sb = sb;
|
|
|
|
sb->s_fs_info = zfsvfs;
|
2011-02-08 22:16:06 +03:00
|
|
|
sb->s_magic = ZFS_SUPER_MAGIC;
|
|
|
|
sb->s_maxbytes = MAX_LFS_FILESIZE;
|
|
|
|
sb->s_time_gran = 1;
|
|
|
|
sb->s_blocksize = recordsize;
|
|
|
|
sb->s_blocksize_bits = ilog2(recordsize);
|
2011-11-08 04:39:03 +04:00
|
|
|
|
2017-05-02 19:46:18 +03:00
|
|
|
error = -zpl_bdi_setup(sb, "zfs");
|
2011-11-08 04:39:03 +04:00
|
|
|
if (error)
|
|
|
|
goto out;
|
2011-02-08 22:16:06 +03:00
|
|
|
|
2017-05-02 19:46:18 +03:00
|
|
|
sb->s_bdi->ra_pages = 0;
|
|
|
|
|
2011-02-08 22:16:06 +03:00
|
|
|
/* Set callback operations for the file system. */
|
|
|
|
sb->s_op = &zpl_super_operations;
|
|
|
|
sb->s_xattr = zpl_xattr_handlers;
|
2011-04-28 20:35:50 +04:00
|
|
|
sb->s_export_op = &zpl_export_operations;
|
2011-02-08 22:16:06 +03:00
|
|
|
|
|
|
|
/* Set features for file system. */
|
2017-03-08 03:21:37 +03:00
|
|
|
zfs_set_fuid_feature(zfsvfs);
|
2011-02-08 22:16:06 +03:00
|
|
|
|
2017-03-08 03:21:37 +03:00
|
|
|
if (dmu_objset_is_snapshot(zfsvfs->z_os)) {
|
2011-02-08 22:16:06 +03:00
|
|
|
uint64_t pval;
|
|
|
|
|
2017-03-08 03:21:37 +03:00
|
|
|
atime_changed_cb(zfsvfs, B_FALSE);
|
|
|
|
readonly_changed_cb(zfsvfs, B_TRUE);
|
2013-11-01 23:26:11 +04:00
|
|
|
if ((error = dsl_prop_get_integer(osname,
|
|
|
|
"xattr", &pval, NULL)))
|
2011-02-08 22:16:06 +03:00
|
|
|
goto out;
|
2017-03-08 03:21:37 +03:00
|
|
|
xattr_changed_cb(zfsvfs, pval);
|
2013-11-01 23:26:11 +04:00
|
|
|
if ((error = dsl_prop_get_integer(osname,
|
|
|
|
"acltype", &pval, NULL)))
|
2013-10-28 20:22:15 +04:00
|
|
|
goto out;
|
2017-03-08 03:21:37 +03:00
|
|
|
acltype_changed_cb(zfsvfs, pval);
|
|
|
|
zfsvfs->z_issnap = B_TRUE;
|
|
|
|
zfsvfs->z_os->os_sync = ZFS_SYNC_DISABLED;
|
|
|
|
zfsvfs->z_snap_defer_time = jiffies;
|
|
|
|
|
|
|
|
mutex_enter(&zfsvfs->z_os->os_user_ptr_lock);
|
|
|
|
dmu_objset_set_user(zfsvfs->z_os, zfsvfs);
|
|
|
|
mutex_exit(&zfsvfs->z_os->os_user_ptr_lock);
|
2011-02-08 22:16:06 +03:00
|
|
|
} else {
|
2017-03-09 01:56:19 +03:00
|
|
|
if ((error = zfsvfs_setup(zfsvfs, B_TRUE)))
|
2016-07-19 22:02:33 +03:00
|
|
|
goto out;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2011-02-08 22:16:06 +03:00
|
|
|
/* Allocate a root inode for the filesystem. */
|
2017-03-08 03:21:37 +03:00
|
|
|
error = zfs_root(zfsvfs, &root_inode);
|
2011-02-08 22:16:06 +03:00
|
|
|
if (error) {
|
|
|
|
(void) zfs_umount(sb);
|
zfs_domount: fix double-disown of dataset / double-free of zfsvfs_t
Before this patch, in zfs_domount, if zfs_root or d_make_root fails, we
leave zfsvfs != NULL. This will lead to execution of the error handling
`if` statement at the `out` label, and hence to a call to
dmu_objset_disown and zfsvfs_free.
However, zfs_umount, which we call upon failure of zfs_root and
d_make_root already does dmu_objset_disown and zfsvfs_free.
I suppose this patch rather adds to the brittleness of this part of the
code base, but I don't want to invest more time in this right now.
To add a regression test, we'd need some kind of fault injection
facility for zfs_root or d_make_root, which doesn't exist right now.
And even then, I think that regression test would be too closely tied
to the implementation.
To repro the double-disown / double-free, do the following:
1. patch zfs_root to always return an error
2. mount a ZFS filesystem
Here's the stack trace you would see then:
VERIFY3(ds->ds_owner == tag) failed (0000000000000000 == ffff9142361e8000)
PANIC at dsl_dataset.c:1003:dsl_dataset_disown()
Showing stack for process 28332
CPU: 2 PID: 28332 Comm: zpool Tainted: G O 5.10.103-1.nutanix.el7.x86_64 #1
Call Trace:
dump_stack+0x74/0x92
spl_dumpstack+0x29/0x2b [spl]
spl_panic+0xd4/0xfc [spl]
dsl_dataset_disown+0xe9/0x150 [zfs]
dmu_objset_disown+0xd6/0x150 [zfs]
zfs_domount+0x17b/0x4b0 [zfs]
zpl_mount+0x174/0x220 [zfs]
legacy_get_tree+0x2b/0x50
vfs_get_tree+0x2a/0xc0
path_mount+0x2fa/0xa70
do_mount+0x7c/0xa0
__x64_sys_mount+0x8b/0xe0
do_syscall_64+0x38/0x50
entry_SYSCALL_64_after_hwframe+0x44/0xa9
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Co-authored-by: Christian Schwarz <christian.schwarz@nutanix.com>
Signed-off-by: Christian Schwarz <christian.schwarz@nutanix.com>
Closes #14025
2022-10-14 21:46:47 +03:00
|
|
|
zfsvfs = NULL; /* avoid double-free; zfs_umount already freed it */
|
2011-02-08 22:16:06 +03:00
|
|
|
goto out;
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2011-02-08 22:16:06 +03:00
|
|
|
/* Allocate a root dentry for the filesystem */
|
2012-06-06 21:08:00 +04:00
|
|
|
sb->s_root = d_make_root(root_inode);
|
2011-02-08 22:16:06 +03:00
|
|
|
if (sb->s_root == NULL) {
|
|
|
|
(void) zfs_umount(sb);
|
2022-10-14 21:46:47 +03:00
|
|
|
zfsvfs = NULL; /* avoid double-free; zfs_umount already freed it */
|
2013-03-08 22:41:28 +04:00
|
|
|
error = SET_ERROR(ENOMEM);
|
2011-02-08 22:16:06 +03:00
|
|
|
goto out;
|
|
|
|
}
|
2011-11-11 11:15:53 +04:00
|
|
|
|
2017-03-08 03:21:37 +03:00
|
|
|
if (!zfsvfs->z_issnap)
|
|
|
|
zfsctl_create(zfsvfs);
|
2015-03-18 01:07:47 +03:00
|
|
|
|
2017-03-08 03:21:37 +03:00
|
|
|
zfsvfs->z_arc_prune = arc_add_prune_callback(zpl_prune_sb, sb);
|
2011-02-08 22:16:06 +03:00
|
|
|
out:
|
|
|
|
if (error) {
|
2018-02-21 03:27:31 +03:00
|
|
|
if (zfsvfs != NULL) {
|
|
|
|
dmu_objset_disown(zfsvfs->z_os, B_TRUE, zfsvfs);
|
|
|
|
zfsvfs_free(zfsvfs);
|
|
|
|
}
|
2016-07-19 22:02:33 +03:00
|
|
|
/*
|
|
|
|
* Make sure we don't leave a dangling sb->s_fs_info, which
|
|
|
|
* zfs_preumount will use.
|
|
|
|
*/
|
|
|
|
sb->s_fs_info = NULL;
|
2011-02-08 22:16:06 +03:00
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2011-02-08 22:16:06 +03:00
|
|
|
return (error);
|
|
|
|
}
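The error handling in zfs_domount() relies on clearing the local pointer once another routine has taken over freeing it, so that the shared out: label does not free it a second time. A minimal sketch of that ownership handoff follows. All names here (struct ctx, ctx_free(), full_teardown(), setup()) are hypothetical and chosen for illustration; only the pattern is taken from the function above.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical resource; free_count lets a caller observe each free. */
struct ctx {
	int *free_count;
};

static void
ctx_free(struct ctx *c)
{
	(*c->free_count)++;
	free(c);
}

/* Hypothetical teardown that, like zfs_umount(), frees the context itself. */
static void
full_teardown(struct ctx *c)
{
	ctx_free(c);
}

/*
 * fail_stage 0: success, the context is returned through cp.
 * fail_stage 1: early failure, this function still owns the context.
 * fail_stage 2: late failure, full_teardown() has already freed it.
 */
static int
setup(int fail_stage, int *free_count, struct ctx **cp)
{
	int error = 0;
	struct ctx *c = malloc(sizeof (*c));

	if (c == NULL)
		return (-1);
	c->free_count = free_count;

	if (fail_stage == 1) {
		error = -1;
		goto out;
	}
	if (fail_stage == 2) {
		full_teardown(c);
		c = NULL;	/* ownership handed off; skip the free below */
		error = -1;
		goto out;
	}
	*cp = c;
out:
	if (error && c != NULL)
		ctx_free(c);
	return (error);
}
```

Each failure path frees the context exactly once. Without the `c = NULL` assignment, the late-failure path would free it twice.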
|
|
|
|
|
2011-11-11 11:15:53 +04:00
|
|
|
/*
|
|
|
|
* Called when an unmount is requested and certain sanity checks have
|
|
|
|
* already passed. At this point no dentries or inodes have been reclaimed
|
|
|
|
* from their respective caches. We drop the extra reference on the .zfs
|
|
|
|
* control directory to allow everything to be reclaimed. All snapshots
|
|
|
|
* must already have been unmounted to reach this point.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
zfs_preumount(struct super_block *sb)
|
|
|
|
{
|
2017-03-08 03:21:37 +03:00
|
|
|
zfsvfs_t *zfsvfs = sb->s_fs_info;
|
2011-11-11 11:15:53 +04:00
|
|
|
|
2017-03-08 03:21:37 +03:00
|
|
|
/* zfsvfs is NULL when zfs_domount fails during mount */
|
|
|
|
if (zfsvfs) {
|
2019-02-12 21:41:15 +03:00
|
|
|
zfs_unlinked_drain_stop_wait(zfsvfs);
|
2015-04-25 02:21:13 +03:00
|
|
|
zfsctl_destroy(sb->s_fs_info);
|
2016-07-19 22:02:33 +03:00
|
|
|
/*
|
2019-12-11 22:53:57 +03:00
|
|
|
* Wait for zrele_async before entering evict_inodes in
|
2016-07-19 22:02:33 +03:00
|
|
|
* generic_shutdown_super. The reason we must finish before
|
|
|
|
* evict_inodes is that when lazytime is on, or when zfs_purgedir
|
2019-12-11 22:53:57 +03:00
|
|
|
* calls zfs_zget, zrele would bump i_count from 0 to 1. This
|
2016-07-19 22:02:33 +03:00
|
|
|
* would race with the i_count check in evict_inodes. This means
|
|
|
|
* it could destroy the inode while we are still using it.
|
|
|
|
*
|
|
|
|
* We wait for two passes. xattr directories in the first pass
|
|
|
|
* may add xattr entries in zfs_purgedir, so in the second pass
|
|
|
|
* we wait for them. We don't use taskq_wait here because it is
|
|
|
|
* a pool wide taskq. Other mounted filesystems can constantly
|
2019-12-11 22:53:57 +03:00
|
|
|
* do zrele_async and there's no guarantee when taskq will be
|
2016-07-19 22:02:33 +03:00
|
|
|
* empty.
|
|
|
|
*/
|
2019-12-11 22:53:57 +03:00
|
|
|
taskq_wait_outstanding(dsl_pool_zrele_taskq(
|
2017-03-08 03:21:37 +03:00
|
|
|
dmu_objset_pool(zfsvfs->z_os)), 0);
|
2019-12-11 22:53:57 +03:00
|
|
|
taskq_wait_outstanding(dsl_pool_zrele_taskq(
|
2017-03-08 03:21:37 +03:00
|
|
|
dmu_objset_pool(zfsvfs->z_os)), 0);
|
2016-07-19 22:02:33 +03:00
|
|
|
}
|
2011-11-11 11:15:53 +04:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Called once all other unmount-related teardown has occurred.
|
|
|
|
* It is our responsibility to release any remaining infrastructure.
|
|
|
|
*/
|
2011-02-08 22:16:06 +03:00
|
|
|
int
|
|
|
|
zfs_umount(struct super_block *sb)
|
|
|
|
{
|
2017-03-08 03:21:37 +03:00
|
|
|
zfsvfs_t *zfsvfs = sb->s_fs_info;
|
2011-02-08 22:16:06 +03:00
|
|
|
objset_t *os;
|
|
|
|
|
Native Encryption for ZFS on Linux
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decrypt, and authenticate protected datasets.
Each object set maintains a Merkle tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494
Closes #5769
2017-08-14 20:36:48 +03:00
|
|
|
if (zfsvfs->z_arc_prune != NULL)
|
|
|
|
arc_remove_prune_callback(zfsvfs->z_arc_prune);
|
2017-03-08 03:21:37 +03:00
|
|
|
VERIFY(zfsvfs_teardown(zfsvfs, B_TRUE) == 0);
|
|
|
|
os = zfsvfs->z_os;
|
2017-05-02 19:46:18 +03:00
|
|
|
zpl_bdi_destroy(sb);
|
2011-08-02 05:24:40 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
|
|
|
* z_os will be NULL if there was an error in
|
2017-03-08 03:21:37 +03:00
|
|
|
* attempting to reopen zfsvfs.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
if (os != NULL) {
|
|
|
|
/*
|
|
|
|
* Unset the objset user_ptr.
|
|
|
|
*/
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_enter(&os->os_user_ptr_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
dmu_objset_set_user(os, NULL);
|
2010-05-29 00:45:14 +04:00
|
|
|
mutex_exit(&os->os_user_ptr_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
|
|
|
/*
|
2008-12-03 23:09:06 +03:00
|
|
|
* Finally release the objset
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2017-08-14 20:36:48 +03:00
|
|
|
dmu_objset_disown(os, B_TRUE, zfsvfs);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2017-03-09 01:56:19 +03:00
|
|
|
zfsvfs_free(zfsvfs);
|
2023-07-20 20:30:21 +03:00
|
|
|
sb->s_fs_info = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2011-03-15 22:41:19 +03:00
|
|
|
int
|
2017-03-09 03:56:09 +03:00
|
|
|
zfs_remount(struct super_block *sb, int *flags, zfs_mnt_t *zm)
|
2011-03-15 22:41:19 +03:00
|
|
|
{
|
2017-03-08 03:21:37 +03:00
|
|
|
zfsvfs_t *zfsvfs = sb->s_fs_info;
|
2017-03-09 03:56:09 +03:00
|
|
|
vfs_t *vfsp;
|
2017-08-18 00:28:17 +03:00
|
|
|
boolean_t issnap = dmu_objset_is_snapshot(zfsvfs->z_os);
|
2015-09-01 02:46:01 +03:00
|
|
|
int error;
|
|
|
|
|
2017-08-18 00:28:17 +03:00
|
|
|
if ((issnap || !spa_writeable(dmu_objset_spa(zfsvfs->z_os))) &&
|
2019-01-11 02:28:44 +03:00
|
|
|
!(*flags & SB_RDONLY)) {
|
|
|
|
*flags |= SB_RDONLY;
|
2017-08-18 00:28:17 +03:00
|
|
|
return (EROFS);
|
|
|
|
}
|
|
|
|
|
2017-03-09 03:56:09 +03:00
|
|
|
error = zfsvfs_parse_options(zm->mnt_data, &vfsp);
|
|
|
|
if (error)
|
|
|
|
return (error);
|
|
|
|
|
2019-01-11 02:28:44 +03:00
|
|
|
if (!zfs_is_readonly(zfsvfs) && (*flags & SB_RDONLY))
|
2018-08-20 23:42:17 +03:00
|
|
|
txg_wait_synced(dmu_objset_pool(zfsvfs->z_os), 0);
|
|
|
|
|
2017-03-08 03:21:37 +03:00
|
|
|
zfs_unregister_callbacks(zfsvfs);
|
2017-03-09 03:56:09 +03:00
|
|
|
zfsvfs_vfs_free(zfsvfs->z_vfs);
|
|
|
|
|
|
|
|
vfsp->vfs_data = zfsvfs;
|
|
|
|
zfsvfs->z_vfs = vfsp;
|
2017-08-18 00:28:17 +03:00
|
|
|
if (!issnap)
|
|
|
|
(void) zfs_register_callbacks(vfsp);
|
2015-09-01 02:46:01 +03:00
|
|
|
|
|
|
|
return (error);
|
2011-03-15 22:41:19 +03:00
|
|
|
}
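zfs_vget(), which follows, rebuilds the object number and generation from byte arrays in the file identifier, treating index 0 as the least-significant byte. The decoding step can be sketched on its own as below. decode_le() is a hypothetical helper rather than a ZFS function, and the field widths used in the usage note are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: assemble a little-endian byte array of up to
 * eight bytes into a 64-bit value, one byte per loop iteration.
 */
static uint64_t
decode_le(const uint8_t *bytes, size_t len)
{
	uint64_t v = 0;

	assert(len <= sizeof (v));
	for (size_t i = 0; i < len; i++)
		v |= ((uint64_t)bytes[i]) << (8 * i);
	return (v);
}
```

For example, a six-byte field holding 0x34, 0x12 followed by four zero bytes decodes to 0x1234. The cast to uint64_t before the shift matters, because shifting a promoted int by 32 bits or more would be undefined.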
|
|
|
|
|
2010-12-17 22:18:08 +03:00
|
|
|
int
|
2011-05-19 22:44:07 +04:00
|
|
|
zfs_vget(struct super_block *sb, struct inode **ipp, fid_t *fidp)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2017-03-08 03:21:37 +03:00
|
|
|
zfsvfs_t *zfsvfs = sb->s_fs_info;
|
2008-11-20 23:01:55 +03:00
|
|
|
znode_t *zp;
|
|
|
|
uint64_t object = 0;
|
|
|
|
uint64_t fid_gen = 0;
|
|
|
|
uint64_t gen_mask;
|
|
|
|
uint64_t zp_gen;
|
2011-02-08 22:16:06 +03:00
|
|
|
int i, err;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2011-02-08 22:16:06 +03:00
|
|
|
*ipp = NULL;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-03-08 20:26:33 +03:00
|
|
|
if (fidp->fid_len == SHORT_FID_LEN || fidp->fid_len == LONG_FID_LEN) {
|
|
|
|
zfid_short_t *zfid = (zfid_short_t *)fidp;
|
|
|
|
|
|
|
|
for (i = 0; i < sizeof (zfid->zf_object); i++)
|
|
|
|
object |= ((uint64_t)zfid->zf_object[i]) << (8 * i);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-03-08 20:26:33 +03:00
|
|
|
for (i = 0; i < sizeof (zfid->zf_gen); i++)
|
|
|
|
fid_gen |= ((uint64_t)zfid->zf_gen[i]) << (8 * i);
|
|
|
|
} else {
|
|
|
|
return (SET_ERROR(EINVAL));
|
|
|
|
}
|
|
|
|
|
|
|
|
/* LONG_FID_LEN means snapdirs */
|
2008-11-20 23:01:55 +03:00
|
|
|
if (fidp->fid_len == LONG_FID_LEN) {
|
|
|
|
zfid_long_t *zlfid = (zfid_long_t *)fidp;
|
|
|
|
uint64_t objsetid = 0;
|
|
|
|
uint64_t setgen = 0;
|
|
|
|
|
|
|
|
for (i = 0; i < sizeof (zlfid->zf_setid); i++)
|
|
|
|
objsetid |= ((uint64_t)zlfid->zf_setid[i]) << (8 * i);
|
|
|
|
|
|
|
|
for (i = 0; i < sizeof (zlfid->zf_setgen); i++)
|
|
|
|
setgen |= ((uint64_t)zlfid->zf_setgen[i]) << (8 * i);
|
|
|
|
|
2017-03-08 20:26:33 +03:00
|
|
|
if (objsetid != ZFSCTL_INO_SNAPDIRS - object) {
|
|
|
|
dprintf("snapdir fid: objsetid (%llu) != "
|
|
|
|
"ZFSCTL_INO_SNAPDIRS (%llu) - object (%llu)\n",
|
|
|
|
objsetid, ZFSCTL_INO_SNAPDIRS, object);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(EINVAL));
|
2017-03-08 20:26:33 +03:00
|
|
|
}
|
2011-11-11 11:15:53 +04:00
|
|
|
|
2017-03-08 20:26:33 +03:00
|
|
|
if (fid_gen > 1 || setgen != 0) {
|
|
|
|
dprintf("snapdir fid: fid_gen (%llu) and setgen "
|
|
|
|
"(%llu)\n", fid_gen, setgen);
|
|
|
|
return (SET_ERROR(EINVAL));
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-03-08 20:26:33 +03:00
|
|
|
return (zfsctl_snapdir_vget(sb, objsetid, fid_gen, ipp));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2022-09-16 23:36:47 +03:00
|
|
|
if ((err = zfs_enter(zfsvfs, FTAG)) != 0)
|
|
|
|
return (err);
|
2008-11-20 23:01:55 +03:00
|
|
|
/* A zero fid_gen means we are in the .zfs control directories */
|
|
|
|
if (fid_gen == 0 &&
|
|
|
|
(object == ZFSCTL_INO_ROOT || object == ZFSCTL_INO_SNAPDIR)) {
|
2017-03-08 03:21:37 +03:00
|
|
|
*ipp = zfsvfs->z_ctldir;
|
2011-02-08 22:16:06 +03:00
|
|
|
ASSERT(*ipp != NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
if (object == ZFSCTL_INO_SNAPDIR) {
|
2011-11-11 11:15:53 +04:00
|
|
|
VERIFY(zfsctl_root_lookup(*ipp, "snapshot", ipp,
|
|
|
|
0, kcred, NULL, NULL) == 0);
|
2008-11-20 23:01:55 +03:00
|
|
|
} else {
|
2021-03-17 02:33:34 +03:00
|
|
|
/*
|
|
|
|
* Must have an existing ref, so igrab()
|
|
|
|
* cannot return NULL
|
|
|
|
*/
|
|
|
|
VERIFY3P(igrab(*ipp), !=, NULL);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2022-09-16 23:36:47 +03:00
|
|
|
zfs_exit(zfsvfs, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
gen_mask = -1ULL >> (64 - 8 * i);
|
|
|
|
|
2014-10-24 03:59:27 +04:00
|
|
|
dprintf("getting %llu [%llu mask %llx]\n", object, fid_gen, gen_mask);
|
2017-03-08 03:21:37 +03:00
|
|
|
if ((err = zfs_zget(zfsvfs, object, &zp))) {
|
2022-09-16 23:36:47 +03:00
|
|
|
zfs_exit(zfsvfs, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
return (err);
|
|
|
|
}
|
2016-07-07 02:54:19 +03:00
|
|
|
|
|
|
|
/* Don't export xattr stuff */
|
|
|
|
if (zp->z_pflags & ZFS_XATTR) {
|
2019-12-11 22:53:57 +03:00
|
|
|
zrele(zp);
|
2022-09-16 23:36:47 +03:00
|
|
|
zfs_exit(zfsvfs, FTAG);
|
2016-07-07 02:54:19 +03:00
|
|
|
return (SET_ERROR(ENOENT));
|
|
|
|
}
|
|
|
|
|
2017-03-08 03:21:37 +03:00
|
|
|
(void) sa_lookup(zp->z_sa_hdl, SA_ZPL_GEN(zfsvfs), &zp_gen,
|
2010-05-29 00:45:14 +04:00
|
|
|
sizeof (uint64_t));
|
|
|
|
zp_gen = zp_gen & gen_mask;
|
2008-11-20 23:01:55 +03:00
|
|
|
if (zp_gen == 0)
|
|
|
|
zp_gen = 1;
|
2017-03-08 03:21:37 +03:00
|
|
|
if ((fid_gen == 0) && (zfsvfs->z_root == object))
|
2015-08-29 00:54:32 +03:00
|
|
|
fid_gen = zp_gen;
|
2008-11-20 23:01:55 +03:00
|
|
|
if (zp->z_unlinked || zp_gen != fid_gen) {
|
2014-10-24 03:59:27 +04:00
|
|
|
dprintf("znode gen (%llu) != fid gen (%llu)\n", zp_gen,
|
|
|
|
fid_gen);
|
2019-12-11 22:53:57 +03:00
|
|
|
zrele(zp);
|
2022-09-16 23:36:47 +03:00
|
|
|
zfs_exit(zfsvfs, FTAG);
|
2016-07-09 00:51:42 +03:00
|
|
|
return (SET_ERROR(ENOENT));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2011-02-08 22:16:06 +03:00
|
|
|
*ipp = ZTOI(zp);
|
|
|
|
if (*ipp)
|
2021-02-09 22:17:29 +03:00
|
|
|
zfs_znode_update_vfs(ITOZ(*ipp));
|
2011-01-06 01:27:30 +03:00
|
|
|
|
2022-09-16 23:36:47 +03:00
|
|
|
zfs_exit(zfsvfs, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
return (0);
|
|
|
|
}
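Illustrative sketch (not part of the file): zfs_vget() above assembles the object number and generation from byte arrays stored least-significant byte first, then builds gen_mask from the loop index i, which still equals sizeof (zfid->zf_gen) once the generation loop finishes. The helpers fid_decode_le() and fid_gen_mask() are hypothetical names introduced only for this example, and the 4-byte generation width mentioned in the note after the code is an assumption about zfid_short_t, not something shown in this file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: assemble up to 8 bytes, stored least-significant
 * byte first, into a uint64_t. Mirrors the "|= byte << (8 * i)" loops.
 */
static uint64_t
fid_decode_le(const uint8_t *bytes, size_t len)
{
	uint64_t val = 0;

	assert(len <= sizeof (uint64_t));
	for (size_t i = 0; i < len; i++)
		val |= ((uint64_t)bytes[i]) << (8 * i);
	return (val);
}

/*
 * Hypothetical helper: mask that keeps only the low `len` bytes, the
 * same value as "-1ULL >> (64 - 8 * i)" with i == len.
 */
static uint64_t
fid_gen_mask(size_t len)
{
	/* len must be 1..8; a shift by 64 bits would be undefined. */
	assert(len >= 1 && len <= sizeof (uint64_t));
	return (UINT64_MAX >> (64 - 8 * len));
}
```

Assuming a 4-byte generation field, fid_gen_mask(4) yields 0xffffffff, so only the low 32 bits of the znode generation would be compared against the generation carried in the FID.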
|
|
|
|
|
|
|
|
/*
|
2017-03-08 03:21:37 +03:00
|
|
|
* Block out VFS ops and close zfsvfs_t
|
2008-11-20 23:01:55 +03:00
|
|
|
*
|
|
|
|
* Note, if successful, then we return with the 'z_teardown_lock' and
|
2013-07-27 21:50:07 +04:00
|
|
|
* 'z_teardown_inactive_lock' write held. We leave ownership of the underlying
|
|
|
|
* dataset and objset intact so that they can be atomically handed off during
|
|
|
|
* a subsequent rollback or recv operation and the resume thereafter.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
int
|
2017-03-08 03:21:37 +03:00
|
|
|
zfs_suspend_fs(zfsvfs_t *zfsvfs)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
int error;
|
|
|
|
|
2017-03-08 03:21:37 +03:00
|
|
|
if ((error = zfsvfs_teardown(zfsvfs, B_FALSE)) != 0)
|
2008-11-20 23:01:55 +03:00
|
|
|
return (error);
|
2013-01-16 04:41:09 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2016-05-11 06:49:02 +03:00
|
|
|
* Rebuild SA and release VOPs. Note that ownership of the underlying dataset
|
|
|
|
* is an invariant across any of the operations that can be performed while the
|
|
|
|
* filesystem was suspended. Whether it succeeded or failed, the preconditions
|
|
|
|
* are the same: the relevant objset and associated dataset are owned by
|
|
|
|
* zfsvfs, held, and long held on entry.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
|
|
|
int
|
2017-03-08 03:21:37 +03:00
|
|
|
zfs_resume_fs(zfsvfs_t *zfsvfs, dsl_dataset_t *ds)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
2013-11-10 05:22:06 +04:00
|
|
|
int err, err2;
|
2013-07-27 21:50:07 +04:00
|
|
|
znode_t *zp;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2020-11-05 01:23:48 +03:00
|
|
|
ASSERT(ZFS_TEARDOWN_WRITE_HELD(zfsvfs));
|
2017-03-08 03:21:37 +03:00
|
|
|
ASSERT(RW_WRITE_HELD(&zfsvfs->z_teardown_inactive_lock));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-07-27 21:50:07 +04:00
|
|
|
/*
|
2017-01-23 21:53:46 +03:00
|
|
|
* We already own this, so just update the objset_t, as the one we
|
|
|
|
* had before may have been evicted.
|
2013-07-27 21:50:07 +04:00
|
|
|
*/
|
2016-05-11 06:49:02 +03:00
|
|
|
objset_t *os;
|
2017-03-08 03:21:37 +03:00
|
|
|
VERIFY3P(ds->ds_owner, ==, zfsvfs);
|
2017-01-23 21:53:46 +03:00
|
|
|
VERIFY(dsl_dataset_long_held(ds));
|
dmu_objset_from_ds must be called with dp_config_rwlock held
The normal lock order is that the dp_config_rwlock must be held before
the ds_opening_lock. For example, dmu_objset_hold() does this.
However, dmu_objset_open_impl() is called with the ds_opening_lock held,
and if the dp_config_rwlock is not already held, it will attempt to
acquire it. This may lead to deadlock, since the lock order is
reversed.
Looking at all the callers of dmu_objset_open_impl() (which is
principally the callers of dmu_objset_from_ds()), almost all callers
already have the dp_config_rwlock. However, there are a few places in
the send and receive code paths that do not. For example:
dsl_crypto_populate_key_nvlist, send_cb, dmu_recv_stream,
receive_write_byref, redact_traverse_thread.
This commit resolves the problem by requiring all callers of
dmu_objset_from_ds() to hold the dp_config_rwlock. In most cases, the
code has been restructured such that we call dmu_objset_from_ds()
earlier on in the send and receive processes, when we already have the
dp_config_rwlock, and save the objset_t until we need it in the middle
of the send or receive (similar to what we already do with the
dsl_dataset_t). Thus we do not need to acquire the dp_config_rwlock in
many new places.
I also cleaned up code in dmu_redact_snap() and send_traverse_thread().
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #9662
Closes #10115
2020-03-12 20:55:02 +03:00
|
|
|
dsl_pool_t *dp = spa_get_dsl(dsl_dataset_get_spa(ds));
|
|
|
|
dsl_pool_config_enter(dp, FTAG);
|
2016-05-11 06:49:02 +03:00
|
|
|
VERIFY0(dmu_objset_from_ds(ds, &os));
|
2020-03-12 20:55:02 +03:00
|
|
|
dsl_pool_config_exit(dp, FTAG);
|
2010-08-18 23:59:31 +04:00
|
|
|
|
2016-05-11 06:49:02 +03:00
|
|
|
err = zfsvfs_init(zfsvfs, os);
|
|
|
|
if (err != 0)
|
2013-07-27 21:50:07 +04:00
|
|
|
goto bail;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2020-04-01 20:02:06 +03:00
|
|
|
ds->ds_dir->dd_activity_cancelled = B_FALSE;
|
2017-03-09 01:56:19 +03:00
|
|
|
VERIFY(zfsvfs_setup(zfsvfs, B_FALSE) == 0);
|
2010-08-18 23:59:31 +04:00
|
|
|
|
2017-03-08 03:21:37 +03:00
|
|
|
zfs_set_fuid_feature(zfsvfs);
|
|
|
|
zfsvfs->z_rollback_time = jiffies;
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2013-07-27 21:50:07 +04:00
|
|
|
/*
|
|
|
|
* Attempt to re-establish all the active inodes with their
|
|
|
|
* dbufs. If a zfs_rezget() fails, then we unhash the inode
|
|
|
|
* and mark it stale. This prevents a collision if a new
|
|
|
|
* inode/object is created which must use the same inode
|
|
|
|
* number. The stale inode will be released when the
|
|
|
|
* VFS prunes the dentry holding the remaining references
|
|
|
|
* on the stale inode.
|
|
|
|
*/
|
2017-03-08 03:21:37 +03:00
|
|
|
mutex_enter(&zfsvfs->z_znodes_lock);
|
|
|
|
for (zp = list_head(&zfsvfs->z_all_znodes); zp;
|
|
|
|
zp = list_next(&zfsvfs->z_all_znodes, zp)) {
|
2013-11-10 05:22:06 +04:00
|
|
|
err2 = zfs_rezget(zp);
|
|
|
|
if (err2) {
|
2019-12-05 03:52:27 +03:00
|
|
|
zpl_d_drop_aliases(ZTOI(zp));
|
2013-07-27 21:50:07 +04:00
|
|
|
remove_inode_hash(ZTOI(zp));
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2019-08-27 19:55:51 +03:00
|
|
|
|
|
|
|
/* see comment in zfs_suspend_fs() */
|
|
|
|
if (zp->z_suspended) {
|
2019-12-11 22:53:57 +03:00
|
|
|
zfs_zrele_async(zp);
|
2019-08-27 19:55:51 +03:00
|
|
|
zp->z_suspended = B_FALSE;
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
2017-03-08 03:21:37 +03:00
|
|
|
mutex_exit(&zfsvfs->z_znodes_lock);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2019-02-12 21:41:15 +03:00
|
|
|
if (!zfs_is_readonly(zfsvfs) && !zfsvfs->z_unmounted) {
|
|
|
|
/*
|
|
|
|
* zfs_suspend_fs() could have interrupted freeing
|
|
|
|
* of dnodes. We need to restart this freeing so
|
|
|
|
* that we don't "leak" the space.
|
|
|
|
*/
|
|
|
|
zfs_unlinked_drain(zfsvfs);
|
|
|
|
}
|
|
|
|
|
2019-11-11 20:34:21 +03:00
|
|
|
/*
|
|
|
|
* Most of the time zfs_suspend_fs is used for changing the contents
|
|
|
|
* of the underlying dataset. ZFS rollback and receive operations
|
|
|
|
* might create files for which negative dentries are present in
|
|
|
|
* the cache. Since walking the dcache would require a lot of GPL-only
|
|
|
|
* code duplication, it's much easier on these rather rare occasions
|
|
|
|
* just to flush the whole dcache for the given dataset/filesystem.
|
|
|
|
*/
|
|
|
|
shrink_dcache_sb(zfsvfs->z_sb);
|
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
bail:
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 19:48:13 +03:00
|
|
|
if (err != 0)
|
|
|
|
zfsvfs->z_unmounted = B_TRUE;
|
|
|
|
|
2013-01-16 04:41:09 +04:00
|
|
|
/* release the VFS ops */
|
2017-03-08 03:21:37 +03:00
|
|
|
rw_exit(&zfsvfs->z_teardown_inactive_lock);
|
2020-11-05 01:23:48 +03:00
|
|
|
ZFS_TEARDOWN_EXIT(zfsvfs, FTAG);
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2019-06-19 19:48:13 +03:00
|
|
|
if (err != 0) {
|
2008-11-20 23:01:55 +03:00
|
|
|
/*
|
2013-07-27 21:50:07 +04:00
|
|
|
* Since we couldn't set up the SA framework, try to force
|
|
|
|
* unmount this file system.
|
2008-11-20 23:01:55 +03:00
|
|
|
*/
|
2017-03-08 03:21:37 +03:00
|
|
|
if (zfsvfs->z_os)
|
|
|
|
(void) zfs_umount(zfsvfs->z_sb);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
return (err);
|
|
|
|
}
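Illustrative sketch (not part of the file): the commit message quoted inside zfs_resume_fs() above gives the normal lock order as dp_config_rwlock first, then ds_opening_lock, which is why the function enters the pool config lock before calling dmu_objset_from_ds(). The toy rank checker below models only that ordering rule. Every name in it (lock_tracker_t, tracker_acquire(), tracker_release(), the RANK_* constants) is hypothetical and no real ZFS locking primitive is used.

```c
#include <assert.h>

/* Hypothetical ranks: a lower rank must be acquired before a higher one. */
enum {
	RANK_DP_CONFIG = 1,	/* stands in for dp_config_rwlock */
	RANK_DS_OPENING = 2	/* stands in for ds_opening_lock */
};

#define	TRACKER_MAX_DEPTH	8

/* Stack of the ranks currently held by one thread. */
typedef struct lock_tracker {
	int held[TRACKER_MAX_DEPTH];
	int depth;
} lock_tracker_t;

/* Returns 0 on success, -1 if taking `rank` now would reverse the order. */
static int
tracker_acquire(lock_tracker_t *t, int rank)
{
	if (t->depth > 0 && t->held[t->depth - 1] >= rank)
		return (-1);
	assert(t->depth < TRACKER_MAX_DEPTH);
	t->held[t->depth++] = rank;
	return (0);
}

/* Locks are released in the reverse of the order they were acquired. */
static void
tracker_release(lock_tracker_t *t, int rank)
{
	assert(t->depth > 0 && t->held[t->depth - 1] == rank);
	t->depth--;
}
```

Acquiring RANK_DP_CONFIG and then RANK_DS_OPENING succeeds, while holding RANK_DS_OPENING and then asking for RANK_DP_CONFIG is rejected. That second sequence is the reversed order the commit message describes as a potential deadlock.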
|
|
|
|
|
2019-06-19 19:48:13 +03:00
|
|
|
/*
|
|
|
|
* Release VOPs and unmount a suspended filesystem.
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
zfs_end_fs(zfsvfs_t *zfsvfs, dsl_dataset_t *ds)
|
|
|
|
{
|
2020-11-05 01:23:48 +03:00
|
|
|
ASSERT(ZFS_TEARDOWN_WRITE_HELD(zfsvfs));
|
2019-06-19 19:48:13 +03:00
|
|
|
ASSERT(RW_WRITE_HELD(&zfsvfs->z_teardown_inactive_lock));
|
|
|
|
|
|
|
|
/*
|
|
|
|
* We already own this, so just hold and rele it to update the
|
|
|
|
* objset_t, as the one we had before may have been evicted.
|
|
|
|
*/
|
|
|
|
objset_t *os;
|
|
|
|
VERIFY3P(ds->ds_owner, ==, zfsvfs);
|
|
|
|
VERIFY(dsl_dataset_long_held(ds));
|
2020-03-12 20:55:02 +03:00
|
|
|
dsl_pool_t *dp = spa_get_dsl(dsl_dataset_get_spa(ds));
|
|
|
|
dsl_pool_config_enter(dp, FTAG);
|
2019-06-19 19:48:13 +03:00
|
|
|
VERIFY0(dmu_objset_from_ds(ds, &os));
|
2020-03-12 20:55:02 +03:00
|
|
|
dsl_pool_config_exit(dp, FTAG);
|
Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to
a target system. One possible use case for this feature is to not
transmit sensitive information to a data warehousing, test/dev, or
analytics environment. Another is to save space by not replicating
unimportant data within a given dataset, for example in backup tools
like zrepl.
Redacted send/receive is a three-stage process. First, a clone (or
clones) is made of the snapshot to be sent to the target. In this
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction
snapshot" (or snapshots). Second, the new zfs redact command is used
to create a redaction bookmark. The redaction bookmark stores the
list of blocks in a snapshot that were modified by the redaction
snapshot(s). Finally, the redaction bookmark is passed as a parameter
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive
or unwanted information, and those blocks are not included in the send
stream. When sending from the redaction bookmark, the blocks it
contains are considered as candidate blocks in addition to those
blocks in the destination snapshot that were modified since the
creation_txg of the redaction bookmark. This step is necessary to
allow the target to rehydrate data in the case where some blocks are
accidentally or unnecessarily modified in the redaction snapshot.
The changes to bookmarks to enable fast space estimation involve
adding deadlists to bookmarks. There is also logic to manage the
life cycles of these deadlists.
The new size estimation process operates in cases where previously
an accurate estimate could not be provided. In those cases, a send
is performed where no data blocks are read, reducing the runtime
significantly and providing a byte-accurate size estimate.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
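The filtering step described in the commit message above can be sketched in plain C. This is only an illustration under stated assumptions, not the ZFS implementation: `redact_range_t`, `blk_is_redacted()` and `filter_send_blocks()` are hypothetical names, and the redaction bookmark is modeled as a sorted, non-overlapping array of half-open block ranges.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one entry in a redaction bookmark's block list. */
typedef struct redact_range {
	uint64_t rr_start;	/* first redacted block id */
	uint64_t rr_end;	/* one past the last redacted block id */
} redact_range_t;

/*
 * Return true if blkid falls inside any range of the sorted,
 * non-overlapping list. Uses a binary search over the ranges.
 */
static bool
blk_is_redacted(const redact_range_t *list, size_t n, uint64_t blkid)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (blkid < list[mid].rr_start)
			hi = mid;
		else if (blkid >= list[mid].rr_end)
			lo = mid + 1;
		else
			return (true);
	}
	return (false);
}

/*
 * Copy the block ids that may be sent into out[] and return how many
 * were kept. Blocks covered by the redaction list are skipped, which
 * models "those blocks are not included in the send stream".
 */
static size_t
filter_send_blocks(const redact_range_t *list, size_t n,
    const uint64_t *blks, size_t nblks, uint64_t *out)
{
	size_t kept = 0;

	assert(out != NULL || nblks == 0);
	for (size_t i = 0; i < nblks; i++) {
		if (!blk_is_redacted(list, n, blks[i]))
			out[kept++] = blks[i];
	}
	return (kept);
}
```

With the ranges [10, 20) and [40, 41), candidate blocks 5, 10, 19, 20, 40 and 41 reduce to 5, 20 and 41. The real code works on block pointers and object ranges rather than bare integers.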
2019-06-19 19:48:13 +03:00
|
|
|
zfsvfs->z_os = os;
|
|
|
|
|
|
|
|
/* release the VOPs */
|
|
|
|
rw_exit(&zfsvfs->z_teardown_inactive_lock);
|
2020-11-05 01:23:48 +03:00
|
|
|
ZFS_TEARDOWN_EXIT(zfsvfs, FTAG);
|
2019-06-19 19:48:13 +03:00
|
|
|
|
|
|
|
/*
|
|
|
|
* Try to force unmount this file system.
|
|
|
|
*/
|
|
|
|
(void) zfs_umount(zfsvfs->z_sb);
|
|
|
|
zfsvfs->z_unmounted = B_TRUE;
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
2019-11-11 20:34:21 +03:00
|
|
|
/*
|
|
|
|
* Automounted snapshots rely on periodic revalidation
|
|
|
|
* to defer snapshots from being automatically unmounted.
|
|
|
|
*/
|
|
|
|
|
|
|
|
inline void
|
|
|
|
zfs_exit_fs(zfsvfs_t *zfsvfs)
|
|
|
|
{
|
|
|
|
if (!zfsvfs->z_issnap)
|
|
|
|
return;
|
|
|
|
|
|
|
|
if (time_after(jiffies, zfsvfs->z_snap_defer_time +
|
|
|
|
MAX(zfs_expire_snapshot * HZ / 2, HZ))) {
|
|
|
|
zfsvfs->z_snap_defer_time = jiffies;
|
|
|
|
zfsctl_snapshot_unmount_delay(zfsvfs->z_os->os_spa,
|
|
|
|
dmu_objset_id(zfsvfs->z_os),
|
|
|
|
zfs_expire_snapshot);
|
|
|
|
}
|
|
|
|
}
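The throttling in zfs_exit_fs() above relies on a wraparound-safe tick comparison. A minimal userspace sketch of that idea follows. `tick_after()` is modeled on the kernel's time_after() macro, while `should_refresh_defer()` and its parameters are hypothetical names that stand in for jiffies, z_snap_defer_time, zfs_expire_snapshot and HZ. The signed cast assumes the usual two's-complement conversion.

```c
#include <stdbool.h>

/*
 * Wraparound-safe "a is later than b" for a free-running tick counter.
 * The unsigned subtraction wraps, and the signed cast turns "b is
 * behind a" into a negative value.
 */
static bool
tick_after(unsigned long a, unsigned long b)
{
	return ((long)(b - a) < 0);
}

/*
 * Decide whether the unmount deferral should be refreshed: only when
 * more than max(expire_secs / 2, 1) seconds worth of ticks have passed
 * since the last refresh.
 */
static bool
should_refresh_defer(unsigned long now, unsigned long last_defer,
    unsigned long expire_secs, unsigned long hz)
{
	unsigned long half = expire_secs * hz / 2;
	unsigned long interval = (half > hz) ? half : hz;

	return (tick_after(now, last_defer + interval));
}
```

Refreshing at half the expiry period, with a floor of one second, keeps an actively used snapshot mounted without rescheduling the delayed unmount on every call.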
|
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
int
|
2017-03-08 03:21:37 +03:00
|
|
|
zfs_set_version(zfsvfs_t *zfsvfs, uint64_t newvers)
|
2008-11-20 23:01:55 +03:00
|
|
|
{
|
|
|
|
int error;
|
2017-03-08 03:21:37 +03:00
|
|
|
objset_t *os = zfsvfs->z_os;
|
2008-11-20 23:01:55 +03:00
|
|
|
dmu_tx_t *tx;
|
|
|
|
|
|
|
|
if (newvers < ZPL_VERSION_INITIAL || newvers > ZPL_VERSION)
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(EINVAL));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2017-03-08 03:21:37 +03:00
|
|
|
if (newvers < zfsvfs->z_version)
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(EINVAL));
|
2008-11-20 23:01:55 +03:00
|
|
|
|
2010-05-29 00:45:14 +04:00
|
|
|
if (zfs_spa_version_map(newvers) >
|
2017-03-08 03:21:37 +03:00
|
|
|
spa_version(dmu_objset_spa(zfsvfs->z_os)))
|
2013-03-08 22:41:28 +04:00
|
|
|
return (SET_ERROR(ENOTSUP));
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
tx = dmu_tx_create(os);
|
2009-07-03 02:44:48 +04:00
|
|
|
dmu_tx_hold_zap(tx, MASTER_NODE_OBJ, B_FALSE, ZPL_VERSION_STR);
|
2017-03-08 03:21:37 +03:00
|
|
|
if (newvers >= ZPL_VERSION_SA && !zfsvfs->z_use_sa) {
|
2010-05-29 00:45:14 +04:00
|
|
|
dmu_tx_hold_zap(tx, MASTER_NODE_OBJ, B_TRUE,
|
|
|
|
ZFS_SA_ATTRS);
|
|
|
|
dmu_tx_hold_zap(tx, DMU_NEW_OBJECT, FALSE, NULL);
|
|
|
|
}
|
2008-11-20 23:01:55 +03:00
|
|
|
error = dmu_tx_assign(tx, TXG_WAIT);
|
|
|
|
if (error) {
|
|
|
|
dmu_tx_abort(tx);
|
2009-07-03 02:44:48 +04:00
|
|
|
return (error);
|
|
|
|
}
|
2010-05-29 00:45:14 +04:00
|
|
|
|
2009-07-03 02:44:48 +04:00
|
|
|
error = zap_update(os, MASTER_NODE_OBJ, ZPL_VERSION_STR,
|
|
|
|
8, 1, &newvers, tx);
|
|
|
|
|
|
|
|
if (error) {
|
|
|
|
dmu_tx_commit(tx);
|
|
|
|
return (error);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
|
|
|
|
|
2017-03-08 03:21:37 +03:00
|
|
|
if (newvers >= ZPL_VERSION_SA && !zfsvfs->z_use_sa) {
|
2010-05-29 00:45:14 +04:00
|
|
|
uint64_t sa_obj;
|
|
|
|
|
2017-03-08 03:21:37 +03:00
|
|
|
ASSERT3U(spa_version(dmu_objset_spa(zfsvfs->z_os)), >=,
|
2010-05-29 00:45:14 +04:00
|
|
|
SPA_VERSION_SA);
|
|
|
|
sa_obj = zap_create(os, DMU_OT_SA_MASTER_NODE,
|
|
|
|
DMU_OT_NONE, 0, tx);
|
|
|
|
|
|
|
|
error = zap_add(os, MASTER_NODE_OBJ,
|
|
|
|
ZFS_SA_ATTRS, 8, 1, &sa_obj, tx);
|
2013-05-11 01:17:03 +04:00
|
|
|
ASSERT0(error);
|
2010-05-29 00:45:14 +04:00
|
|
|
|
|
|
|
VERIFY(0 == sa_set_sa_object(os, sa_obj));
|
|
|
|
sa_register_update_callback(os, zfs_sa_upgrade);
|
|
|
|
}
|
|
|
|
|
2013-08-28 15:45:09 +04:00
|
|
|
spa_history_log_internal_ds(dmu_objset_ds(os), "upgrade", tx,
|
2017-03-08 03:21:37 +03:00
|
|
|
"from %llu to %llu", zfsvfs->z_version, newvers);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2008-11-20 23:01:55 +03:00
|
|
|
dmu_tx_commit(tx);
|
|
|
|
|
2017-03-08 03:21:37 +03:00
|
|
|
zfsvfs->z_version = newvers;
|
2018-07-10 20:49:50 +03:00
|
|
|
os->os_version = newvers;
|
2009-07-03 02:44:48 +04:00
|
|
|
|
2017-03-08 03:21:37 +03:00
|
|
|
zfs_set_fuid_feature(zfsvfs);
|
2009-07-03 02:44:48 +04:00
|
|
|
|
|
|
|
return (0);
|
2008-11-20 23:01:55 +03:00
|
|
|
}
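The order of the checks at the top of zfs_set_version() can be summarized in a standalone sketch. The constants and `check_version_upgrade()` are hypothetical stand-ins: the sketch assumes a valid range of 1 to 5 and takes the pool version required by the new ZPL version as a parameter instead of calling zfs_spa_version_map().

```c
#include <errno.h>
#include <stdint.h>

#define	SKETCH_ZPL_VERSION_MIN	1	/* assumed lowest valid version */
#define	SKETCH_ZPL_VERSION_MAX	5	/* assumed highest valid version */

/*
 * Mirror the validation order used before any transaction is created:
 * 1. the new version must be in the supported range,
 * 2. downgrades are refused,
 * 3. the pool must be new enough for the requested version.
 * Returns 0 if the upgrade may proceed, else an errno value.
 */
static int
check_version_upgrade(uint64_t cur, uint64_t newvers,
    uint64_t required_pool_vers, uint64_t pool_vers)
{
	if (newvers < SKETCH_ZPL_VERSION_MIN ||
	    newvers > SKETCH_ZPL_VERSION_MAX)
		return (EINVAL);

	if (newvers < cur)
		return (EINVAL);

	if (required_pool_vers > pool_vers)
		return (ENOTSUP);

	return (0);
}
```

Setting the version to its current value passes these checks, because only a strictly lower version is treated as a downgrade.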
|
|
|
|
|
2017-01-27 22:46:39 +03:00
|
|
|
/*
|
2019-09-03 03:56:41 +03:00
|
|
|
* Return true if the corresponding vfs's unmounted flag is set.
|
2017-01-27 22:46:39 +03:00
|
|
|
* Otherwise return false.
|
|
|
|
* If this function returns true we know VFS unmount has been initiated.
|
|
|
|
*/
|
|
|
|
boolean_t
|
|
|
|
zfs_get_vfs_flag_unmounted(objset_t *os)
|
|
|
|
{
|
2017-03-08 03:21:37 +03:00
|
|
|
zfsvfs_t *zfvp;
|
2017-01-27 22:46:39 +03:00
|
|
|
boolean_t unmounted = B_FALSE;
|
|
|
|
|
|
|
|
ASSERT(dmu_objset_type(os) == DMU_OST_ZFS);
|
|
|
|
|
|
|
|
mutex_enter(&os->os_user_ptr_lock);
|
|
|
|
zfvp = dmu_objset_get_user(os);
|
|
|
|
if (zfvp != NULL && zfvp->z_unmounted)
|
|
|
|
unmounted = B_TRUE;
|
|
|
|
mutex_exit(&os->os_user_ptr_lock);
|
|
|
|
|
|
|
|
return (unmounted);
|
|
|
|
}
|
|
|
|
|
2020-09-02 02:14:16 +03:00
|
|
|
void
|
|
|
|
zfsvfs_update_fromname(const char *oldname, const char *newname)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* We don't need to do anything here; the devname is always current by
|
|
|
|
* virtue of zfsvfs->z_sb->s_op->show_devname.
|
|
|
|
*/
|
2022-02-16 04:38:43 +03:00
|
|
|
(void) oldname, (void) newname;
|
2020-09-02 02:14:16 +03:00
|
|
|
}
|
|
|
|
|
2011-02-08 22:16:06 +03:00
|
|
|
void
|
|
|
|
zfs_init(void)
|
|
|
|
{
|
2011-11-11 11:15:53 +04:00
|
|
|
zfsctl_init();
|
2011-02-08 22:16:06 +03:00
|
|
|
zfs_znode_init();
|
File incorrectly zeroed when receiving incremental stream that toggles -L
Background:
By increasing the recordsize property above the default of 128KB, a
filesystem may have "large" blocks. By default, a send stream of such a
filesystem does not contain large WRITE records; instead, it decreases
objects' block sizes to 128KB and splits the large blocks into 128KB
blocks, allowing the large-block filesystem to be received by a system
that does not support the `large_blocks` feature. A send stream
generated by `zfs send -L` (or `--large-block`) preserves the large
block size on the receiving system, by using large WRITE records.
When receiving an incremental send stream for a filesystem with large
blocks, if the send stream's -L flag was toggled, a bug is encountered
in which the file's contents are incorrectly zeroed out. The contents
of any blocks that were not modified by this send stream will be lost.
"Toggled" means that the previous send used `-L`, but this incremental
does not use `-L` (-L to no-L); or that the previous send did not use
`-L`, but this incremental does use `-L` (no-L to -L).
Changes:
This commit addresses the problem with several changes to the semantics
of zfs send/receive:
1. "-L to no-L" incrementals are rejected. If the previous send used
`-L`, but this incremental does not use `-L`, the `zfs receive` will
fail with this error message:
incremental send stream requires -L (--large-block), to match
previous receive.
2. "no-L to -L" incrementals are handled correctly, preserving the
smaller (128KB) block size of any already-received files that used large
blocks on the sending system but were split by `zfs send` without the
`-L` flag.
3. A new send stream format flag is added, `SWITCH_TO_LARGE_BLOCKS`.
This feature indicates that we can correctly handle "no-L to -L"
incrementals. This flag is currently not set on any send streams. In
the future, we intend for incremental send streams of snapshots that
have large blocks to use `-L` by default, and these streams will also
have the `SWITCH_TO_LARGE_BLOCKS` feature set. This ensures that streams
from the default use of `zfs send` won't encounter the bug mentioned
above, because they can't be received by software with the bug.
Implementation notes:
To facilitate accessing the ZPL's generation number,
`zfs_space_delta_cb()` has been renamed to `zpl_get_file_info()` and
restructured to fill in a struct with ZPL-specific info including owner
and generation.
In the "no-L to -L" case, if this is a compressed send stream (from
`zfs send -cL`), large WRITE records that are being written to small
(128KB) blocksize files need to be decompressed so that they can be
written split up into multiple blocks. The zio pipeline will recompress
each smaller block individually.
A new test case, `send-L_toggle`, is added, which tests the "no-L to -L"
case and verifies that we get an error for the "-L to no-L" case.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #6224
Closes #10383
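The receive-side rules listed in the commit message above can be written as a small decision function. This is a sketch under stated assumptions rather than the actual receive code: `recv_action_t`, `classify_incremental()` and `split_count()` are hypothetical names, and 128KB is taken from the message as the block size used when large blocks are split.

```c
#include <stdbool.h>
#include <stdint.h>

#define	SPLIT_BLOCK_SIZE	(128ULL * 1024)	/* 128KB, per the message */

typedef enum recv_action {
	RECV_OK,		/* -L setting unchanged: receive normally */
	RECV_REJECT,		/* "-L to no-L": the receive must fail */
	RECV_KEEP_SMALL_BLOCKS	/* "no-L to -L": keep 128KB block size */
} recv_action_t;

/*
 * Classify an incremental stream by whether the previous send and
 * this send used -L (--large-block).
 */
static recv_action_t
classify_incremental(bool prev_used_L, bool this_uses_L)
{
	if (prev_used_L && !this_uses_L)
		return (RECV_REJECT);
	if (!prev_used_L && this_uses_L)
		return (RECV_KEEP_SMALL_BLOCKS);
	return (RECV_OK);
}

/*
 * Number of 128KB pieces that a large WRITE record of len bytes is
 * split into when it is written to a small-blocksize file.
 */
static uint64_t
split_count(uint64_t len)
{
	return ((len + SPLIT_BLOCK_SIZE - 1) / SPLIT_BLOCK_SIZE);
}
```

For example, a 1MB WRITE record received in the "no-L to -L" case would be written as eight 128KB blocks. For a compressed stream the record is decompressed first, as the implementation notes describe.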
2020-06-09 20:41:01 +03:00
|
|
|
dmu_objset_register_type(DMU_OST_ZFS, zpl_get_file_info);
|
2011-02-08 22:16:06 +03:00
|
|
|
register_filesystem(&zpl_fs_type);
|
2023-06-25 13:50:19 +03:00
|
|
|
#ifdef HAVE_VFS_FILE_OPERATIONS_EXTEND
|
|
|
|
register_fo_extend(&zpl_file_operations);
|
|
|
|
#endif
|
2011-02-08 22:16:06 +03:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
zfs_fini(void)
|
|
|
|
{
|
2016-10-28 23:37:00 +03:00
|
|
|
/*
|
|
|
|
* We don't use taskq_wait_outstanding() because zpl_posix_acl_free() might add more.
|
|
|
|
*/
|
2016-12-01 00:56:50 +03:00
|
|
|
taskq_wait(system_delay_taskq);
|
2016-10-28 23:37:00 +03:00
|
|
|
taskq_wait(system_taskq);
|
2023-06-25 13:50:19 +03:00
|
|
|
#ifdef HAVE_VFS_FILE_OPERATIONS_EXTEND
|
|
|
|
unregister_fo_extend(&zpl_file_operations);
|
|
|
|
#endif
|
2011-02-08 22:16:06 +03:00
|
|
|
unregister_filesystem(&zpl_fs_type);
|
|
|
|
zfs_znode_fini();
|
2011-11-11 11:15:53 +04:00
|
|
|
zfsctl_fini();
|
2011-02-08 22:16:06 +03:00
|
|
|
}
|
2017-03-09 01:56:19 +03:00
|
|
|
|
2018-02-16 04:53:18 +03:00
|
|
|
#if defined(_KERNEL)
|
2017-03-09 01:56:19 +03:00
|
|
|
EXPORT_SYMBOL(zfs_suspend_fs);
|
|
|
|
EXPORT_SYMBOL(zfs_resume_fs);
|
|
|
|
EXPORT_SYMBOL(zfs_set_version);
|
|
|
|
EXPORT_SYMBOL(zfsvfs_create);
|
|
|
|
EXPORT_SYMBOL(zfsvfs_free);
|
|
|
|
EXPORT_SYMBOL(zfs_is_readonly);
|
|
|
|
EXPORT_SYMBOL(zfs_domount);
|
|
|
|
EXPORT_SYMBOL(zfs_preumount);
|
|
|
|
EXPORT_SYMBOL(zfs_umount);
|
|
|
|
EXPORT_SYMBOL(zfs_remount);
|
|
|
|
EXPORT_SYMBOL(zfs_statvfs);
|
|
|
|
EXPORT_SYMBOL(zfs_vget);
|
|
|
|
EXPORT_SYMBOL(zfs_prune);
|
|
|
|
#endif
|