Commit Graph

132 Commits

Author SHA1 Message Date
Yuri Pankov
cb110f254e OpenZFS 9763 - zfs(1M): broken formatting in allow/unallow description
Porting notes:
* Two of the three changes from the upstream patch were already
  applied for Linux.  Only the last one is required.

Authored by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9763
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8a702e55
Closes #7972
2018-10-02 16:12:54 -07:00
Gregor Kopka
48b0b649fd Man page fixes - zpool/zfs optional parameters
The man pages for zpool and zfs (get command)
listed the pool/dataset parameter as required,
but these are optional. Fixed that.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gregor Kopka <gregor@kopka.net>
Closes #7916
2018-09-18 08:55:33 -07:00
LOLi
0238a9755b Fix 'zfs allow' for create time permissions
When no permission set is defined for a dataset the create time
permissions are incorrectly shown as if they were a permission set.
This change simply correct how allow permissions are displayed.

This commit also fixes a small manpage formatting issue and adds the
"zfs_allow_003_pos" test case to the ZFS Test Suite.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7519 
Closes #7860
2018-09-06 13:11:21 -07:00
Don Brady
cc99f275a2 Pool allocation classes
Allocation Classes add the ability to have allocation classes in a
pool that are dedicated to serving specific block categories, such
as DDT data, metadata, and small file blocks. A pool can opt-in to
this feature by adding a 'special' or 'dedup' top-level VDEV.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Håkan Johansson <f96hajo@chalmers.se>
Reviewed-by: Andreas Dilger <andreas.dilger@chamcloud.com>
Reviewed-by: DHE <git@dehacked.net>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Gregor Kopka <gregor@kopka.net>
Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #5182
2018-09-05 18:33:36 -07:00
Serapheim Dimitropoulos
a448a2557e Introduce read/write kstats per dataset
The following patch introduces a few statistics on reads and writes
grouped by dataset. These statistics are implemented as kstats
(backed by aggregate sums for performance) and can be retrieved by
using the dataset objset ID number. The motivation for this change is
to provide some preliminary analytics on dataset usage/performance.

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #7705
2018-08-20 09:52:37 -07:00
LOLi
a9d6270acb 'zfs holds' scripted mode is not documented
This change simply documents the existing "scripted mode" option in
both command help and man page.

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7798
2018-08-18 15:47:41 -07:00
Tom Caputi
d9c460a0b6 Added encryption support for zfs recv -o / -x
One small integration that was absent from b52563 was
support for zfs recv -o / -x with regards to encryption
parameters. The main use cases of this are as follows:

* Receiving an unencrypted stream as encrypted without
  needing to create a "dummy" encrypted parent so that
  encryption can be inheritted.

* Allowing users to change their keylocation on receive,
  so long as the receiving dataset is an encryption root.

* Allowing users to explicitly exclude or override the
  encryption property from an unencrypted properties stream,
  allowing it to be received as encrypted.

* Receiving a recursive heirarchy of unencrypted datasets,
  encrypting the top-level one and forcing all children to
  inherit the encryption.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7650
2018-08-15 09:48:49 -07:00
Toomas Soome
5fadb7fb0c OpenZFS 8906 - uts: illumos rootfs should support salted cksum
Porting notes:
* As of grub-2.02 these checksums are not supported.  However, as
  pointed out in #6501 there are alternatives such as EFISTUB which
  work and have no such restriction.  A warning was added to the
  checksum property section of the zfs.8 man page.

Authored by: Toomas Soome <tsoome@me.com>
Reviewed by: C Fraire <cfraire@me.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/8906
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7dec52f
Closes #6501
Closes #7714
2018-07-27 08:35:28 -07:00
Serapheim Dimitropoulos
a7ed98d8b5 OpenZFS 9330 - stack overflow when creating a deeply nested dataset
Datasets that are deeply nested (~100 levels) are impractical. We just
put a limit of 50 levels to newly created datasets. Existing datasets
should work without a problem.

The problem can be seen by attempting to create a dataset using the -p
option with many levels:

    panic[cpu0]/thread=ffffff01cd282c20: BAD TRAP: type=8 (#df Double fault) rp=ffffffff

    fffffffffbc3aa60 unix:die+100 ()
    fffffffffbc3ab70 unix:trap+157d ()
    ffffff00083d7020 unix:_patch_xrstorq_rbx+196 ()
    ffffff00083d7050 zfs:dbuf_rele+2e ()
    ...
    ffffff00083d7080 zfs:dsl_dir_close+32 ()
    ffffff00083d70b0 zfs:dsl_dir_evict+30 ()
    ffffff00083d70d0 zfs:dbuf_evict_user+4a ()
    ffffff00083d7100 zfs:dbuf_rele_and_unlock+87 ()
    ffffff00083d7130 zfs:dbuf_rele+2e ()
    ... The block above repeats once per directory in the ...
    ... create -p command, working towards the root ...
    ffffff00083db9f0 zfs:dsl_dataset_drop_ref+19 ()
    ffffff00083dba20 zfs:dsl_dataset_rele+42 ()
    ffffff00083dba70 zfs:dmu_objset_prefetch+e4 ()
    ffffff00083dbaa0 zfs:findfunc+23 ()
    ffffff00083dbb80 zfs:dmu_objset_find_spa+38c ()
    ffffff00083dbbc0 zfs:dmu_objset_find+40 ()
    ffffff00083dbc20 zfs:zfs_ioc_snapshot_list_next+4b ()
    ffffff00083dbcc0 zfs:zfsdev_ioctl+347 ()
    ffffff00083dbd00 genunix:cdev_ioctl+45 ()
    ffffff00083dbd40 specfs:spec_ioctl+5a ()
    ffffff00083dbdc0 genunix:fop_ioctl+7b ()
    ffffff00083dbec0 genunix:ioctl+18e ()
    ffffff00083dbf10 unix:brand_sys_sysenter+1c9 ()

Porting notes:
* Added zfs_max_dataset_nesting module option with documentation.
* Updated zfs_rename_014_neg.ksh for Linux.
* Increase the zfs.sh stack warning to 15K.  Enough time has passed
  that 16K can be reasonably assumed to be the default value.  It
  was increased in the 3.15 kernel released in June of 2014.

Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>

OpenZFS-issue: https://www.illumos.org/issues/9330
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/757a75a
Closes #7681
2018-07-09 13:02:50 -07:00
Antonio Russo
7106b23640 Minor documentation, logging, and testing typos
This patch collects some minor inconsistencies and typos in the
documentation, logging and testing infrastructure.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com>
Closes #7608
2018-06-07 09:38:39 -07:00
Matthew Ahrens
a1d477c24c OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete

This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk.  The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.

The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool.  An entry becomes obsolete when all the blocks that use
it are freed.  An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones).  Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible.  This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.

Note that when a device is removed, we do not verify the checksum of
the data that is copied.  This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.

At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.

Porting Notes:

* Avoid zero-sized kmem_alloc() in vdev_compact_children().

    The device evacuation code adds a dependency that
    vdev_compact_children() be able to properly empty the vdev_child
    array by setting it to NULL and zeroing vdev_children.  Under Linux,
    kmem_alloc() and related functions return a sentinel pointer rather
    than NULL for zero-sized allocations.

* Remove comment regarding "mpt" driver where zfs_remove_max_segment
  is initialized to SPA_MAXBLOCKSIZE.

  Change zfs_condense_indirect_commit_entry_delay_ticks to
  zfs_condense_indirect_commit_entry_delay_ms for consistency with
  most other tunables in which delays are specified in ms.

* ZTS changes:

    Use set_tunable rather than mdb
    Use zpool sync as appropriate
    Use sync_pool instead of sync
    Kill jobs during test_removal_with_operation to allow unmount/export
    Don't add non-disk names such as "mirror" or "raidz" to $DISKS
    Use $TEST_BASE_DIR instead of /tmp
    Increase HZ from 100 to 1000 which is more common on Linux

    removal_multiple_indirection.ksh
        Reduce iterations in order to not time out on the code
        coverage builders.

    removal_resume_export:
        Functionally, the test case is correct but there exists a race
        where the kernel thread hasn't been fully started yet and is
        not visible.  Wait for up to 1 second for the removal thread
        to be started before giving up on it.  Also, increase the
        amount of data copied in order that the removal not finish
        before the export has a chance to fail.

* MMP compatibility, the concept of concrete versus non-concrete devices
  has slightly changed the semantics of vdev_writeable().  Update
  mmp_random_leaf_impl() accordingly.

* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
  feature which is not supported by OpenZFS.

* Added support for new vdev removal tracepoints.

* Test cases removal_with_zdb and removal_condense_export have been
  intentionally disabled.  When run manually they pass as intended,
  but when running in the automated test environment they produce
  unreliable results on the latest Fedora release.

  They may work better once the upstream pool import refectoring is
  merged into ZoL at which point they will be re-enabled.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2018-04-14 12:16:17 -07:00
Mike Gerdts
d22f3a8244 OpenZFS 9286 - want refreservation=auto
Authored by: Mike Gerdts <mike.gerdts@joyent.com>
Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Don Brady <don.brady@delphix.com>

Porting Notes:
* Adopted destroy_dataset in ZTS test cleanup
* Use ksh shebang instead of bash for new tests

OpenZFS-issue: https://www.illumos.org/issues/9286
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/723d0c85
Closes #7387
2018-04-11 14:52:13 -07:00
DeHackEd
dfb1ad027f zfs(8): fix dedup omission during mdoc overhaul
The property description has been updated with new algorithms as well.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: DHE <git@dehacked.net>
Closes #7377
2018-04-10 17:19:01 -07:00
Alek P
272b5d730f Add JSON output support to channel programs
The changes piggyback JSON output support on top of channel programs 
(#6558).  This way the JSON output support is targeted to scripting 
use cases and is easily maintainable since it really only touches 
one function (zfs_do_channel_program()).

This patch ports Joyent's JSON nvlist library from illumos to enable 
easy JSON printing of channel program output nvlist.  To keep the 
delta small I also took advantage of the fact that printing in
zfs_do_channel_program() was almost always done before exiting 
the program.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #7281
2018-03-19 12:40:58 -07:00
Paul Zuchowski
83362e8e67 Destroy makes full snap list before destroying
Change zfs destroy logic so destroying begins before
the entire list of snapshots is built.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #7271
2018-03-12 15:24:08 -07:00
LOLi
faa97c1619 Want 'zfs send -b'
This change implements 'zfs send -b' which can be used to send only
received property values whether or not they are overridden by local
settings.

This can be very useful during "restore" operations from a backup pool
because it allows to send only the property values originally sent
from the backup source, even though they were later modified on the
destination either by a 'zfs set' operation, explicit 'zfs inherit' or
overridden during the receive process via 'zfs receive -o|-x'.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7156
2018-02-21 12:32:06 -08:00
Nasf-Fan
9c5167d19f Project Quota on ZFS
Project quota is a new ZFS system space/object usage accounting
and enforcement mechanism. Similar as user/group quota, project
quota is another dimension of system quota. It bases on the new
object attribute - project ID.

Project ID is a numerical value to indicate to which project an
object belongs. An object only can belong to one project though
you (the object owner or privileged user) can change the object
project ID via 'chattr -p' or 'zfs project [-s] -p' explicitly.
The object also can inherit the project ID from its parent when
created if the parent has the project inherit flag (that can be
set via 'chattr +P' or 'zfs project -s [-p]').

By accounting the spaces/objects belong to the same project, we
can know how many spaces/objects used by the project. And if we
set the upper limit then we can control the spaces/objects that
are consumed by such project. It is useful when multiple groups
and users cooperate for the same project, or a user/group needs
to participate in multiple projects.

Support the following commands and functionalities:

zfs set projectquota@project
zfs set projectobjquota@project

zfs get projectquota@project
zfs get projectobjquota@project
zfs get projectused@project
zfs get projectobjused@project

zfs projectspace

zfs allow projectquota
zfs allow projectobjquota
zfs allow projectused
zfs allow projectobjused

zfs unallow projectquota
zfs unallow projectobjquota
zfs unallow projectused
zfs unallow projectobjused

chattr +/-P
chattr -p project_id
lsattr -p

This patch also supports tree quota based on the project quota via
"zfs project" commands set as following:
zfs project [-d|-r] <file|directory ...>
zfs project -C [-k] [-r] <file|directory ...>
zfs project -c [-0] [-d|-r] [-p id] <file|directory ...>
zfs project [-p id] [-r] [-s] <file|directory ...>

For "df [-i] $DIR" command, if we set INHERIT (project ID) flag on
the $DIR, then the proejct [obj]quota and [obj]used values for the
$DIR's project ID will be shown as the total/free (avail) resource.
Keep the same behavior as EXT4/XFS does.

Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by  Ned Bass <bass6@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Fan Yong <fan.yong@intel.com>
TEST_ZIMPORT_POOLS="zol-0.6.1 zol-0.6.2 master"
Change-Id: Ib4f0544602e03fb61fd46a849d7ba51a6005693c
Closes #6290
2018-02-13 14:54:54 -08:00
Serapheim Dimitropoulos
5b72a38d68 OpenZFS 8677 - Open-Context Channel Programs
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Don Brady <don.brady@delphix.com>

We want to be able to run channel programs outside of synching
context. This would greatly improve performance for channel programs
that just gather information, as they won't have to wait for synching
context anymore.

=== What is implemented?

This feature introduces the following:
- A new command line flag in "zfs program" to specify our intention
  to run in open context. (The -n option)
- A new flag/option within the channel program ioctl which selects
  the context.
- Appropriate error handling whenever we try a channel program in
  open-context that contains zfs.sync* expressions.
- Documentation for the new feature in the manual pages.

=== How do we handle zfs.sync functions in open context?

When such a function is found by the interpreter and we are running
in open context we abort the script and we spit out a descriptive
runtime error. For example, given the script below ...

arg = ...
fs = arg["argv"][1]
err = zfs.sync.destroy(fs)
msg = "destroying " .. fs .. " err=" .. err
return msg

if we run it in open context, we will get back the following error:

Channel program execution failed:
[string "channel program"]:3: running functions from the zfs.sync
submodule requires passing sync=TRUE to lzc_channel_program()
(i.e. do not specify the "-n" command line argument)
stack traceback:
            [C]: in function 'destroy'
            [string "channel program"]:3: in main chunk

=== What about testing?

We've introduced new wrappers for all channel program tests that
run each channel program as both (startard & open-context) and
expect the appropriate behavior depending on the program using
the zfs.sync module.

OpenZFS-issue: https://www.illumos.org/issues/8677
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/17a49e15
Closes #6558
2018-02-08 16:05:57 -08:00
Chris Williamson
d99a015343 OpenZFS 7431 - ZFS Channel Programs
Authored by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Don Brady <don.brady@delphix.com>
Ported-by: John Kennedy <john.kennedy@delphix.com>

OpenZFS-issue: https://www.illumos.org/issues/7431
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/dfc11533

Porting Notes:
* The CLI long option arguments for '-t' and '-m' don't parse on linux
* Switched from kmem_alloc to vmem_alloc in zcp_lua_alloc
* Lua implementation is built as its own module (zlua.ko)
* Lua headers consumed directly by zfs code moved to 'include/sys/lua/'
* There is no native setjmp/longjump available in stock Linux kernel.
  Brought over implementations from illumos and FreeBSD
* The get_temporary_prop() was adapted due to VFS platform differences
* Use of inline functions in lua parser to reduce stack usage per C call
* Skip some ZFS Test Suite ZCP tests on sparc64 to avoid stack overflow
2018-02-08 15:28:18 -08:00
LOLi
63f88c12b4 Fix style issues in man pages and commands help
* Remove 'zfs snap' from zfs help message (OpenZFS sync)
* Update zfs(8) to suggest 'snap' can be used as an alias for 'snapshot'
* Enforce 80 columns limit in help messages
* Remove zfs_disable_dup_eviction from zfs-module-parameters(5)
* Expose zfs_scan_max_ext_gap as a kernel module parameter.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7087
2018-01-29 15:05:03 -08:00
Yuri Pankov
bcb1a8a25e OpenZFS 8898 - creating fs with checksum=skein on the boot pools fails ungracefully
Authored by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8898
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9fa2266d9a
Closes #7031
2018-01-11 13:53:04 -08:00
Antonio Russo
5c2552c564 systemd zfs-import.target and documentation
zfs-import-{cache,scan}.service must complete before any mounting of
filesystems can occur. To simplify this dependency, create a target
that is reached After (in the systemd sense) the pool is imported.

Additionally, recommend that legacy zfs mounts use the option

x-systemd.requires=zfs-import.target

to codify this requirement.

Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com>
Closes #6764
2017-10-30 13:18:26 -07:00
Tom Caputi
4807c0badb Encryption patch follow-up
* PBKDF2 implementation changed to OpenSSL implementation.

* HKDF implementation moved to its own file and tests
  added to ensure correctness.

* Removed libzfs's now unnecessary dependency on libzpool
  and libicp.

* Ztest can now create and test encrypted datasets. This is
  currently disabled until issue #6526 is resolved, but
  otherwise functions as advertised.

* Several small bug fixes discovered after enabling ztest
  to run on encrypted datasets.

* Fixed coverity defects added by the encryption patch.

* Updated man pages for encrypted send / receive behavior.

* Fixed a bug where encrypted datasets could receive
  DRR_WRITE_EMBEDDED records.

* Minor code cleanups / consolidation.

Signed-off-by: Tom Caputi <tcaputi@datto.com>
2017-10-11 16:54:48 -04:00
LOLi
90cdf2833d Add mdoc style checker
Add a new make 'mancheck' target which uses mandoc -Tlint to verify
manpage files: currently only zfs(8), zpool(8) zdb(8) and zgenhostid(8)
are supported.

Additionally fix some outstanding manpage formatting issues.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6646
2017-09-16 10:51:24 -07:00
LOLi
835db58592 Add -vnP support to 'zfs send' for bookmarks
This leverages the functionality introduced in cf7684b to expose
verbose, dry-run and parsable 'zfs send' options for bookmarks.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #3666 
Closes #6601
2017-09-08 15:24:31 -07:00
Mike Swanson
57858fb5ca Recommend compression=on in zfs(8) dedup section
compression=lz4 depends on the lz4 feature being enabled, while
compression=on will let ZFS use either lzjb or lz4 where appropriate.
It also allows the documentation to not go out of date if/when ZFS
picks a new default in the future.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mike Swanson <mikeonthecomputer@gmail.com>
Closes #6614
2017-09-08 15:21:58 -07:00
bunder2015
2917956841 zfs(8) manpage corrections
Corrected indent of the note located at the bottom of the options for
zfs send as well as remove an extra whitespace

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #6590
2017-09-05 13:45:18 -07:00
Giuseppe Di Natale
d7323e79a6 OpenZFS 8547 - update mandoc to 1.14.3
Authored by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8547
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c66b804
Closes #6549
2017-08-24 10:30:42 -07:00
Tom Caputi
9b8407638d Send / Recv Fixes following b52563
This patch fixes several issues discovered after
the encryption patch was merged:

* Fixed a bug where encrypted datasets could attempt
  to receive embedded data records.

* Fixed a bug where dirty records created by the recv
  code wasn't properly setting the dr_raw flag.

* Fixed a typo where a dmu_tx_commit() was changed to
  dmu_tx_abort()

* Fixed a few error handling bugs unrelated to the
  encryption patch in dmu_recv_stream()

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6512 
Closes #6524 
Closes #6545
2017-08-23 16:54:24 -07:00
Tom Caputi
b525630342 Native Encryption for ZFS on Linux
This change incorporates three major pieces:

The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.

The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.

The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494 
Closes #5769
2017-08-14 10:36:48 -07:00
LOLi
cf8738d853 Add port of FreeBSD 'volmode' property
The volmode property may be set to control the visibility of ZVOL
block devices.

This allow switching ZVOL between three modes:
   full - existing fully functional behaviour (default)
   dev  - hide partitions on ZVOL block devices
   none - not exposing volumes outside ZFS

Additionally the new zvol_volmode module parameter can be used to
control the default behaviour.

This functionality can be used, for instance, on "backup" pools to
avoid cluttering /dev with unneeded zd* devices.

Original-patch-by: mav <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>

FreeBSD-commit: https://github.com/freebsd/freebsd/commit/dd28e6bb
Closes #1796 
Closes #3438 
Closes #6233
2017-07-12 13:05:37 -07:00
Brian Behlendorf
44f09cdc59 Convert man zfs.8 to mdoc (OpenZFS sync)
* Fixed some typos
* Additional description for some commands arguments
* Text reworked to be in sync with OpenZFS
* Added Linux as .Os type

Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6282
2017-06-29 09:55:30 -07:00
LOLi
a3eeab2de6 Add property overriding (-o|-x) to 'zfs receive'
This allows users to specify "-o property=value" to override and
"-x property" to exclude properties when receiving a zfs send stream.
Both native and user properties can be specified.

This is useful when using zfs send/receive for periodic
backup/replication because it lets users change properties such as
canmount, mountpoint, or compression without modifying the source.

References:
   https://www.illumos.org/issues/2745
   https://www.illumos.org/issues/3753

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #1350 
Closes #5349
2017-05-09 16:21:09 -07:00
Christian Schwarz
305bc4b370 Make createtxg and guid properties public
Document the existence of `createtxg` and `guid` native properties
in man pages and zfs command output.

One of the great features of ZFS is incremental replication of
snapshots, possibly between pools on different machines.

Shell scripts are commonly used to auomate this procedure. They have to
find the most recent common snapshot between both sides and then
perform incremental send & recv.
Currently, scripts rely on the sorting order of `zfs list`, which
defaults to `createtxg`, and the assumption that snapshot names on
either side do not change.

By making `createtxg` and `guid` part of the public ZFS interface,
scripts are enabled to use

  a) `createtxg` to determine the logical & temporal order of snapshots
     (the creation property is not an equivalent substitute since
      multiple snapshots may be created within one second)
  b) `guid` to uniquely identify a snapshot, independent of its current
      display name

This has the potential of making scripts safer and correct.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: DHE <git@dehacked.net>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #6102
2017-05-09 15:36:53 -07:00
Dan Kimmel
a7004725d0 OpenZFS 7252 - compressed zfs send / receive
OpenZFS 7252 - compressed zfs send / receive
OpenZFS 7628 - create long versions of ZFS send / receive options

Authored by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: David Quigley <dpquigl@davequigley.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed by: David Quigley <dpquigl@davequigley.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Ported-by: bunder2015 <omfgbunder@gmail.com>
Ported-by: Don Brady <don.brady@intel.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting Notes:
- Most of 7252 was already picked up during ABD work.  This
  commit represents the gap from the final commit to openzfs.
- Fixed split_large_blocks check in do_dump()
- An alternate version of the write_compressible() function was
  implemented for Linux which does not depend on fio.  The behavior
  of fio differs significantly based on the exact version.
- mkholes was replaced with truncate for Linux.

OpenZFS-issue: https://www.illumos.org/issues/7252
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5602294
Closes #6067
2017-04-26 12:31:43 -07:00
LOLi
038091fd4f Documentation fixes for zfs(8) and 'zfs' binary
* bookmarks are not supported when sending all intermediary snaps (-I)
* add missing compressed (-c) option to the 'zfs' help and manpage

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6028
2017-04-20 12:12:50 -07:00
DeHackEd
f974e41426 zfs(8) fixes
Documentation fixes for zfs(8)

* White space issue in the userused@user property section
* zfs send supports using bookmarks as the origin snapshot

Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes #5906
2017-03-20 15:14:28 -07:00
Chunwei Chen
9b77d1c958 Fix nfs snapdir automount
The current implementation for allowing nfs to access snapdir is very buggy.
It uses a special fh for snapdirs, such that the next time nfsd does
fh_to_dentry, it actually returns the root inode inside the snapshot. So nfsd
never knows it cross a mountpoint.

The problem is that nfsd will not hold a reference on the vfsmount of the
snapshot. This cause auto unmounter to unmount the snapshot even though nfs is
still holding dentries in it.

To fix this, we return the inode for the snapdirs themselves. However, we also
trigger automount upon fh_to_dentry, and return ESTALE so nfsd will revalidate
and see the mountpoint and do crossmnt.

Because nfsd will now be aware that these are different filesystems users
must add crossmnt to their export options to access snapshot directories.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #3794
Closes #4716
Closes #5810 
Closes #5833
2017-03-08 09:26:33 -08:00
George Melikov
aeacdefedc OpenZFS 7386 - zfs get does not work properly with bookmarks
Authored by: Marcel Telka <marcel@telka.sk>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7386
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/edb901a
Closes #5666
2017-01-26 14:42:15 -08:00
Håkan Johansson
9ef3906a5a Minor man-page formatting fixes
fB -> \fB in zpool.8 (Properties -> cachefile)
\fN -> \fB in zfs-module-parameters.5 (zfs_dirty_data_max_max_percent)
Three | -> \fR|\fI fixes for arguments of diff and inherit in zfs.8.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Haakan T Johansson <f96hajo@chalmers.se>
Closes #5645
2017-01-24 09:09:02 -08:00
George Melikov
a0aacd3741 OpenZFS 7257 - zfs manpage user property length needs to be updated
Since zpool version 16, this limit is actually 8192 characters.
Additionally, this limit is actually 8192 bytes, as it supports UTF-8.

Authored by: Eli Rosenthal <eli.rosenthal@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7257
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3bc7169
Closes #5608
2017-01-17 15:30:01 -08:00
George Melikov
fdbaf44ffb OpenZFS 7276 - zfs(1m) manpage could better describe space properties
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/7276
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d750135
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/29c6739
Closes #5549
2017-01-13 13:31:29 -08:00
bunder2015
53ed2db212 Remove extra + from zfs man page
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #5508
2016-12-21 11:06:02 -08:00
LOLi
2f71caf2d9 Allow zfs unshare <protocol> -a
Allow `zfs unshare <protocol> -a` command to share or unshare all datasets
of a given protocol, nfs or smb.

Additionally, enable most of ZFS Test Suite zfs_share/zfs_unshare test cases.
To work around some Illumos-specific functionalities ($SHARE/$UNSHARE) some
function wrappers were added around them.

Finally, fix and issue in smb_is_share_active() that would leave SMB shares
exported when invoking 'zfs unshare -a'

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #3238 
Closes #5367
2016-11-29 12:22:38 -07:00
Jinshan Xiong
1de321e626 Add support for user/group dnode accounting & quota
This patch tracks dnode usage for each user/group in the
DMU_USER/GROUPUSED_OBJECT ZAPs. ZAP entries dedicated to dnode
accounting have the key prefixed with "obj-" followed by the UID/GID
in string format (as done for the block accounting).
A new SPA feature has been added for dnode accounting as well as
a new ZPL version. The SPA feature must be enabled in the pool
before upgrading the zfs filesystem. During the zfs version upgrade,
a "quotacheck" will be executed by marking all dnode as dirty.

ZoL-bug-id: https://github.com/zfsonlinux/zfs/issues/3500

Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: Johann Lombardi <johann.lombardi@intel.com>
2016-10-07 09:45:13 -07:00
Tony Hutter
3c67d83a8a OpenZFS 4185 - add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported by: Tony Hutter <hutter2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/4185
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45818ee

Porting Notes:
This code is ported on top of the Illumos Crypto Framework code:

    b5e030c8db

The list of porting changes includes:

- Copied module/icp/include/sha2/sha2.h directly from illumos

- Removed from module/icp/algs/sha2/sha2.c:
	#pragma inline(SHA256Init, SHA384Init, SHA512Init)

- Added 'ctx' to lib/libzfs/libzfs_sendrecv.c:zio_checksum_SHA256() since
  it now takes in an extra parameter.

- Added CTASSERT() to assert.h from for module/zfs/edonr_zfs.c

- Added skein & edonr to libicp/Makefile.am

- Added sha512.S.  It was generated from sha512-x86_64.pl in Illumos.

- Updated ztest.c with new fletcher_4_*() args; used NULL for new CTX argument.

- In icp/algs/edonr/edonr_byteorder.h, Removed the #if defined(__linux) section
  to not #include the non-existant endian.h.

- In skein_test.c, renane NULL to 0 in "no test vector" array entries to get
  around a compiler warning.

- Fixup test files:
	- Rename <sys/varargs.h> -> <varargs.h>, <strings.h> -> <string.h>,
	- Remove <note.h> and define NOTE() as NOP.
	- Define u_longlong_t
	- Rename "#!/usr/bin/ksh" -> "#!/bin/ksh -p"
	- Rename NULL to 0 in "no test vector" array entries to get around a
	  compiler warning.
	- Remove "for isa in $($ISAINFO); do" stuff
	- Add/update Makefiles
	- Add some userspace headers like stdio.h/stdlib.h in places of
	  sys/types.h.

- EXPORT_SYMBOL *_Init/*_Update/*_Final... routines in ICP modules.

- Update scripts/zfs2zol-patch.sed

- include <sys/sha2.h> in sha2_impl.h

- Add sha2.h to include/sys/Makefile.am

- Add skein and edonr dirs to icp Makefile

- Add new checksums to zpool_get.cfg

- Move checksum switch block from zfs_secpolicy_setprop() to
  zfs_check_settable()

- Fix -Wuninitialized error in edonr_byteorder.h on PPC

- Fix stack frame size errors on ARM32
  	- Don't unroll loops in Skein on 32-bit to save stack space
  	- Add memory barriers in sha2.c on 32-bit to save stack space

- Add filetest_001_pos.ksh checksum sanity test

- Add option to write psudorandom data in file_write utility
2016-10-03 14:51:15 -07:00
Dan Kimmel
2aa34383b9 DLPX-40252 integrate EP-476 compressed zfs send/receive
Authored by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported by: David Quigley <david.quigley@intel.com>
Issue #5078
2016-09-13 09:58:58 -07:00
Paul Dagnelie
e6d3a843d6 OpenZFS 6393 - zfs receive a full send as a clone
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6394
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/68ecb2e
2016-06-28 13:47:03 -07:00
Matthew Ahrens
47dfff3b86 OpenZFS 2605, 6980, 6902
2605 want to resume interrupted zfs send
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Xin Li <delphij@freebsd.org>
Reviewed by: Arne Jansen <sensille@gmx.net>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: kernelOfTruth <kerneloftruth@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/2605
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9c3fd12

6980 6902 causes zfs send to break due to 32-bit/64-bit struct mismatch
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6980
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ea4a67f

Porting notes:
- All rsend and snapshop tests enabled and updated for Linux.
- Fix misuse of input argument in traverse_visitbp().
- Fix ISO C90 warnings and errors.
- Fix gcc 'missing braces around initializer' in
  'struct send_thread_arg to_arg =' warning.
- Replace 4 argument fletcher_4_native() with 3 argument version,
  this change was made in OpenZFS 4185 which has not been ported.
- Part of the sections for 'zfs receive' and 'zfs send' was
  rewritten and reordered to approximate upstream.
- Fix mktree xattr creation, 'user.' prefix required.
- Minor fixes to newly enabled test cases
- Long holds for volumes allowed during receive for minor registration.
2016-06-28 13:47:02 -07:00
Ned Bass
50c957f702 Implement large_dnode pool feature
Justification
-------------

This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks.  Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided.  Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks.  Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.

ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.

Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.

Implementation
--------------

The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.

Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.

The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run

 # zfs set dnodesize=auto tank/fish

The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.

The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.

New DMU interfaces:
  dmu_object_alloc_dnsize()
  dmu_object_claim_dnsize()
  dmu_object_reclaim_dnsize()

New ZAP interfaces:
  zap_create_dnsize()
  zap_create_norm_dnsize()
  zap_create_flags_dnsize()
  zap_create_claim_norm_dnsize()
  zap_create_link_dnsize()

The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.

These are a few noteworthy changes to key functions:

* The prototype for dnode_hold_impl() now takes a "slots" parameter.
  When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
  ensure the hole at the specified object offset is large enough to
  hold the dnode being created. The slots parameter is also used
  to ensure a dnode does not span multiple dnode blocks. In both of
  these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
  these failure cases are only possible when using DNODE_MUST_BE_FREE.

  If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
  dnode_hold_impl() will check if the requested dnode is already
  consumed as an extra dnode slot by an large dnode, in which case
  it returns ENOENT.

* The function dmu_object_alloc() advances to the next dnode block
  if dnode_hold_impl() returns an error for a requested object.
  This is because the beginning of the next dnode block is the only
  location it can safely assume to either be a hole or a valid
  starting point for a dnode.

* dnode_next_offset_level() and other functions that iterate
  through dnode blocks may no longer use a simple array indexing
  scheme. These now use the current dnode's dn_num_slots field to
  advance to the next dnode in the block. This is to ensure we
  properly skip the current dnode's bonus area and don't interpret it
  as a valid dnode.

zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.

For ZIL create log records, zdb will now display the slot count for
the object.

ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.

Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number.  This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.

ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.

Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.

While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.

For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.

ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.

Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.

Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-06-24 13:13:21 -07:00