This patch adds the ability for zinject to trigger decryption
and authentication faults in the ZIO and ARC layers. This
functionality is exposed via the new "decrypt" error type, which
may be provided for "data" object types.
This patch also refactors some of the core encryption / decryption
functions so that they have consistent prototypes, handle errors
consistently, and do not have unused arguments.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7474
Two deadlocks / ASSERT failures were introduced in a2c2ed1b which
would occur whenever arc_buf_fill() failed to decrypt a block of
data. This occurred because the call to arc_buf_destroy() which
was responsible for cleaning up the newly created buffer would
attempt to take out the hdr lock that it was already holding. This
was resolved by calling the underlying functions directly without
retaking the lock.
In addition, the dmu_diff() code did not properly ensure that keys
were loaded and mapped before begining dataset traversal. It turns
out that this code does not need to look at any encrypted values,
so the code was altered to perform raw IO only.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7354Closes#7456
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete
This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk. The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool. An entry becomes obsolete when all the blocks that use
it are freed. An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones). Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible. This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.
Note that when a device is removed, we do not verify the checksum of
the data that is copied. This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.
At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.
Porting Notes:
* Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
* Remove comment regarding "mpt" driver where zfs_remove_max_segment
is initialized to SPA_MAXBLOCKSIZE.
Change zfs_condense_indirect_commit_entry_delay_ticks to
zfs_condense_indirect_commit_entry_delay_ms for consistency with
most other tunables in which delays are specified in ms.
* ZTS changes:
Use set_tunable rather than mdb
Use zpool sync as appropriate
Use sync_pool instead of sync
Kill jobs during test_removal_with_operation to allow unmount/export
Don't add non-disk names such as "mirror" or "raidz" to $DISKS
Use $TEST_BASE_DIR instead of /tmp
Increase HZ from 100 to 1000 which is more common on Linux
removal_multiple_indirection.ksh
Reduce iterations in order to not time out on the code
coverage builders.
removal_resume_export:
Functionally, the test case is correct but there exists a race
where the kernel thread hasn't been fully started yet and is
not visible. Wait for up to 1 second for the removal thread
to be started before giving up on it. Also, increase the
amount of data copied in order that the removal not finish
before the export has a chance to fail.
* MMP compatibility, the concept of concrete versus non-concrete devices
has slightly changed the semantics of vdev_writeable(). Update
mmp_random_leaf_impl() accordingly.
* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
feature which is not supported by OpenZFS.
* Added support for new vdev removal tracepoints.
* Test cases removal_with_zdb and removal_condense_export have been
intentionally disabled. When run manually they pass as intended,
but when running in the automated test environment they produce
unreliable results on the latest Fedora release.
They may work better once the upstream pool import refectoring is
merged into ZoL at which point they will be re-enabled.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1ebCloses#6900
When setting `zfs_arc_max` its minimum value is allowed
to be 64 MiB. There was an off-by-1 error which can matter
on tiny systems.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Zubrzycki <github@mid-earth.net>
Closes#7417
The ASSERT was erroneously copied from the next section of code.
The buffer's size should be expanded from "psize" to "asize"
if necessary.
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#7375
Currently, the decryption and block authentication code in
the ZIO / ARC layers is a bit inconsistent with regards to
the ereports that are produces and the error codes that are
passed to calling functions. This patch ensures that all of
these errors (which begin as ECKSUM) are converted to EIO
before they leave the ZIO or ARC layer and that ereports
are correctly generated on each decryption / authentication
failure.
In addition, this patch fixes a bug in zio_decrypt() where
ECKSUM never gets written to zio->io_error.
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7372
Authored by: Allan Jude <allanjude@freebsd.org>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/9321
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/92b05f3a18Closes#7333
Currently, ZFS tracks statistics about calls to arc_read()
via the /proc/spl/kstat/zfs/<pool>/reads file for debugging.
Unfortunately, this file currently counts embedded bps as
disk reads since they are technically processed by the ZIO
layer. This pollutes the log since the ARC will never cache
embedded bps. This patch corrects this issue by preventing
the logging of embedded bp reads.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7334
zfs_dbgmsg() should record a message by default. As a general
principal, these messages shouldn't be too verbose. Furthermore, the
amount of memory used is limited to 4MB (by default).
dprintf() should only record a message if this is a debug build, and
ZFS_DEBUG_DPRINTF is set in zfs_flags. This flag is not set by default
(even on debug builds). These messages are extremely verbose, and
sometimes nontrivial to compute.
SET_ERROR() should only record a message if ZFS_DEBUG_SET_ERROR is set
in zfs_flags. This flag is not set by default (even on debug builds).
This brings our behavior in line with illumos. Note that the message
format is unchanged (including file, line, and function, even though
these are not recorded on illumos).
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#7314
Provide infrastructure to auto-configure to enum and API changes in the
global page stats used for our free memory calculations.
arc_free_memory has been broken since an API change in Linux v3.14:
2016-07-28 v4.8 599d0c95 mm, vmscan: move LRU lists to node
2016-07-28 v4.8 75ef7184 mm, vmstat: add infrastructure for per-node
vmstats
These commits moved some of global_page_state() into
global_node_page_state(). The API change was particularly egregious as,
instead of breaking the old code, it silently did the wrong thing and we
continued using global_page_state() where we should have been using
global_node_page_state(), thus indexing into the wrong array via
NR_SLAB_RECLAIMABLE et al.
There have been further API changes along the way:
2017-07-06 v4.13 385386cf mm: vmstat: move slab statistics from zone to
node counters
2017-09-06 v4.14 c41f012a mm: rename global_page_state to
global_zone_page_state
...and various (incomplete, as it turns out) attempts to accomodate
these changes in ZoL:
2017-08-24 2209e409 Linux 4.8+ compatibility fix for vm stats
2017-09-16 787acae0 Linux 3.14 compat: IO acct, global_page_state, etc
2017-09-19 661907e6 Linux 4.14 compat: IO acct, global_page_state, etc
The config infrastructure provided here resolves these issues going back
to the original API change in v3.14 and is robust against further Linux
changes in this area.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes#7170
Currently, the DMU relies on ZIO layer compression to free LO
dnode blocks that no longer have objects in them. However,
raw receives disable all compression, meaning that these blocks
can never be freed. In addition to the obvious space concerns,
this could also cause incremental raw receives to fail to mount
since the MAC of a hole is different from that of a completely
zeroed block.
This patch corrects this issue by adding a special case in
zio_write_compress() which will attempt to compress these blocks
to a hole even if ZIO_FLAG_RAW_ENCRYPT is set. This patch also
removes the zfs_mdcomp_disable tunable, since tuning it could
cause these same issues.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7198
zfs_arc_p_aggressive_disable is no more. This PR removes docs
and module parameters for zfs_arc_p_aggressive_disable.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Richard Elling <Richard.Elling@RichardElling.com>
Closes#7135
Currently, the ARC exposes 2 tunables (zfs_arc_min_prefetch_ms
and zfs_arc_min_prescient_prefetch_ms) which are documented
to be specified in milliseconds. However, the code actually
uses the values as though they were in seconds. This patch
adjusts the code to match the names and documentation of the
tunables.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#7126
The on-disk format for encrypted datasets protects not only
the encrypted and authenticated blocks themselves, but also
the order and interpretation of these blocks. In order to
make this work while maintaining the ability to do raw
sends, the indirect bps maintain a secure checksum of all
the MACs in the block below it along with a few other
fields that determine how the data is interpreted.
Unfortunately, the current on-disk format erroneously
includes some fields which are not portable and thus cannot
support raw sends. It is not possible to easily work around
this issue due to a separate and much smaller bug which
causes indirect blocks for encrypted dnodes to not be
compressed, which conflicts with the previous bug. In
addition, the current code generates incompatible on-disk
formats on big endian and little endian systems due to an
issue with how block pointers are authenticated. Finally,
raw send streams do not currently include dn_maxblkid when
sending both the metadnode and normal dnodes which are
needed in order to ensure that we are correctly maintaining
the portable objset MAC.
This patch zero's out the offending fields when computing
the bp MAC and ensures that these MACs are always
calculated in little endian order (regardless of the host
system's byte order). This patch also registers an errata
for the old on-disk format, which we detect by adding a
"version" field to newly created DSL Crypto Keys. We allow
datasets without a version (version 0) to only be mounted
for read so that they can easily be migrated. We also now
include dn_maxblkid in raw send streams to ensure the MAC
can be maintained correctly.
This patch also contains minor bug fixes and cleanups.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#6845Closes#6864Closes#7052
When the compressed ARC feature was added in commit d3c2ae1
the method of reference counting in the ARC was modified. As
part of this accounting change the arc_buf_add_ref() function
was removed entirely.
This would have be fine but the arc_buf_add_ref() function
served a second undocumented purpose of updating the ARC access
information when taking a hold on a dbuf. Without this logic
in place a cached dbuf would not migrate its associated
arc_buf_hdr_t to the MFU list. This would negatively impact
the ARC hit rate, particularly on systems with a small ARC.
This change reinstates the missing call to arc_access() from
dbuf_hold() by implementing a new arc_buf_access() function.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#6171Closes#6852Closes#6989
When sequential scrubs were merged, all calls to arc_read()
(including prefetch IOs) were given ZIO_PRIORITY_ASYNC_READ.
Unfortunately, this behaves badly with an existing issue where
prefetch IOs cannot be re-prioritized after the issue. The
result is that synchronous reads end up in the same vdev_queue
as the scrub IOs and can have (in some workloads) multiple
seconds of latency.
This patch incorporates 2 changes. The first ensures that all
scrub IOs are given ZIO_PRIORITY_SCRUB to allow the vdev_queue
code to differentiate between these I/Os and user prefetches.
Second, this patch introduces zio_change_priority() to provide
the missing capability to upgrade a zio's priority.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#6921Closes#6926
Only access the `b_crypt_hdr` field of an ARC header if the content
is encrypted.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes#6877
Currently, scrubs and resilvers can take an extremely
long time to complete. This is largely due to the fact
that zfs scans process pools in logical order, as
determined by each block's bookmark. This makes sense
from a simplicity perspective, but blocks in zfs are
often scattered randomly across disks, particularly
due to zfs's copy-on-write mechanisms.
This patch improves performance by splitting scrubs
and resilvers into a metadata scanning phase and an IO
issuing phase. The metadata scan reads through the
structure of the pool and gathers an in-memory queue
of I/Os, sorted by size and offset on disk. The issuing
phase will then issue the scrub I/Os as sequentially as
possible, greatly improving performance.
This patch also updates and cleans up some of the scan
code which has not been updated in several years.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Authored-by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Authored-by: Alek Pinchuk <apinchuk@datto.com>
Authored-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#3625Closes#6256
In __dbuf_hold_impl(), if a buffer is currently syncing and is still
referenced from db_data, a copy is made in case it is dirtied again in
the txg. Previously, the buffer for the copy was simply allocated with
arc_alloc_buf() which doesn't handle compressed or encrypted buffers
(which are a special case of a compressed buffer). The result was
typically an invalid memory access because the newly-allocated buffer
was of the uncompressed size.
This commit fixes the problem by handling the 2 compressed cases,
encrypted and unencrypted, respectively, with arc_alloc_raw_buf() and
arc_alloc_compressed_buf().
Although using the proper allocation functions fixes the invalid memory
access by allocating a buffer of the compressed size, another unrelated
issue made it impossible to properly detect compressed buffers in the
first place. The header's compression flag was set to ZIO_COMPRESS_OFF
in arc_write() when it was possible that an attached buffer was actually
compressed. This commit adds logic to only set ZIO_COMPRESS_OFF in
the non-ZIO_RAW case which wil handle both cases of compressed buffers
(encrypted or unencrypted).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#5742Closes#6797
Fixed build regression in non-debug builds from recent cleanups of
c89 workarounds.
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes#6832
With PR 5756 the zfs module now supports c99 and the
remaining past c89 workarounds can be undone.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes#6816
Fix compiler warnings in zdb. With these changes, FreeBSD can compile
zdb with all compiler warnings enabled save -Wunused-parameter.
usr/src/cmd/zdb/zdb.c
usr/src/cmd/zdb/zdb_il.c
usr/src/uts/common/fs/zfs/sys/sa.h
usr/src/uts/common/fs/zfs/sys/spa.h
Fix numerous warnings, including:
* const-correctness
* shadowing global definitions
* signed vs unsigned comparisons
* missing prototypes, or missing static declarations
* unused variables and functions
* Unreadable array initializations
* Missing struct initializers
usr/src/cmd/zdb/zdb.h
Add a header file to declare common symbols
usr/src/lib/libzpool/common/sys/zfs_context.h
usr/src/uts/common/fs/zfs/arc.c
usr/src/uts/common/fs/zfs/dbuf.c
usr/src/uts/common/fs/zfs/spa.c
usr/src/uts/common/fs/zfs/txg.c
Add a function prototype for zk_thread_create, and ensure that every
callback supplied to this function actually matches the prototype.
usr/src/cmd/ztest/ztest.c
usr/src/uts/common/fs/zfs/sys/zil.h
usr/src/uts/common/fs/zfs/zfs_replay.c
usr/src/uts/common/fs/zfs/zvol.c
Add a function prototype for zil_replay_func_t, and ensure that
every function of this type actually matches the prototype.
usr/src/uts/common/fs/zfs/sys/refcount.h
Change FTAG so it discards any constness of __func__, necessary
since existing APIs expect it passed as void *.
Porting Notes:
- Many of these fixes have already been applied to Linux. For
consistency the OpenZFS version of a change was applied if the
warning was addressed in an equivalent but different fashion.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Authored by: Alan Somers <asomers@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/8081
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/843abe1b8aCloses#6787
Several issues were uncovered by running stress tests with zfs
encryption and raw sends in particular. The issues and their
associated fixes are as follows:
* arc_read_done() has the ability to chain several requests for
the same block of data via the arc_callback_t struct. In these
cases, the ARC would only use the first request's dsobj from
the bookmark to decrypt the data. This is problematic because
the first request might be a prefetch zio which is able to
handle the key not being loaded, while the second might use a
different key that it is sure will work. The fix here is to
pass the dsobj with each individual arc_callback_t so that each
request can attempt to decrypt the data separately.
* DRR_FREE and DRR_FREEOBJECT records in a send file were not
having their transactions properly tagged as raw during raw
sends, which caused a panic when the dbuf code attempted to
decrypt these blocks.
* traverse_prefetch_metadata() did not properly set
ZIO_FLAG_SPECULATIVE when issuing prefetch IOs.
* Added a few asserts and code cleanups to ensure these issues
are more detectable in the future.
Signed-off-by: Tom Caputi <tcaputi@datto.com>
* PBKDF2 implementation changed to OpenSSL implementation.
* HKDF implementation moved to its own file and tests
added to ensure correctness.
* Removed libzfs's now unnecessary dependency on libzpool
and libicp.
* Ztest can now create and test encrypted datasets. This is
currently disabled until issue #6526 is resolved, but
otherwise functions as advertised.
* Several small bug fixes discovered after enabling ztest
to run on encrypted datasets.
* Fixed coverity defects added by the encryption patch.
* Updated man pages for encrypted send / receive behavior.
* Fixed a bug where encrypted datasets could receive
DRR_WRITE_EMBEDDED records.
* Minor code cleanups / consolidation.
Signed-off-by: Tom Caputi <tcaputi@datto.com>
With the addition of the ABD changes consumption of the virtual
address space has been greatly reduced. This exposed an issue on
CONFIG_HIGHMEM systems where free memory was being calculated
incorrectly. Functionally this didn't cause any major problems
prior to ABD because a lack of available virtual address space
was used as an indicator of low memory.
This patch makes the following changes to address the issue and
in the process realigns the code further with OpenZFS. There
are no substantive changes in behavior for 64-bit systems.
* Added CONFIG_HIGHMEM case to the arc_all_memory() and
arc_free_memory() functions to only consider low memory pages
on CONFIG_HIGHMEM systems.
* The arc_free_memory() function was updated to return bytes
instead of pages to be consistent with the other helper
functions. In user space we make up some reasonable values
since currently only testing is performed in this context.
* Adds three new values to the arcstats kstat to provide visibility
in to the ARC's assessment of the memory situation:
memory_all_bytes, memory_free_bytes, and memory_available_bytes.
* Added kmem_reap() call to arc_available_memory() for 32-bit
builds to realign code with OpenZFS.
* Reduced size of test file in /async_destroy_001_pos.ksh to
speed up test case. Multiple txgs are still required.
* Move vdevs used by zpool_clear_001_pos and zpool_upgrade_002_pos
to TEST_BASE_DIR location to speed up test cases.
Reviewed-by: David Quigley <david.quigley@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#5352Closes#6734
Commit d3c2ae1 introduced a dbuf cache with a default size of the
minimum of 100M or 1/32 maximum ARC size. (These figures may be adjusted
using dbuf_cache_max_bytes and dbuf_cache_max_shift.) The dbuf cache
is counted as metadata for the purposes of ARC size calculations.
On a 1GB box the ARC maximum size defaults to c_max 493M which gives a
dbuf cache default minimum size of 15.4M, and the ARC metadata defaults
to minimum 16M. I.e. the dbuf cache is an significant proportion of the
minimum metadata size. With other overheads involved this actually means
the ARC metadata doesn't get down to the minimum.
This patch dynamically scales the dbuf cache to the target ARC size
instead of statically scaling it to the maximum ARC size. (The scale is
still set by dbuf_cache_max_shift and the maximum size is still fixed by
dbuf_cache_max_bytes.) Using the target ARC size rather than the current
ARC size is done to help the ARC reach the target rather than simply
focusing on the current size.
Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Issue #6506Closes#6561
Increase the default arc_c_min value to which whichever is larger,
either 32M or 1/32 of total system memory. This is advantageous for
systems with more than 1G of memory where performance issues may
occur when the ARC is allowed to collapse below a minimum size.
At the same time we want to use the bare minimum value which is
still functional so the filesystem can be used in very low memory
environments.
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#6659
generic_start_io_acct/generic_end_io_acct in the master
branch of the linux kernel requires that the request_queue
be provided.
Move the logic from freemem in the spl to arc_free_memory
in arc.c. Do this so we can take advantage of global_page_state
interface checks in zfs.
Upstream kernel replaced struct block_device with
struct gendisk in struct bio. Determine if the
function bio_set_dev exists during configure
and have zfs use that if it exists.
bio_set_dev https://github.com/torvalds/linux/commit/74d4699
global_node_page_state https://github.com/torvalds/linux/commit/75ef718
io acct https://github.com/torvalds/linux/commit/d62e26b
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes#6635
vm_node_stat must be used instead of vm_zone_stat. Unfortunately the
old code still compiles potentially leading to silent failure of
arc_evictable_memory()
AKAMAI: CR 3816601: Regression in zfs dropcache test
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Closes#6528
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#494Closes#5769
* Simplify threads, mutexs, cvs and rwlocks
* Update the zk_thread_create() function to use the same trick
as Illumos. Specifically, cast the new pthread_t to a void
pointer and return that as the kthread_t *. This avoids the
issues associated with managing a wrapper structure and is
safe as long as the callers never attempt to dereference it.
* Update all function prototypes passed to pthread_create() to
match the expected prototype. We were getting away this with
before since the function were explicitly cast.
* Replaced direct zk_thread_create() calls with thread_create()
for code consistency. All consumers of libzpool now use the
proper wrappers.
* The mutex_held() calls were converted to MUTEX_HELD().
* Removed all mutex_owner() calls and retired the interface.
Instead use MUTEX_HELD() which provides the same information
and allows the implementation details to be hidden. In this
case the use of the pthread_equals() function.
* The kthread_t, kmutex_t, krwlock_t, and krwlock_t types had
any non essential fields removed. In the case of kthread_t
and kcondvar_t they could be directly typedef'd to pthread_t
and pthread_cond_t respectively.
* Removed all extra ASSERTS from the thread, mutex, rwlock, and
cv wrapper functions. In practice, pthreads already provides
the vast majority of checks as long as we check the return
code. Removing this code from our wrappers help readability.
* Added TS_JOINABLE state flag to pass to request a joinable rather
than detached thread. This isn't a standard thread_create() state
but it's the least invasive way to pass this information and is
only used by ztest.
TEST_ZTEST_TIMEOUT=3600
Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4547Closes#5503Closes#5523Closes#6377Closes#6495
Update many return and assignment statements to follow the convention
of using the SET_ERROR macro when returning a hard-coded non-zero
value from a function. This aids debugging by recording the error
codes in the debug log.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#6441
Authored by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/7910
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/cb6af4bCloses#6291
Make zfs_arc_meta_limit_percent and zfs_arc_dnode_limit_percent behave
as you would expect from zfs-module-parameters.5.
- recalculate arc_meta_limit if zfs_arc_meta_limit_percent changes
- recalculate arc_dnode_limit if zfs_arc_dnode_limit_percent changes
- correctly set arc_meta_limit and arc_dnode_limit if zfs_arc_max or
zfs_arc_meta_min changes
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes#6269
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>
When writing pre-compressed buffers, arc_write() requires that
the compression algorithm used to compress the buffer matches
the compression algorithm requested by the zio_prop_t, which is
set by dmu_write_policy(). This makes dmu_write_policy() and its
callers a bit more complicated.
We simplify this by making arc_write() trust the caller to supply
the type of pre-compressed buffer that it wants to write,
and override the compression setting in the zio_prop_t.
OpenZFS-issue: https://www.illumos.org/issues/8155
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b55ff58Closes#6200
The arc layer tracks checksums of its data in the arc header
so that it can ensure that buffers haven't changed when they're
not supposed to. This checksum is only maintained while there
is an uncompressed buffer still attached to the header.
Unfortunately there is a missing call to arc_free_cksum() in
arc_release() that can trigger ASSERTs. This has not been a
common issue because the checksums are only maintained for
debug builds and triggering the bug requires writing a block
(and therefore calling arc_release()) while a compressed buffer
is still being used on a debug build. This simply corrects the
issue.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#6105
When multiple filesystems are in use, memory pressure causes arc_cache
to collapse to a minimum. Allow arc_cache to maintain proportional size
even when hit rates are disproportionate. We do this only via evictable
size from the kernel shrinker, thus it's only in effect under memory
pressure.
AKAMAI: zfs: CR 3695072
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Closes#6035
Could return the wrong pages value
AKAMAI: zfs: CR 3695072
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
Calling it when nothing is evictable will cause extra kswapd cpu. Also
if we didn't shrink it's unlikely to have memory to reap because we
likely just called it microseconds ago. The exception is if we are in
direct reclaim.
You can see how hard this is being hit in kswapd with a light test
workload:
34.95% [zfs] [k] arc_kmem_reap_now
5.40% [spl] [k] spl_kmem_cache_reap_now
3.79% [kernel] [k] _raw_spin_lock
2.86% [spl] [k] __spl_kmem_cache_generic_shrinker.isra.7
2.70% [kernel] [k] shrink_slab.part.37
1.93% [kernel] [k] isolate_lru_pages.isra.43
1.55% [kernel] [k] __wake_up_bit
1.20% [kernel] [k] super_cache_count
1.20% [kernel] [k] __radix_tree_lookup
With ZFS just mounted but only ext4/pagecache memory pressure
arc_kmem_reap_now still consumes excessive CPU:
12.69% [kernel] [k] isolate_lru_pages.isra.43
10.76% [kernel] [k] free_pcppages_bulk
7.98% [kernel] [k] drop_buffers
7.31% [kernel] [k] shrink_page_list
6.44% [zfs] [k] arc_kmem_reap_now
4.19% [kernel] [k] free_hot_cold_page
4.00% [kernel] [k] __slab_free
3.95% [kernel] [k] __isolate_lru_page
3.09% [kernel] [k] __radix_tree_lookup
Same pagecache only workload as above with this patch series:
11.58% [kernel] [k] isolate_lru_pages.isra.43
11.20% [kernel] [k] drop_buffers
9.67% [kernel] [k] free_pcppages_bulk
8.44% [kernel] [k] shrink_page_list
4.86% [kernel] [k] __isolate_lru_page
4.43% [kernel] [k] free_hot_cold_page
4.00% [kernel] [k] __slab_free
3.44% [kernel] [k] __radix_tree_lookup
(arc_kmem_reap_now has 0 samples in perf)
AKAMAI: zfs: CR 3695042
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
Lock contention, by itself, shouldn't indicate a stop condition to the
kernel's slab shrinker. Doing so can cause stalls when the kernel is
trying to free large parts of the cache such as is done by drop_caches
Also, perhaps arc_reclaim_lock should be a spinlock, and this code
eliminated.
AKAMAI: zfs: CR 3593801
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
Move arcstat_need_free increment from all direct calls to when
arc_reclaim_lock is busy and we exit wihout doing anything. Data will
be reclaimed in reclaim thread. The previous location meant that we
both reclaim the memory in this thread, and also schedule the same
amount of memory for reclaim in arc_reclaim, effectively doubling the
requested reclaim.
AKAMAI: zfs: CR 3695072
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
Ensures proper accounting of bytes we requested to free
AKAMAI: zfs: CR 3695072
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
Ghost meta/data buffers are not actually allocated
AKAMAI: zfs: CR 3695072
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Issue #6035
OpenZFS 7252 - compressed zfs send / receive
OpenZFS 7628 - create long versions of ZFS send / receive options
Authored by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: David Quigley <dpquigl@davequigley.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed by: David Quigley <dpquigl@davequigley.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Ported-by: bunder2015 <omfgbunder@gmail.com>
Ported-by: Don Brady <don.brady@intel.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Porting Notes:
- Most of 7252 was already picked up during ABD work. This
commit represents the gap from the final commit to openzfs.
- Fixed split_large_blocks check in do_dump()
- An alternate version of the write_compressible() function was
implemented for Linux which does not depend on fio. The behavior
of fio differs significantly based on the exact version.
- mkholes was replaced with truncate for Linux.
OpenZFS-issue: https://www.illumos.org/issues/7252
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5602294Closes#6067
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Matthew Ahrens <mahrens@delphix.com>
spa_sync() iterates over all the dirty dnodes and processes each of them
by calling dnode_sync(). If there are many dirty dnodes (e.g. because we
created or removed a lot of files), the single thread of spa_sync()
calling dnode_sync() can become a bottleneck. Additionally, if many
dnodes are dirtied concurrently in open context (e.g. due to concurrent
file creation), the os_lock will experience lock contention via
dnode_setdirty().
The solution is to track dirty dnodes on a multilist_t, and for
spa_sync() to use separate threads to process each of the sublists in
the multilist.
OpenZFS-issue: https://www.illumos.org/issues/7968
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4a2a54cCloses#5752
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/7867
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/aa1f740dCloses#5874