mirror_zfs/include/sys/zfs_context.h

314 lines
7.6 KiB
C
Raw Normal View History

// SPDX-License-Identifier: CDDL-1.0
2008-12-11 22:14:49 +03:00
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or https://opensource.org/licenses/CDDL-1.0.
2008-12-11 22:14:49 +03:00
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
* Copyright 2011 Nexenta Systems, Inc. All rights reserved.
Implement Redacted Send/Receive Redacted send/receive allows users to send subsets of their data to a target system. One possible use case for this feature is to not transmit sensitive information to a data warehousing, test/dev, or analytics environment. Another is to save space by not replicating unimportant data within a given dataset, for example in backup tools like zrepl. Redacted send/receive is a three-stage process. First, a clone (or clones) is made of the snapshot to be sent to the target. In this clone (or clones), all unnecessary or unwanted data is removed or modified. This clone is then snapshotted to create the "redaction snapshot" (or snapshots). Second, the new zfs redact command is used to create a redaction bookmark. The redaction bookmark stores the list of blocks in a snapshot that were modified by the redaction snapshot(s). Finally, the redaction bookmark is passed as a parameter to zfs send. When sending to the snapshot that was redacted, the redaction bookmark is used to filter out blocks that contain sensitive or unwanted information, and those blocks are not included in the send stream. When sending from the redaction bookmark, the blocks it contains are considered as candidate blocks in addition to those blocks in the destination snapshot that were modified since the creation_txg of the redaction bookmark. This step is necessary to allow the target to rehydrate data in the case where some blocks are accidentally or unnecessarily modified in the redaction snapshot. The changes to bookmarks to enable fast space estimation involve adding deadlists to bookmarks. There is also logic to manage the life cycles of these deadlists. The new size estimation process operates in cases where previously an accurate estimate could not be provided. In those cases, a send is performed where no data blocks are read, reducing the runtime significantly and providing a byte-accurate size estimate. Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Prashanth Sreenivasa <pks@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Chris Williamson <chris.williamson@delphix.com> Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com> Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #7958
2019-06-19 19:48:13 +03:00
* Copyright (c) 2012, 2018 by Delphix. All rights reserved.
* Copyright (c) 2012, Joyent, Inc. All rights reserved.
*/
2008-12-11 22:14:49 +03:00
#ifndef _SYS_ZFS_CONTEXT_H
#define _SYS_ZFS_CONTEXT_H
#ifdef __cplusplus
extern "C" {
#endif
/*
* This code compiles in three different contexts. When __KERNEL__ is defined,
* the code uses "unix-like" kernel interfaces. When _STANDALONE is defined, the
* code is running in a reduced capacity environment of the boot loader which is
* generally a subset of both POSIX and kernel interfaces (with a few unique
* interfaces too). When neither are defined, it's in a userland POSIX or
* similar environment.
*/
#if defined(__KERNEL__) || defined(_STANDALONE)
Support custom build directories and move includes One of the neat tricks an autoconf style project is capable of is allow configurion/building in a directory other than the source directory. The major advantage to this is that you can build the project various different ways while making changes in a single source tree. For example, this project is designed to work on various different Linux distributions each of which work slightly differently. This means that changes need to verified on each of those supported distributions perferably before the change is committed to the public git repo. Using nfs and custom build directories makes this much easier. I now have a single source tree in nfs mounted on several different systems each running a supported distribution. When I make a change to the source base I suspect may break things I can concurrently build from the same source on all the systems each in their own subdirectory. wget -c http://github.com/downloads/behlendorf/zfs/zfs-x.y.z.tar.gz tar -xzf zfs-x.y.z.tar.gz cd zfs-x-y-z ------------------------- run concurrently ---------------------- <ubuntu system> <fedora system> <debian system> <rhel6 system> mkdir ubuntu mkdir fedora mkdir debian mkdir rhel6 cd ubuntu cd fedora cd debian cd rhel6 ../configure ../configure ../configure ../configure make make make make make check make check make check make check This change also moves many of the include headers from individual incude/sys directories under the modules directory in to a single top level include directory. This has the advantage of making the build rules cleaner and logically it makes a bit more sense.
2010-09-05 00:26:23 +04:00
#include <sys/types.h>
#include <sys/atomic.h>
#include <sys/sysmacros.h>
Update build system and packaging Minimal changes required to integrate the SPL sources in to the ZFS repository build infrastructure and packaging. Build system and packaging: * Renamed SPL_* autoconf m4 macros to ZFS_*. * Removed redundant SPL_* autoconf m4 macros. * Updated the RPM spec files to remove SPL package dependency. * The zfs package obsoletes the spl package, and the zfs-kmod package obsoletes the spl-kmod package. * The zfs-kmod-devel* packages were updated to add compatibility symlinks under /usr/src/spl-x.y.z until all dependent packages can be updated. They will be removed in a future release. * Updated copy-builtin script for in-kernel builds. * Updated DKMS package to include the spl.ko. * Updated stale AUTHORS file to include all contributors. * Updated stale COPYRIGHT and included the SPL as an exception. * Renamed README.markdown to README.md * Renamed OPENSOLARIS.LICENSE to LICENSE. * Renamed DISCLAIMER to NOTICE. Required code changes: * Removed redundant HAVE_SPL macro. * Removed _BOOT from nvpairs since it doesn't apply for Linux. * Initial header cleanup (removal of empty headers, refactoring). * Remove SPL repository clone/build from zimport.sh. * Use of DEFINE_RATELIMIT_STATE and DEFINE_SPINLOCK removed due to build issues when forcing C99 compilation. * Replaced legacy ACCESS_ONCE with READ_ONCE. * Include needed headers for `current` and `EXPORT_SYMBOL`. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> TEST_ZIMPORT_SKIP="yes" Closes #7556
2018-02-16 04:53:18 +03:00
#include <sys/vmsystm.h>
#include <sys/condvar.h>
Support custom build directories and move includes One of the neat tricks an autoconf style project is capable of is allow configurion/building in a directory other than the source directory. The major advantage to this is that you can build the project various different ways while making changes in a single source tree. For example, this project is designed to work on various different Linux distributions each of which work slightly differently. This means that changes need to verified on each of those supported distributions perferably before the change is committed to the public git repo. Using nfs and custom build directories makes this much easier. I now have a single source tree in nfs mounted on several different systems each running a supported distribution. When I make a change to the source base I suspect may break things I can concurrently build from the same source on all the systems each in their own subdirectory. wget -c http://github.com/downloads/behlendorf/zfs/zfs-x.y.z.tar.gz tar -xzf zfs-x.y.z.tar.gz cd zfs-x-y-z ------------------------- run concurrently ---------------------- <ubuntu system> <fedora system> <debian system> <rhel6 system> mkdir ubuntu mkdir fedora mkdir debian mkdir rhel6 cd ubuntu cd fedora cd debian cd rhel6 ../configure ../configure ../configure ../configure make make make make make check make check make check make check This change also moves many of the include headers from individual incude/sys directories under the modules directory in to a single top level include directory. This has the advantage of making the build rules cleaner and logically it makes a bit more sense.
2010-09-05 00:26:23 +04:00
#include <sys/cmn_err.h>
#include <sys/kmem.h>
#include <sys/kmem_cache.h>
#include <sys/vmem.h>
#include <sys/misc.h>
Support custom build directories and move includes One of the neat tricks an autoconf style project is capable of is allow configurion/building in a directory other than the source directory. The major advantage to this is that you can build the project various different ways while making changes in a single source tree. For example, this project is designed to work on various different Linux distributions each of which work slightly differently. This means that changes need to verified on each of those supported distributions perferably before the change is committed to the public git repo. Using nfs and custom build directories makes this much easier. I now have a single source tree in nfs mounted on several different systems each running a supported distribution. When I make a change to the source base I suspect may break things I can concurrently build from the same source on all the systems each in their own subdirectory. wget -c http://github.com/downloads/behlendorf/zfs/zfs-x.y.z.tar.gz tar -xzf zfs-x.y.z.tar.gz cd zfs-x-y-z ------------------------- run concurrently ---------------------- <ubuntu system> <fedora system> <debian system> <rhel6 system> mkdir ubuntu mkdir fedora mkdir debian mkdir rhel6 cd ubuntu cd fedora cd debian cd rhel6 ../configure ../configure ../configure ../configure make make make make make check make check make check make check This change also moves many of the include headers from individual incude/sys directories under the modules directory in to a single top level include directory. This has the advantage of making the build rules cleaner and logically it makes a bit more sense.
2010-09-05 00:26:23 +04:00
#include <sys/taskq.h>
#include <sys/param.h>
#include <sys/disp.h>
#include <sys/debug.h>
#include <sys/random.h>
#include <sys/string.h>
Support custom build directories and move includes One of the neat tricks an autoconf style project is capable of is allow configurion/building in a directory other than the source directory. The major advantage to this is that you can build the project various different ways while making changes in a single source tree. For example, this project is designed to work on various different Linux distributions each of which work slightly differently. This means that changes need to verified on each of those supported distributions perferably before the change is committed to the public git repo. Using nfs and custom build directories makes this much easier. I now have a single source tree in nfs mounted on several different systems each running a supported distribution. When I make a change to the source base I suspect may break things I can concurrently build from the same source on all the systems each in their own subdirectory. wget -c http://github.com/downloads/behlendorf/zfs/zfs-x.y.z.tar.gz tar -xzf zfs-x.y.z.tar.gz cd zfs-x-y-z ------------------------- run concurrently ---------------------- <ubuntu system> <fedora system> <debian system> <rhel6 system> mkdir ubuntu mkdir fedora mkdir debian mkdir rhel6 cd ubuntu cd fedora cd debian cd rhel6 ../configure ../configure ../configure ../configure make make make make make check make check make check make check This change also moves many of the include headers from individual incude/sys directories under the modules directory in to a single top level include directory. This has the advantage of making the build rules cleaner and logically it makes a bit more sense.
2010-09-05 00:26:23 +04:00
#include <sys/byteorder.h>
#include <sys/list.h>
#include <sys/time.h>
#include <sys/zone.h>
#include <sys/kstat.h>
Support custom build directories and move includes One of the neat tricks an autoconf style project is capable of is allow configurion/building in a directory other than the source directory. The major advantage to this is that you can build the project various different ways while making changes in a single source tree. For example, this project is designed to work on various different Linux distributions each of which work slightly differently. This means that changes need to verified on each of those supported distributions perferably before the change is committed to the public git repo. Using nfs and custom build directories makes this much easier. I now have a single source tree in nfs mounted on several different systems each running a supported distribution. When I make a change to the source base I suspect may break things I can concurrently build from the same source on all the systems each in their own subdirectory. wget -c http://github.com/downloads/behlendorf/zfs/zfs-x.y.z.tar.gz tar -xzf zfs-x.y.z.tar.gz cd zfs-x-y-z ------------------------- run concurrently ---------------------- <ubuntu system> <fedora system> <debian system> <rhel6 system> mkdir ubuntu mkdir fedora mkdir debian mkdir rhel6 cd ubuntu cd fedora cd debian cd rhel6 ../configure ../configure ../configure ../configure make make make make make check make check make check make check This change also moves many of the include headers from individual incude/sys directories under the modules directory in to a single top level include directory. This has the advantage of making the build rules cleaner and logically it makes a bit more sense.
2010-09-05 00:26:23 +04:00
#include <sys/zfs_debug.h>
OpenZFS 5997 - FRU field not set during pool creation and never updated Authored by: Hans Rosenfeld <hans.rosenfeld@nexenta.com> Reviewed by: Dan Fields <dan.fields@nexenta.com> Reviewed by: Josef Sipek <josef.sipek@nexenta.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Signed-off-by: Don Brady <don.brady@intel.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/5997 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1437283 Porting Notes: In addition to the OpenZFS changes this patch realigns the events with those found in OpenZFS. Events which would be logged as sysevents on illumos have been been mapped to the 'sysevent' class for Linux. In addition, several subclass names have been changed to match what is used in OpenZFS. In all cases this means a '.' was changed to an '_' in the subclass. The scripts provided by ZoL have been updated, however users which provide scripts for any of the following events will need to rename them based on the new subclass names. ereport.fs.zfs.config.sync sysevent.fs.zfs.config_sync ereport.fs.zfs.zpool.destroy sysevent.fs.zfs.pool_destroy ereport.fs.zfs.zpool.reguid sysevent.fs.zfs.pool_reguid ereport.fs.zfs.vdev.remove sysevent.fs.zfs.vdev_remove ereport.fs.zfs.vdev.clear sysevent.fs.zfs.vdev_clear ereport.fs.zfs.vdev.check sysevent.fs.zfs.vdev_check ereport.fs.zfs.vdev.spare sysevent.fs.zfs.vdev_spare ereport.fs.zfs.vdev.autoexpand sysevent.fs.zfs.vdev_autoexpand ereport.fs.zfs.resilver.start sysevent.fs.zfs.resilver_start ereport.fs.zfs.resilver.finish sysevent.fs.zfs.resilver_finish ereport.fs.zfs.scrub.start sysevent.fs.zfs.scrub_start ereport.fs.zfs.scrub.finish sysevent.fs.zfs.scrub_finish ereport.fs.zfs.bootfs.vdev.attach sysevent.fs.zfs.bootfs_vdev_attach
2016-07-28 01:29:15 +03:00
#include <sys/sysevent.h>
#include <sys/sysevent/eventdefs.h>
Illumos #4045 write throttle & i/o scheduler performance work 4045 zfs write throttle & i/o scheduler performance work 1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync read, sync write, async read, async write, and scrub/resilver. The scheduler issues a number of concurrent i/os from each class to the device. Once a class has been selected, an i/o is selected from this class using either an elevator algorithem (async, scrub classes) or FIFO (sync classes). The number of concurrent async write i/os is tuned dynamically based on i/o load, to achieve good sync i/o latency when there is not a high load of writes, and good write throughput when there is. See the block comment in vdev_queue.c (reproduced below) for more details. 2. The write throttle (dsl_pool_tempreserve_space() and txg_constrain_throughput()) is rewritten to produce much more consistent delays when under constant load. The new write throttle is based on the amount of dirty data, rather than guesses about future performance of the system. When there is a lot of dirty data, each transaction (e.g. write() syscall) will be delayed by the same small amount. This eliminates the "brick wall of wait" that the old write throttle could hit, causing all transactions to wait several seconds until the next txg opens. One of the keys to the new write throttle is decrementing the amount of dirty data as i/o completes, rather than at the end of spa_sync(). Note that the write throttle is only applied once the i/o scheduler is issuing the maximum number of outstanding async writes. See the block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for more details. This diff has several other effects, including: * the commonly-tuned global variable zfs_vdev_max_pending has been removed; use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead. * the size of each txg (meaning the amount of dirty data written, and thus the time it takes to write out) is now controlled differently. There is no longer an explicit time goal; the primary determinant is amount of dirty data. Systems that are under light or medium load will now often see that a txg is always syncing, but the impact to performance (e.g. read latency) is minimal. Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this. * zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression, checksum, etc. This improves latency by not allowing these CPU-intensive tasks to consume all CPU (on machines with at least 4 CPU's; the percentage is rounded up). --matt APPENDIX: problems with the current i/o scheduler The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem with this is that if there are always i/os pending, then certain classes of i/os can see very long delays. For example, if there are always synchronous reads outstanding, then no async writes will be serviced until they become "past due". One symptom of this situation is that each pass of the txg sync takes at least several seconds (typically 3 seconds). If many i/os become "past due" (their deadline is in the past), then we must service all of these overdue i/os before any new i/os. This happens when we enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in the future. If we can't complete all the i/os in 2.5 seconds (e.g. because there were always reads pending), then these i/os will become past due. Now we must service all the "async" writes (which could be hundreds of megabytes) before we service any reads, introducing considerable latency to synchronous i/os (reads or ZIL writes). Notes on porting to ZFS on Linux: - zio_t gained new members io_physdone and io_phys_children. Because object caches in the Linux port call the constructor only once at allocation time, objects may contain residual data when retrieved from the cache. Therefore zio_create() was updated to zero out the two new fields. - vdev_mirror_pending() relied on the depth of the per-vdev pending queue (vq->vq_pending_tree) to select the least-busy leaf vdev to read from. This tree has been replaced by vq->vq_active_tree which is now used for the same purpose. - vdev_queue_init() used the value of zfs_vdev_max_pending to determine the number of vdev I/O buffers to pre-allocate. That global no longer exists, so we instead use the sum of the *_max_active values for each of the five I/O classes described above. - The Illumos implementation of dmu_tx_delay() delays a transaction by sleeping in condition variable embedded in the thread (curthread->t_delay_cv). We do not have an equivalent CV to use in Linux, so this change replaced the delay logic with a wrapper called zfs_sleep_until(). This wrapper could be adopted upstream and in other downstream ports to abstract away operating system-specific delay logic. - These tunables are added as module parameters, and descriptions added to the zfs-module-parameters.5 man page. spa_asize_inflation zfs_deadman_synctime_ms zfs_vdev_max_active zfs_vdev_async_write_active_min_dirty_percent zfs_vdev_async_write_active_max_dirty_percent zfs_vdev_async_read_max_active zfs_vdev_async_read_min_active zfs_vdev_async_write_max_active zfs_vdev_async_write_min_active zfs_vdev_scrub_max_active zfs_vdev_scrub_min_active zfs_vdev_sync_read_max_active zfs_vdev_sync_read_min_active zfs_vdev_sync_write_max_active zfs_vdev_sync_write_min_active zfs_dirty_data_max_percent zfs_delay_min_dirty_percent zfs_dirty_data_max_max_percent zfs_dirty_data_max zfs_dirty_data_max_max zfs_dirty_data_sync zfs_delay_scale The latter four have type unsigned long, whereas they are uint64_t in Illumos. This accommodates Linux's module_param() supported types, but means they may overflow on 32-bit architectures. The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most likely to overflow on 32-bit systems, since they express physical RAM sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to 2^32 which does overflow. To resolve that, this port instead initializes it in arc_init() to 25% of physical RAM, and adds the tunable zfs_dirty_data_max_max_percent to override that percentage. While this solution doesn't completely avoid the overflow issue, it should be a reasonable default for most systems, and the minority of affected systems can work around the issue by overriding the defaults. - Fixed reversed logic in comment above zfs_delay_scale declaration. - Clarified comments in vdev_queue.c regarding when per-queue minimums take effect. - Replaced dmu_tx_write_limit in the dmu_tx kstat file with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts how many times a transaction has been delayed because the pool dirty data has exceeded zfs_delay_min_dirty_percent. The latter counts how many times the pool dirty data has exceeded zfs_dirty_data_max (which we expect to never happen). - The original patch would have regressed the bug fixed in zfsonlinux/zfs@c418410, which prevented users from setting the zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE. A similar fix is added to vdev_queue_aggregate(). - In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the heap instead of the stack. In Linux we can't afford such large structures on the stack. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Ned Bass <bass6@llnl.gov> Reviewed by: Brendan Gregg <brendan.gregg@joyent.com> Approved by: Robert Mustacchi <rm@joyent.com> References: http://www.illumos.org/issues/4045 illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e Ported-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1913
2013-08-29 07:01:20 +04:00
#include <sys/zfs_delay.h>
Support custom build directories and move includes One of the neat tricks an autoconf style project is capable of is allow configurion/building in a directory other than the source directory. The major advantage to this is that you can build the project various different ways while making changes in a single source tree. For example, this project is designed to work on various different Linux distributions each of which work slightly differently. This means that changes need to verified on each of those supported distributions perferably before the change is committed to the public git repo. Using nfs and custom build directories makes this much easier. I now have a single source tree in nfs mounted on several different systems each running a supported distribution. When I make a change to the source base I suspect may break things I can concurrently build from the same source on all the systems each in their own subdirectory. wget -c http://github.com/downloads/behlendorf/zfs/zfs-x.y.z.tar.gz tar -xzf zfs-x.y.z.tar.gz cd zfs-x-y-z ------------------------- run concurrently ---------------------- <ubuntu system> <fedora system> <debian system> <rhel6 system> mkdir ubuntu mkdir fedora mkdir debian mkdir rhel6 cd ubuntu cd fedora cd debian cd rhel6 ../configure ../configure ../configure ../configure make make make make make check make check make check make check This change also moves many of the include headers from individual incude/sys directories under the modules directory in to a single top level include directory. This has the advantage of making the build rules cleaner and logically it makes a bit more sense.
2010-09-05 00:26:23 +04:00
#include <sys/sunddi.h>
#include <sys/ctype.h>
#include <sys/disp.h>
Swap DTRACE_PROBE* with Linux tracepoints This patch leverages Linux tracepoints from within the ZFS on Linux code base. It also refactors the debug code to bring it back in sync with Illumos. The information exported via tracepoints can be used for a variety of reasons (e.g. debugging, tuning, general exploration/understanding, etc). It is advantageous to use Linux tracepoints as the mechanism to export this kind of information (as opposed to something else) for a number of reasons: * A number of external tools can make use of our tracepoints "automatically" (e.g. perf, systemtap) * Tracepoints are designed to be extremely cheap when disabled * It's one of the "accepted" ways to export this kind of information; many other kernel subsystems use tracepoints too. Unfortunately, though, there are a few caveats as well: * Linux tracepoints appear to only be available to GPL licensed modules due to the way certain kernel functions are exported. Thus, to actually make use of the tracepoints introduced by this patch, one might have to patch and re-compile the kernel; exporting the necessary functions to non-GPL modules. * Prior to upstream kernel version v3.14-rc6-30-g66cc69e, Linux tracepoints are not available for unsigned kernel modules (tracepoints will get disabled due to the module's 'F' taint). Thus, one either has to sign the zfs kernel module prior to loading it, or use a kernel versioned v3.14-rc6-30-g66cc69e or newer. Assuming the above two requirements are satisfied, lets look at an example of how this patch can be used and what information it exposes (all commands run as 'root'): # list all zfs tracepoints available $ ls /sys/kernel/debug/tracing/events/zfs enable filter zfs_arc__delete zfs_arc__evict zfs_arc__hit zfs_arc__miss zfs_l2arc__evict zfs_l2arc__hit zfs_l2arc__iodone zfs_l2arc__miss zfs_l2arc__read zfs_l2arc__write zfs_new_state__mfu zfs_new_state__mru # enable all zfs tracepoints, clear the tracepoint ring buffer $ echo 1 > /sys/kernel/debug/tracing/events/zfs/enable $ echo 0 > /sys/kernel/debug/tracing/trace # import zpool called 'tank', inspect tracepoint data (each line was # truncated, they're too long for a commit message otherwise) $ zpool import tank $ cat /sys/kernel/debug/tracing/trace | head -n35 # tracer: nop # # entries-in-buffer/entries-written: 1219/1219 #P:8 # # _-----=> irqs-off # / _----=> need-resched # | / _---=> hardirq/softirq # || / _--=> preempt-depth # ||| / delay # TASK-PID CPU# |||| TIMESTAMP FUNCTION # | | | |||| | | lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss: hdr... z_rd_int/0-30156 [003] .... 91344.200611: zfs_new_state__mru... lt-zpool-30132 [003] .... 91344.201173: zfs_arc__miss: hdr... z_rd_int/1-30157 [003] .... 91344.201756: zfs_new_state__mru... lt-zpool-30132 [003] .... 91344.201795: zfs_arc__miss: hdr... z_rd_int/2-30158 [003] .... 91344.202099: zfs_new_state__mru... lt-zpool-30132 [003] .... 91344.202126: zfs_arc__hit: hdr ... lt-zpool-30132 [003] .... 91344.202130: zfs_arc__hit: hdr ... lt-zpool-30132 [003] .... 91344.202134: zfs_arc__hit: hdr ... lt-zpool-30132 [003] .... 91344.202146: zfs_arc__miss: hdr... z_rd_int/3-30159 [003] .... 91344.202457: zfs_new_state__mru... lt-zpool-30132 [003] .... 91344.202484: zfs_arc__miss: hdr... z_rd_int/4-30160 [003] .... 91344.202866: zfs_new_state__mru... lt-zpool-30132 [003] .... 91344.202891: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.203034: zfs_arc__miss: hdr... z_rd_iss/1-30149 [001] .... 91344.203749: zfs_new_state__mru... lt-zpool-30132 [001] .... 91344.203789: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.203878: zfs_arc__miss: hdr... z_rd_iss/3-30151 [001] .... 91344.204315: zfs_new_state__mru... lt-zpool-30132 [001] .... 91344.204332: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.204337: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.204352: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.204356: zfs_arc__hit: hdr ... lt-zpool-30132 [001] .... 91344.204360: zfs_arc__hit: hdr ... To highlight the kind of detailed information that is being exported using this infrastructure, I've taken the first tracepoint line from the output above and reformatted it such that it fits in 80 columns: lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss: hdr { dva 0x1:0x40082 birth 15491 cksum0 0x163edbff3a flags 0x640 datacnt 1 type 1 size 2048 spa 3133524293419867460 state_type 0 access 0 mru_hits 0 mru_ghost_hits 0 mfu_hits 0 mfu_ghost_hits 0 l2_hits 0 refcount 1 } bp { dva0 0x1:0x40082 dva1 0x1:0x3000e5 dva2 0x1:0x5a006e cksum 0x163edbff3a:0x75af30b3dd6:0x1499263ff5f2b:0x288bd118815e00 lsize 2048 } zb { objset 0 object 0 level -1 blkid 0 } For the specific tracepoint shown here, 'zfs_arc__miss', data is exported detailing the arc_buf_hdr_t (hdr), blkptr_t (bp), and zbookmark_t (zb) that caused the ARC miss (down to the exact DVA!). This kind of precise and detailed information can be extremely valuable when trying to answer certain kinds of questions. For anybody unfamiliar but looking to build on this, I found the XFS source code along with the following three web links to be extremely helpful: * http://lwn.net/Articles/379903/ * http://lwn.net/Articles/381064/ * http://lwn.net/Articles/383362/ I should also node the more "boring" aspects of this patch: * The ZFS_LINUX_COMPILE_IFELSE autoconf macro was modified to support a sixth paramter. This parameter is used to populate the contents of the new conftest.h file. If no sixth parameter is provided, conftest.h will be empty. * The ZFS_LINUX_TRY_COMPILE_HEADER autoconf macro was introduced. This macro is nearly identical to the ZFS_LINUX_TRY_COMPILE macro, except it has support for a fifth option that is then passed as the sixth parameter to ZFS_LINUX_COMPILE_IFELSE. These autoconf changes were needed to test the availability of the Linux tracepoint macros. Due to the odd nature of the Linux tracepoint macro API, a separate ".h" must be created (the path and filename is used internally by the kernel's define_trace.h file). * The HAVE_DECLARE_EVENT_CLASS autoconf macro was introduced. This is to determine if we can safely enable the Linux tracepoint functionality. We need to selectively disable the tracepoint code due to the kernel exporting certain functions as GPL only. Without this check, the build process will fail at link time. In addition, the SET_ERROR macro was modified into a tracepoint as well. To do this, the 'sdt.h' file was moved into the 'include/sys' directory and now contains a userspace portion and a kernel space portion. The dprintf and zfs_dbgmsg* interfaces are now implemented as tracepoint as well. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-06-13 21:54:48 +04:00
#include <sys/trace.h>
#include <sys/procfs_list.h>
#include <sys/mod.h>
#include <sys/uio_impl.h>
#include <sys/zfs_context_os.h>
#else /* _KERNEL || _STANDALONE */
2008-12-11 22:14:49 +03:00
#define _SYS_VFS_H
#define _SYS_SUNDDI_H
#define _SYS_CALLB_H
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
#include <stdarg.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>
#include <setjmp.h>
2008-12-11 22:14:49 +03:00
#include <assert.h>
#include <limits.h>
#include <atomic.h>
#include <dirent.h>
#include <time.h>
#include <ctype.h>
#include <signal.h>
#include <sys/mman.h>
2008-12-11 22:14:49 +03:00
#include <sys/types.h>
#include <sys/cred.h>
#include <sys/sysmacros.h>
#include <sys/resource.h>
#include <sys/byteorder.h>
#include <sys/list.h>
#include <sys/mod.h>
2008-12-11 22:14:49 +03:00
#include <sys/uio.h>
#include <sys/zfs_debug.h>
#include <sys/kstat.h>
#include <sys/u8_textprep.h>
OpenZFS 5997 - FRU field not set during pool creation and never updated Authored by: Hans Rosenfeld <hans.rosenfeld@nexenta.com> Reviewed by: Dan Fields <dan.fields@nexenta.com> Reviewed by: Josef Sipek <josef.sipek@nexenta.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Signed-off-by: Don Brady <don.brady@intel.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/5997 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1437283 Porting Notes: In addition to the OpenZFS changes this patch realigns the events with those found in OpenZFS. Events which would be logged as sysevents on illumos have been been mapped to the 'sysevent' class for Linux. In addition, several subclass names have been changed to match what is used in OpenZFS. In all cases this means a '.' was changed to an '_' in the subclass. The scripts provided by ZoL have been updated, however users which provide scripts for any of the following events will need to rename them based on the new subclass names. ereport.fs.zfs.config.sync sysevent.fs.zfs.config_sync ereport.fs.zfs.zpool.destroy sysevent.fs.zfs.pool_destroy ereport.fs.zfs.zpool.reguid sysevent.fs.zfs.pool_reguid ereport.fs.zfs.vdev.remove sysevent.fs.zfs.vdev_remove ereport.fs.zfs.vdev.clear sysevent.fs.zfs.vdev_clear ereport.fs.zfs.vdev.check sysevent.fs.zfs.vdev_check ereport.fs.zfs.vdev.spare sysevent.fs.zfs.vdev_spare ereport.fs.zfs.vdev.autoexpand sysevent.fs.zfs.vdev_autoexpand ereport.fs.zfs.resilver.start sysevent.fs.zfs.resilver_start ereport.fs.zfs.resilver.finish sysevent.fs.zfs.resilver_finish ereport.fs.zfs.scrub.start sysevent.fs.zfs.scrub_start ereport.fs.zfs.scrub.finish sysevent.fs.zfs.scrub_finish ereport.fs.zfs.bootfs.vdev.attach sysevent.fs.zfs.bootfs_vdev_attach
2016-07-28 01:29:15 +03:00
#include <sys/sysevent.h>
#include <sys/sysevent/eventdefs.h>
#include <sys/sunddi.h>
#include <sys/debug.h>
#include <sys/utsname.h>
#include <sys/trace_zfs.h>
2008-12-11 22:14:49 +03:00
#include <sys/mutex.h>
#include <sys/rwlock.h>
#include <sys/condvar.h>
#include <sys/cmn_err.h>
#include <sys/thread.h>
#include <sys/taskq.h>
#include <sys/tsd.h>
#include <sys/procfs_list.h>
#include <sys/kmem.h>
#include <sys/zfs_delay.h>
#include <sys/vnode.h>
#include <sys/zfs_context_os.h>
/*
* Stack
*/
#define noinline __attribute__((noinline))
Performance optimization of AVL tree comparator functions perf: 2.75x faster ddt_entry_compare() First 256bits of ddt_key_t is a block checksum, which are expected to be close to random data. Hence, on average, comparison only needs to look at first few bytes of the keys. To reduce number of conditional jump instructions, the result is computed as: sign(memcmp(k1, k2)). Sign of an integer 'a' can be obtained as: `(0 < a) - (a < 0)` := {-1, 0, 1} , which is computed efficiently. Synthetic performance evaluation of original and new algorithm over 1G random keys on 2.6GHz Intel(R) Xeon(R) CPU E5-2660 v3: old 6.85789 s new 2.49089 s perf: 2.8x faster vdev_queue_offset_compare() and vdev_queue_timestamp_compare() Compute the result directly instead of using conditionals perf: zfs_range_compare() Speedup between 1.1x - 2.5x, depending on compiler version and optimization level. perf: spa_error_entry_compare() `bcmp()` is not suitable for comparator use. Use `memcmp()` instead. perf: 2.8x faster metaslab_compare() and metaslab_rangesize_compare() perf: 2.8x faster zil_bp_compare() perf: 2.8x faster mze_compare() perf: faster dbuf_compare() perf: faster compares in spa_misc perf: 2.8x faster layout_hash_compare() perf: 2.8x faster space_reftree_compare() perf: libzfs: faster avl tree comparators perf: guid_compare() perf: dsl_deadlist_compare() perf: perm_set_compare() perf: 2x faster range_tree_seg_compare() perf: faster unique_compare() perf: faster vdev_cache _compare() perf: faster vdev_uberblock_compare() perf: faster fuid _compare() perf: faster zfs_znode_hold_compare() Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Richard Elling <richard.elling@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5033
2016-08-27 21:12:53 +03:00
#define likely(x) __builtin_expect((x), 1)
Implement Redacted Send/Receive Redacted send/receive allows users to send subsets of their data to a target system. One possible use case for this feature is to not transmit sensitive information to a data warehousing, test/dev, or analytics environment. Another is to save space by not replicating unimportant data within a given dataset, for example in backup tools like zrepl. Redacted send/receive is a three-stage process. First, a clone (or clones) is made of the snapshot to be sent to the target. In this clone (or clones), all unnecessary or unwanted data is removed or modified. This clone is then snapshotted to create the "redaction snapshot" (or snapshots). Second, the new zfs redact command is used to create a redaction bookmark. The redaction bookmark stores the list of blocks in a snapshot that were modified by the redaction snapshot(s). Finally, the redaction bookmark is passed as a parameter to zfs send. When sending to the snapshot that was redacted, the redaction bookmark is used to filter out blocks that contain sensitive or unwanted information, and those blocks are not included in the send stream. When sending from the redaction bookmark, the blocks it contains are considered as candidate blocks in addition to those blocks in the destination snapshot that were modified since the creation_txg of the redaction bookmark. This step is necessary to allow the target to rehydrate data in the case where some blocks are accidentally or unnecessarily modified in the redaction snapshot. The changes to bookmarks to enable fast space estimation involve adding deadlists to bookmarks. There is also logic to manage the life cycles of these deadlists. The new size estimation process operates in cases where previously an accurate estimate could not be provided. In those cases, a send is performed where no data blocks are read, reducing the runtime significantly and providing a byte-accurate size estimate. Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Prashanth Sreenivasa <pks@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Chris Williamson <chris.williamson@delphix.com> Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com> Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #7958
2019-06-19 19:48:13 +03:00
#define unlikely(x) __builtin_expect((x), 0)
2008-12-11 22:14:49 +03:00
/*
* DTrace SDT probes have different signatures in userland than they do in
* the kernel. If they're being used in kernel code, re-define them out of
2008-12-11 22:14:49 +03:00
* existence for their counterparts in libzpool.
*
* Here's an example of how to use the set-error probes in userland:
* zfs$target:::set-error /arg0 == EBUSY/ {stack();}
*
* Here's an example of how to use DTRACE_PROBE probes in userland:
* If there is a probe declared as follows:
* DTRACE_PROBE2(zfs__probe_name, uint64_t, blkid, dnode_t *, dn);
* Then you can use it as follows:
* zfs$target:::probe2 /copyinstr(arg0) == "zfs__probe_name"/
* {printf("%u %p\n", arg1, arg2);}
2008-12-11 22:14:49 +03:00
*/
#ifdef DTRACE_PROBE
#undef DTRACE_PROBE
#endif /* DTRACE_PROBE */
#define DTRACE_PROBE(a)
2008-12-11 22:14:49 +03:00
#ifdef DTRACE_PROBE1
#undef DTRACE_PROBE1
#endif /* DTRACE_PROBE1 */
#define DTRACE_PROBE1(a, b, c)
2008-12-11 22:14:49 +03:00
#ifdef DTRACE_PROBE2
#undef DTRACE_PROBE2
#endif /* DTRACE_PROBE2 */
#define DTRACE_PROBE2(a, b, c, d, e)
2008-12-11 22:14:49 +03:00
#ifdef DTRACE_PROBE3
#undef DTRACE_PROBE3
#endif /* DTRACE_PROBE3 */
#define DTRACE_PROBE3(a, b, c, d, e, f, g)
2008-12-11 22:14:49 +03:00
#ifdef DTRACE_PROBE4
#undef DTRACE_PROBE4
#endif /* DTRACE_PROBE4 */
#define DTRACE_PROBE4(a, b, c, d, e, f, g, h, i)
2008-12-11 22:14:49 +03:00
#ifdef __FreeBSD__
typedef off_t loff_t;
#endif
extern char *vn_dumpdir;
2008-12-11 22:14:49 +03:00
/*
* Random stuff
*/
#define max_ncpus 64
Illumos 5376 - arc_kmem_reap_now() should not result in clearing arc_no_grow 5376 arc_kmem_reap_now() should not result in clearing arc_no_grow Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Reviewed by: Richard Elling <richard.elling@richardelling.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5376 https://github.com/illumos/illumos-gate/commit/2ec99e3 Porting Notes: The good news is that many of the recent changes made upstream to the ARC tackled issues previously observed by ZoL with similar solutions. The bad news is those solution weren't identical to the ones we applied. This patch is designed to split the difference and apply as much of the upstream work as possible. * The arc_available_memory() function was removed previous in ZoL but due to the upstream changes it makes sense to add it back. This function has been customized for Linux so that it can be used to determine a low memory. This provides the same basic functionality as the illumos version allowing us to minimize changes through the rest of the code base. The exact mechanism used to detect a low memory state remains unchanged so this change isn't a significant as it might first appear. * This patch includes the long standing fix for arc_shrink() which was originally proposed in #2167. Since there were related changes to this function it made sense to include that work. * The arc_init() function has been re-factored. As before it sets sane default values for the ARC but then calls arc_tuning_update() to apply user specific tuning made via module options. The arc_tuning_update() function is then called periodically by the arc_reclaim_thread() to apply changes to the tunings made during normal operation. Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3616 Closes #2167
2015-06-26 21:28:18 +03:00
#define boot_ncpus (sysconf(_SC_NPROCESSORS_ONLN))
2008-12-11 22:14:49 +03:00
Align thread priority with Linux defaults Under Linux filesystem threads responsible for handling I/O are normally created with the maximum priority. Non-I/O filesystem processes run with the default priority. ZFS should adopt the same priority scheme under Linux to maintain good performance and so that it will complete fairly when other Linux filesystems are active. The priorities have been updated to the following: $ ps -eLo rtprio,cls,pid,pri,nice,cmd | egrep 'z_|spl_|zvol|arc|dbu|meta' - TS 10743 19 -20 [spl_kmem_cache] - TS 10744 19 -20 [spl_system_task] - TS 10745 19 -20 [spl_dynamic_tas] - TS 10764 19 0 [dbu_evict] - TS 10765 19 0 [arc_prune] - TS 10766 19 0 [arc_reclaim] - TS 10767 19 0 [arc_user_evicts] - TS 10768 19 0 [l2arc_feed] - TS 10769 39 0 [z_unmount] - TS 10770 39 -20 [zvol] - TS 11011 39 -20 [z_null_iss] - TS 11012 39 -20 [z_null_int] - TS 11013 39 -20 [z_rd_iss] - TS 11014 39 -20 [z_rd_int_0] - TS 11022 38 -19 [z_wr_iss] - TS 11023 39 -20 [z_wr_iss_h] - TS 11024 39 -20 [z_wr_int_0] - TS 11032 39 -20 [z_wr_int_h] - TS 11033 39 -20 [z_fr_iss_0] - TS 11041 39 -20 [z_fr_int] - TS 11042 39 -20 [z_cl_iss] - TS 11043 39 -20 [z_cl_int] - TS 11044 39 -20 [z_ioctl_iss] - TS 11045 39 -20 [z_ioctl_int] - TS 11046 39 -20 [metaslab_group_] - TS 11050 19 0 [z_iput] - TS 11121 38 -19 [z_wr_iss] Note that under Linux the meaning of a processes priority is inverted with respect to illumos. High values on Linux indicate a _low_ priority while high value on illumos indicate a _high_ priority. In order to preserve the logical meaning of the minclsyspri and maxclsyspri macros when they are used by the illumos wrapper functions their values have been inverted. This way when changes are merged from upstream illumos we won't need to remember to invert the macro. It could also lead to confusion. This patch depends on https://github.com/zfsonlinux/spl/pull/466. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #3607
2015-07-24 20:08:31 +03:00
/*
* Process priorities as defined by setpriority(2) and getpriority(2).
*/
#define minclsyspri 19
#define defclsyspri 0
/* Write issue taskq priority. */
#define wtqclsyspri -19
#define maxclsyspri -20
2008-12-11 22:14:49 +03:00
#define CPU_SEQID ((uintptr_t)pthread_self() & (max_ncpus - 1))
#define CPU_SEQID_UNSTABLE CPU_SEQID
2008-12-11 22:14:49 +03:00
#define NN_DIVISOR_1000 (1U << 0)
#define NN_NUMBUF_SZ (6)
2008-12-11 22:14:49 +03:00
extern uint64_t physmem;
extern const char *random_path;
extern const char *urandom_path;
2008-12-11 22:14:49 +03:00
extern int highbit64(uint64_t i);
Add -lhHpw options to "zpool iostat" for avg latency, histograms, & queues Update the zfs module to collect statistics on average latencies, queue sizes, and keep an internal histogram of all IO latencies. Along with this, update "zpool iostat" with some new options to print out the stats: -l: Include average IO latencies stats: total_wait disk_wait syncq_wait asyncq_wait scrub read write read write read write read write wait ----- ----- ----- ----- ----- ----- ----- ----- ----- - 41ms - 2ms - 46ms - 4ms - - 5ms - 1ms - 1us - 4ms - - 5ms - 1ms - 1us - 4ms - - - - - - - - - - - 49ms - 2ms - 47ms - - - - - - - - - - - - - 2ms - 1ms - - - 1ms - ----- ----- ----- ----- ----- ----- ----- ----- ----- 1ms 1ms 1ms 413us 16us 25us - 5ms - 1ms 1ms 1ms 413us 16us 25us - 5ms - 2ms 1ms 2ms 412us 26us 25us - 5ms - - 1ms - 413us - 25us - 5ms - - 1ms - 460us - 29us - 5ms - 196us 1ms 196us 370us 7us 23us - 5ms - ----- ----- ----- ----- ----- ----- ----- ----- ----- -w: Print out latency histograms: sdb total disk sync_queue async_queue latency read write read write read write read write scrub ------- ------ ------ ------ ------ ------ ------ ------ ------ ------ 1ns 0 0 0 0 0 0 0 0 0 ... 33us 0 0 0 0 0 0 0 0 0 66us 0 0 107 2486 2 788 12 12 0 131us 2 797 359 4499 10 558 184 184 6 262us 22 801 264 1563 10 286 287 287 24 524us 87 575 71 52086 15 1063 136 136 92 1ms 152 1190 5 41292 4 1693 252 252 141 2ms 245 2018 0 50007 0 2322 371 371 220 4ms 189 7455 22 162957 0 3912 6726 6726 199 8ms 108 9461 0 102320 0 5775 2526 2526 86 17ms 23 11287 0 37142 0 8043 1813 1813 19 34ms 0 14725 0 24015 0 11732 3071 3071 0 67ms 0 23597 0 7914 0 18113 5025 5025 0 134ms 0 33798 0 254 0 25755 7326 7326 0 268ms 0 51780 0 12 0 41593 10002 10002 0 537ms 0 77808 0 0 0 64255 13120 13120 0 1s 0 105281 0 0 0 83805 20841 20841 0 2s 0 88248 0 0 0 73772 14006 14006 0 4s 0 47266 0 0 0 29783 17176 17176 0 9s 0 10460 0 0 0 4130 6295 6295 0 17s 0 0 0 0 0 0 0 0 0 34s 0 0 0 0 0 0 0 0 0 69s 0 0 0 0 0 0 0 0 0 137s 0 0 0 0 0 0 0 0 0 ------------------------------------------------------------------------------- -h: Help -H: Scripted mode. Do not display headers, and separate fields by a single tab instead of arbitrary space. -q: Include current number of entries in sync & async read/write queues, and scrub queue: syncq_read syncq_write asyncq_read asyncq_write scrubq_read pend activ pend activ pend activ pend activ pend activ ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- 0 0 0 0 78 29 0 0 0 0 0 0 0 0 78 29 0 0 0 0 0 0 0 0 0 0 0 0 0 0 - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 - - - - - - - - - - 0 0 0 0 0 0 0 0 0 0 ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- 0 0 227 394 0 19 0 0 0 0 0 0 227 394 0 19 0 0 0 0 0 0 108 98 0 19 0 0 0 0 0 0 19 98 0 0 0 0 0 0 0 0 78 98 0 0 0 0 0 0 0 0 19 88 0 0 0 0 0 0 ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- -p: Display numbers in parseable (exact) values. Also, update iostat syntax to allow the user to specify specific vdevs to show statistics for. The three options for choosing pools/vdevs are: Display a list of pools: zpool iostat ... [pool ...] Display a list of vdevs from a specific pool: zpool iostat ... [pool vdev ...] Display a list of vdevs from any pools: zpool iostat ... [vdev ...] Lastly, allow zpool command "interval" value to be floating point: zpool iostat -v 0.5 Signed-off-by: Tony Hutter <hutter2@llnl.gov Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4433
2016-02-29 21:05:23 +03:00
extern int lowbit64(uint64_t i);
2008-12-11 22:14:49 +03:00
extern int random_get_bytes(uint8_t *ptr, size_t len);
extern int random_get_pseudo_bytes(uint8_t *ptr, size_t len);
static __inline__ uint32_t
random_in_range(uint32_t range)
{
uint32_t r;
ASSERT(range != 0);
if (range == 1)
return (0);
(void) random_get_pseudo_bytes((uint8_t *)&r, sizeof (r));
return (r % range);
}
extern void random_init(void);
extern void random_fini(void);
2008-12-11 22:14:49 +03:00
typedef struct callb_cpr {
kmutex_t *cc_lockp;
} callb_cpr_t;
#define CALLB_CPR_INIT(cp, lockp, func, name) { \
(cp)->cc_lockp = lockp; \
}
#define CALLB_CPR_SAFE_BEGIN(cp) { \
ASSERT(MUTEX_HELD((cp)->cc_lockp)); \
}
#define CALLB_CPR_SAFE_END(cp, lockp) { \
ASSERT(MUTEX_HELD((cp)->cc_lockp)); \
}
#define CALLB_CPR_EXIT(cp) { \
ASSERT(MUTEX_HELD((cp)->cc_lockp)); \
mutex_exit((cp)->cc_lockp); \
}
#define zone_dataset_visible(x, y) (1)
#define INGLOBALZONE(z) (1)
OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery Some work has been done lately to improve the debugability of the ZFS pool load (and import) process. This includes: 7638 Refactor spa_load_impl into several functions 8961 SPA load/import should tell us why it failed 7277 zdb should be able to print zfs_dbgmsg's To iterate on top of that, there's a few changes that were made to make the import process more resilient and crash free. One of the first tasks during the pool load process is to parse a config provided from userland that describes what devices the pool is composed of. A vdev tree is generated from that config, and then all the vdevs are opened. The Meta Object Set (MOS) of the pool is accessed, and several metadata objects that are necessary to load the pool are read. The exact configuration of the pool is also stored inside the MOS. Since the configuration provided from userland is external and might not accurately describe the vdev tree of the pool at the txg that is being loaded, it cannot be relied upon to safely operate the pool. For that reason, the configuration in the MOS is read early on. In the past, the two configurations were compared together and if there was a mismatch then the load process was aborted and an error was returned. The latter was a good way to ensure a pool does not get corrupted, however it made the pool load process needlessly fragile in cases where the vdev configuration changed or the userland configuration was outdated. Since the MOS is stored in 3 copies, the configuration provided by userland doesn't have to be perfect in order to read its contents. Hence, a new approach has been adopted: The pool is first opened with the untrusted userland configuration just so that the real configuration can be read from the MOS. The trusted MOS configuration is then used to generate a new vdev tree and the pool is re-opened. When the pool is opened with an untrusted configuration, writes are disabled to avoid accidentally damaging it. During reads, some sanity checks are performed on block pointers to see if each DVA points to a known vdev; when the configuration is untrusted, instead of panicking the system if those checks fail we simply avoid issuing reads to the invalid DVAs. This new two-step pool load process now allows rewinding pools accross vdev tree changes such as device replacement, addition, etc. Loading a pool from an external config file in a clustering environment also becomes much safer now since the pool will import even if the config is outdated and didn't, for instance, register a recent device addition. With this code in place, it became relatively easy to implement a long-sought-after feature: the ability to import a pool with missing top level (i.e. non-redundant) devices. Note that since this almost guarantees some loss of data, this feature is for now restricted to a read-only import. Porting notes (ZTS): * Fix 'make dist' target in zpool_import * The maximum path length allowed by tar is 99 characters. Several of the new test cases exceeded this limit resulting in them not being included in the tarball. Shorten the names slightly. * Set/get tunables using accessor functions. * Get last synced txg via the "zfs_txg_history" mechanism. * Clear zinject handlers in cleanup for import_cache_device_replaced and import_rewind_device_replaced in order that the zpool can be exported if there is an error. * Increase FILESIZE to 8G in zfs-test.sh to allow for a larger ext4 file system to be created on ZFS_DISK2. Also, there's no need to partition ZFS_DISK2 at all. The partitioning had already been disabled for multipath devices. Among other things, the partitioning steals some space from the ext4 file system, makes it difficult to accurately calculate the paramters to parted and can make some of the tests fail. * Increase FS_SIZE and FILE_SIZE in the zpool_import test configuration now that FILESIZE is larger. * Write more data in order that device evacuation take lonnger in a couple tests. * Use mkdir -p to avoid errors when the directory already exists. * Remove use of sudo in import_rewind_config_changed. Authored by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org> Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Tim Chase <tim@chase2k.com> OpenZFS-issue: https://illumos.org/issues/9075 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123 Closes #7459
2016-07-22 17:39:36 +03:00
extern uint32_t zone_get_hostid(void *zonep);
2008-12-11 22:14:49 +03:00
/*
* Hostname information
*/
extern int ddi_strtoull(const char *str, char **nptr, int base,
u_longlong_t *result);
typedef struct utsname utsname_t;
extern utsname_t *utsname(void);
2008-12-11 22:14:49 +03:00
/* ZFS Boot Related stuff. */
struct _buf {
intptr_t _fd;
};
struct bootstat {
uint64_t st_size;
};
extern int zfs_secpolicy_snapshot_perms(const char *name, cred_t *cr);
extern int zfs_secpolicy_rename_perms(const char *from, const char *to,
cred_t *cr);
extern int zfs_secpolicy_destroy_perms(const char *name, cred_t *cr);
Add `zfs allow` and `zfs unallow` support ZFS allows for specific permissions to be delegated to normal users with the `zfs allow` and `zfs unallow` commands. In addition, non- privileged users should be able to run all of the following commands: * zpool [list | iostat | status | get] * zfs [list | get] Historically this functionality was not available on Linux. In order to add it the secpolicy_* functions needed to be implemented and mapped to the equivalent Linux capability. Only then could the permissions on the `/dev/zfs` be relaxed and the internal ZFS permission checks used. Even with this change some limitations remain. Under Linux only the root user is allowed to modify the namespace (unless it's a private namespace). This means the mount, mountpoint, canmount, unmount, and remount delegations cannot be supported with the existing code. It may be possible to add this functionality in the future. This functionality was validated with the cli_user and delegation test cases from the ZFS Test Suite. These tests exhaustively verify each of the supported permissions which can be delegated and ensures only an authorized user can perform it. Two minor bug fixes were required for test-running.py. First, the Timer() object cannot be safely created in a `try:` block when there is an unconditional `finally` block which references it. Second, when running as a normal user also check for scripts using the both the .ksh and .sh suffixes. Finally, existing users who are simulating delegations by setting group permissions on the /dev/zfs device should revert that customization when updating to a version with this change. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #362 Closes #434 Closes #4100 Closes #4394 Closes #4410 Closes #4487
2016-06-07 19:16:52 +03:00
extern int secpolicy_zfs(const cred_t *cr);
2008-12-11 22:14:49 +03:00
extern zoneid_t getzoneid(void);
/* SID stuff */
typedef struct ksiddomain {
uint_t kd_ref;
uint_t kd_len;
char *kd_name;
} ksiddomain_t;
ksiddomain_t *ksid_lookupdomain(const char *);
void ksiddomain_rele(ksiddomain_t *);
2009-07-03 02:44:48 +04:00
#define DDI_SLEEP KM_SLEEP
#define ddi_log_sysevent(_a, _b, _c, _d, _e, _f, _g) \
sysevent_post_event(_c, _d, _b, "libzpool", _e, _f)
Add zstd support to zfs This PR adds two new compression types, based on ZStandard: - zstd: A basic ZStandard compression algorithm Available compression. Levels for zstd are zstd-1 through zstd-19, where the compression increases with every level, but speed decreases. - zstd-fast: A faster version of the ZStandard compression algorithm zstd-fast is basically a "negative" level of zstd. The compression decreases with every level, but speed increases. Available compression levels for zstd-fast: - zstd-fast-1 through zstd-fast-10 - zstd-fast-20 through zstd-fast-100 (in increments of 10) - zstd-fast-500 and zstd-fast-1000 For more information check the man page. Implementation details: Rather than treat each level of zstd as a different algorithm (as was done historically with gzip), the block pointer `enum zio_compress` value is simply zstd for all levels, including zstd-fast, since they all use the same decompression function. The compress= property (a 64bit unsigned integer) uses the lower 7 bits to store the compression algorithm (matching the number of bits used in a block pointer, as the 8th bit was borrowed for embedded block pointers). The upper bits are used to store the compression level. It is necessary to be able to determine what compression level was used when later reading a block back, so the concept used in LZ4, where the first 32bits of the on-disk value are the size of the compressed data (since the allocation is rounded up to the nearest ashift), was extended, and we store the version of ZSTD and the level as well as the compressed size. This value is returned when decompressing a block, so that if the block needs to be recompressed (L2ARC, nop-write, etc), that the same parameters will be used to result in the matching checksum. All of the internal ZFS code ( `arc_buf_hdr_t`, `objset_t`, `zio_prop_t`, etc.) uses the separated _compress and _complevel variables. Only the properties ZAP contains the combined/bit-shifted value. The combined value is split when the compression_changed_cb() callback is called, and sets both objset members (os_compress and os_complevel). The userspace tools all use the combined/bit-shifted value. Additional notes: zdb can now also decode the ZSTD compression header (flag -Z) and inspect the size, version and compression level saved in that header. For each record, if it is ZSTD compressed, the parameters of the decoded compression header get printed. ZSTD is included with all current tests and new tests are added as-needed. Per-dataset feature flags now get activated when the property is set. If a compression algorithm requires a feature flag, zfs activates the feature when the property is set, rather than waiting for the first block to be born. This is currently only used by zstd but can be extended as needed. Portions-Sponsored-By: The FreeBSD Foundation Co-authored-by: Allan Jude <allanjude@freebsd.org> Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Co-authored-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl> Co-authored-by: Michael Niewöhner <foss@mniewoehner.de> Signed-off-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Allan Jude <allanjude@freebsd.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Signed-off-by: Kjeld Schouten-Lebbing <kjeld@schouten-lebbing.nl> Signed-off-by: Michael Niewöhner <foss@mniewoehner.de> Closes #6247 Closes #9024 Closes #10277 Closes #10278
2020-08-18 20:10:17 +03:00
/*
* Kernel modules
*/
#define __init
#define __exit
#endif /* _KERNEL || _STANDALONE */
#ifdef __cplusplus
};
#endif
2008-12-11 22:14:49 +03:00
#endif /* _SYS_ZFS_CONTEXT_H */