Most of this file was a pile of defines, apparently from Solaris that
controlled nothing in the source tree. A few things controlled the
definition of unused types or macros which I have removed.
Considerable further cleanup is possible including removal of
architectures FreeBSD never supported. This file should likely converge
with the Linux version to the extent possible.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes#14127
Check __riscv_xlen == 64 rather than _LP64 and define _LP64 if missing.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes#14128
The extern declaration is only for Linux, move this line
into the right #ifdef section.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Co-authored-by: Martin Matuska <mm@FreeBSD.org>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes#13934Closes#13936
Add needed cpu feature tests for powerpc architecture.
Overview:
zfs_altivec_available() - needed by RAID-Z
zfs_vsx_available() - needed by BLAKE3
zfs_isa207_available() - needed by SHA2
Part 1 - Userspace
- use getauxval() for Linux and elf_aux_info() for FreeBSD
- direct including <sys/auxv.h> fails with double definitions
- so we self define the needed functions and definitions
Part 2 - Kernel space FreeBSD
- use exported cpu_features of <powerpc/cpu.h>
Part 3 - Kernel space Linux
- use cpu_has_feature() function of <asm/cpufeature.h>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes#13725
This allows ZFS datasets to be delegated to a user/mount namespace
Within that namespace, only the delegated datasets are visible
Works very similarly to Zones/Jailes on other ZFS OSes
As a user:
```
$ unshare -Um
$ zfs list
no datasets available
$ echo $$
1234
```
As root:
```
# zfs list
NAME ZONED MOUNTPOINT
containers off /containers
containers/host off /containers/host
containers/host/child off /containers/host/child
containers/host/child/gchild off /containers/host/child/gchild
containers/unpriv on /unpriv
containers/unpriv/child on /unpriv/child
containers/unpriv/child/gchild on /unpriv/child/gchild
# zfs zone /proc/1234/ns/user containers/unpriv
```
Back to the user namespace:
```
$ zfs list
NAME USED AVAIL REFER MOUNTPOINT
containers 129M 47.8G 24K /containers
containers/unpriv 128M 47.8G 24K /unpriv
containers/unpriv/child 128M 47.8G 128M /unpriv/child
```
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Will Andrews <will.andrews@klarasystems.com>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com>
Co-authored-by: Allan Jude <allan@klarasystems.com>
Co-authored-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com>
Sponsored-by: Buddy <https://buddy.works>
Closes#12263
This commit adds BLAKE3 checksums to OpenZFS, it has similar
performance to Edon-R, but without the caveats around the latter.
Homepage of BLAKE3: https://github.com/BLAKE3-team/BLAKE3
Wikipedia: https://en.wikipedia.org/wiki/BLAKE_(hash_function)#BLAKE3
Short description of Wikipedia:
BLAKE3 is a cryptographic hash function based on Bao and BLAKE2,
created by Jack O'Connor, Jean-Philippe Aumasson, Samuel Neves, and
Zooko Wilcox-O'Hearn. It was announced on January 9, 2020, at Real
World Crypto. BLAKE3 is a single algorithm with many desirable
features (parallelism, XOF, KDF, PRF and MAC), in contrast to BLAKE
and BLAKE2, which are algorithm families with multiple variants.
BLAKE3 has a binary tree structure, so it supports a practically
unlimited degree of parallelism (both SIMD and multithreading) given
enough input. The official Rust and C implementations are
dual-licensed as public domain (CC0) and the Apache License.
Along with adding the BLAKE3 hash into the OpenZFS infrastructure a
new benchmarking file called chksum_bench was introduced. When read
it reports the speed of the available checksum functions.
On Linux: cat /proc/spl/kstat/zfs/chksum_bench
On FreeBSD: sysctl kstat.zfs.misc.chksum_bench
This is an example output of an i3-1005G1 test system with Debian 11:
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1196 1602 1761 1749 1762 1759 1751
skein-generic 546 591 608 615 619 612 616
sha256-generic 240 300 316 314 304 285 276
sha512-generic 353 441 467 476 472 467 426
blake3-generic 308 313 313 313 312 313 312
blake3-sse2 402 1289 1423 1446 1432 1458 1413
blake3-sse41 427 1470 1625 1704 1679 1607 1629
blake3-avx2 428 1920 3095 3343 3356 3318 3204
blake3-avx512 473 2687 4905 5836 5844 5643 5374
Output on Debian 5.10.0-10-amd64 system: (Ryzen 7 5800X)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1840 2458 2665 2719 2711 2723 2693
skein-generic 870 966 996 992 1003 1005 1009
sha256-generic 415 442 453 455 457 457 457
sha512-generic 608 690 711 718 719 720 721
blake3-generic 301 313 311 309 309 310 310
blake3-sse2 343 1865 2124 2188 2180 2181 2186
blake3-sse41 364 2091 2396 2509 2463 2482 2488
blake3-avx2 365 2590 4399 4971 4915 4802 4764
Output on Debian 5.10.0-9-powerpc64le system: (POWER 9)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 1213 1703 1889 1918 1957 1902 1907
skein-generic 434 492 520 522 511 525 525
sha256-generic 167 183 187 188 188 187 188
sha512-generic 186 216 222 221 225 224 224
blake3-generic 153 152 154 153 151 153 153
blake3-sse2 391 1170 1366 1406 1428 1426 1414
blake3-sse41 352 1049 1212 1174 1262 1258 1259
Output on Debian 5.10.0-11-arm64 system: (Pi400)
implementation 1k 4k 16k 64k 256k 1m 4m
edonr-generic 487 603 629 639 643 641 641
skein-generic 271 299 303 308 309 309 307
sha256-generic 117 127 128 130 130 129 130
sha512-generic 145 165 170 172 173 174 175
blake3-generic 81 29 71 89 89 89 89
blake3-sse2 112 323 368 379 380 371 374
blake3-sse41 101 315 357 368 369 364 360
Structurally, the new code is mainly split into these parts:
- 1x cross platform generic c variant: blake3_generic.c
- 4x assembly for X86-64 (SSE2, SSE4.1, AVX2, AVX512)
- 2x assembly for ARMv8 (NEON converted from SSE2)
- 2x assembly for PPC64-LE (POWER8 converted from SSE2)
- one file for switching between the implementations
Note the PPC64 assembly requires the VSX instruction set and the
kfpu_begin() / kfpu_end() calls on PowerPC were updated accordingly.
Reviewed-by: Felix Dörre <felix@dogcraft.de>
Reviewed-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Co-authored-by: Rich Ercolani <rincebrain@gmail.com>
Closes#10058Closes#12918
bcopy() has a confusing argument order and is actually a move, not a
copy; they're all deprecated since POSIX.1-2001 and removed in -2008,
and we shim them out to mem*() on Linux anyway
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Closes#12996
aarch64 is a different architecture than arm. Some
compilers might choke when both __arm__ and __aarch64__
are defined.
This change separates the checks for arm and for
aarch64 in the isa_defs.h header files.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Windel Bouwman <windel@windel.nl>
Closes#10335Closes#13151
A function that returns with no value is a different thing from a
function that doesn't return at all. Those are two orthogonal
concepts, commonly confused.
pthread_create(3) expects a pointer to a start routine that has a
very precise prototype:
void *(*start_routine)(void *);
However, other thread functions, such as kernel ones, expect:
void (*start_routine)(void *);
Providing a different one is incorrect, and has only been working
because the ABIs happen to produce a compatible function.
We should use '_Noreturn void', since it's the natural type, and
then provide a '_Noreturn void *' wrapper for pthread functions.
For consistency, replace most cases of __NORETURN or
__attribute__((noreturn)) by _Noreturn. _Noreturn is understood
by -std=gnu89, so it should be safe to use everywhere.
Ref: https://github.com/openzfs/zfs/pull/13110#discussion_r808450136
Ref: https://software.codidact.com/posts/285972
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Ahelenia Ziemiańska <nabijaczleweli@nabijaczleweli.xyz>
Signed-off-by: Alejandro Colomar <alx.manpages@gmail.com>
Closes#13120
Unfortunately macOS has obj-C keyword "fallthrough" in the OS headers.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes#13097
Do not redefine the fallthrough macro when building with libcpp.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Martin Matuska <mm@FreeBSD.org>
Closes#12880
As of the Linux 5.9 kernel a fallthrough macro has been added which
should be used to anotate all intentional fallthrough paths. Once
all of the kernel code paths have been updated to use fallthrough
the -Wimplicit-fallthrough option will because the default. To
avoid warnings in the OpenZFS code base when this happens apply
the fallthrough macro.
Additional reading: https://lwn.net/Articles/794944/
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#12441
This mostly reverts "3537 want pool io kstats" commit of 8 years ago.
From one side this code using pool-wide locks became pretty bad for
performance, creating significant lock contention in I/O pipeline.
From another, there are more efficient ways now to obtain detailed
statistics, while this statistics is illumos-specific and much less
usable on Linux and FreeBSD, reported only via procfs/sysctls.
This commit does not remove KSTAT_TYPE_IO implementation, that may
be removed later together with already unused KSTAT_TYPE_INTR and
KSTAT_TYPE_TIMER.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#12212
wmsum counters are a reduced version of aggsum counters, optimized for
write-mostly scenarios. They do not provide optimized read functions,
but instead allow much cheaper add function. The primary usage is
infrequently read statistic counters, not requiring exact precision.
The Linux implementation is directly mapped into percpu_counter KPI.
The FreeBSD implementation is directly mapped into counter(9) KPI.
In user-space due to lack of better implementation mapped to aggsum.
Unfortunately neither Linux percpu_counter nor FreeBSD counter(9)
provide sufficient functionality to completelly replace aggsum, so
it still remains to be used for several hot counters.
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored-By: iXsystems, Inc.
Closes#12114
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#11775
In FreeBSD the struct uio was just a typedef to uio_t. In order to
extend this struct, outside of the definition for the struct uio, the
struct uio has been embedded inside of a uio_t struct.
Also renamed all the uio_* interfaces to be zfs_uio_* to make it clear
this is a ZFS interface.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Atkinson <batkinson@lanl.gov>
Closes#11438
As of the 5.10 kernel the generic splice compatibility code has been
removed. All filesystems are now responsible for registering a
->splice_read and ->splice_write callback to support this operation.
The good news is the VFS provided generic_file_splice_read() and
iter_file_splice_write() callbacks can be used provided the ->iter_read
and ->iter_write callback support pipes. However, this is currently
not the case and only iovecs and bvecs (not pipes) are ever attached
to the uio structure.
This commit changes that by allowing full iov_iter structures to be
attached to uios. Ever since the 4.9 kernel the iov_iter structure
has supported iovecs, kvecs, bvevs, and pipes so it's desirable to
pass the entire thing when possible. In conjunction with this the
uio helper functions (i.e uiomove(), uiocopy(), etc) have been
updated to understand the new UIO_ITER type.
Note that using the kernel provided uio_iter interfaces allowed the
existing Linux specific uio handling code to be simplified. When
there's no longer a need to support kernel's older than 4.9, then
it will be possible to remove the iovec and bvec members from the
uio structure and always use a uio_iter. Until then we need to
maintain all of the existing types for older kernels.
Some additional refactoring and cleanup was included in this change:
- Added checks to configure to detect available iov_iter interfaces.
Some are available all the way back to the 3.10 kernel and are used
when available. In particular, uio_prefaultpages() now always uses
iov_iter_fault_in_readable() which is available for all supported
kernels.
- The unused UIO_USERISPACE type has been removed. It is no longer
needed now that the uio_seg enum is platform specific.
- Moved zfs_uio.c from the zcommon.ko module to the Linux specific
platform code for the zfs.ko module. This gets it out of libzfs
where it was never needed and keeps this Linux specific code out
of the common sources.
- Removed unnecessary O_APPEND handling from zfs_iter_write(), this
is redundant and O_APPEND is already handled in zfs_write();
Reviewed-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#11351
The original xuio zero copy functionality has always been unused
on Linux and FreeBSD. Remove this disabled code to avoid any
confusion and improve readability.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#11124
The zfs_fsync, zfs_read, and zfs_write function are almost identical
between Linux and FreeBSD. With a little refactoring they can be
moved to the common code which is what is done by this commit.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#11078
The acltype property is currently hidden on FreeBSD and does not
reflect the NFSv4 style ZFS ACLs used on the platform. This makes it
difficult to observe that a pool imported from FreeBSD on Linux has a
different type of ACL that is being ignored, and vice versa.
Add an nfsv4 acltype and expose the property on FreeBSD.
Make the default acltype nfsv4 on FreeBSD.
Setting acltype to an unhanded style is treated the same as setting
it to off. The ACLs will not be removed, but they will be ignored.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes#10520
A collection of header changes to enable FreeBSD to build
with vendored OpenZFS.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10635
FreeBSD defines _BIG_ENDIAN BIG_ENDIAN _LITTLE_ENDIAN
LITTLE_ENDIAN on every architecture. Trying to do
cross builds whilst hiding this from ZFS has proven
extremely cumbersome.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#10621
There's no need to specify the srcdir explicitly in _HEADERS and
EXTRA_DIST.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arvind Sankar <nivedita@alum.mit.edu>
Closes#10493
The macOS uio struct is opaque and the API must be used, this
makes the smallest changes to the code for all platforms.
Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes#10412
The horrible effects of human slavery continue to impact society. The
casual use of the term "slave" in computer software is an unnecessary
reference to a painful human experience.
This commit removes all possible references to the term "slave".
Implementation notes:
The zpool.d/slaves script is renamed to dm-deps, which uses the same
terminology as `dmsetup deps`.
References to the `/sys/class/block/$dev/slaves` directory remain. This
directory name is determined by the Linux kernel. Although
`dmsetup deps` provides the same information, it unfortunately requires
elevated privileges, whereas the `/sys/...` directory is world-readable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10435
This aids in debugging, so that we can use the same infrastructure to
walk zfs's list_t in the kernel module and in the userland libraries
(e.g. when debugging ztest).
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#10236
As part of the Linux kernel's y2038 changes the time_t type has been
fully retired. Callers are now required to use the time64_t type.
Rather than move to the new type, I've removed the few remaining
places where a time_t is used in the kernel code. They've been
replaced with a uint64_t which is already how ZFS internally
handled these values.
Going forward we should work towards updating the remaining user
space time_t consumers to the 64-bit interfaces.
Reviewed-by: Matthew Macy <mmacy@freebsd.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#10052Closes#10064
Currently SIMD accelerated AES-GCM performance is limited by two
factors:
a. The need to disable preemption and interrupts and save the FPU
state before using it and to do the reverse when done. Due to the
way the code is organized (see (b) below) we have to pay this price
twice for each 16 byte GCM block processed.
b. Most processing is done in C, operating on single GCM blocks.
The use of SIMD instructions is limited to the AES encryption of the
counter block (AES-NI) and the Galois multiplication (PCLMULQDQ).
This leads to the FPU not being fully utilized for crypto
operations.
To solve (a) we do crypto processing in larger chunks while owning
the FPU. An `icp_gcm_avx_chunk_size` module parameter was introduced
to make this chunk size tweakable. It defaults to 32 KiB. This step
alone roughly doubles performance. (b) is tackled by porting and
using the highly optimized openssl AES-GCM assembler routines, which
do all the processing (CTR, AES, GMULT) in a single routine. Both
steps together result in up to 32x reduction of the time spend in
the en/decryption routines, leading up to approximately 12x
throughput increase for large (128 KiB) blocks.
Lastly, this commit changes the default encryption algorithm from
AES-CCM to AES-GCM when setting the `encryption=on` property.
Reviewed-By: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-By: Jason King <jason.king@joyent.com>
Reviewed-By: Tom Caputi <tcaputi@datto.com>
Reviewed-By: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes#9749
Implements the RAID-Z function using AltiVec SIMD.
This is basically the NEON code translated to AltiVec.
Note that the 'fletcher' algorithm requires 64-bits
operations, and the initial implementations of AltiVec
(PPC74xx a.k.a. G4, PPC970 a.k.a. G5) only has up to
32-bits operations, so no 'fletcher'.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.dolbeau@european-processor-initiative.eu>
Closes#9539
Remove the ASSERTV macro and handle suppressing unused
compiler warnings for variables only in ASSERTs using the
__attribute__((unused)) compiler annotation. The annotation
is understood by both gcc and clang.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9671
- on Linux move Linux specific headers to zfs_context_os.h
- on FreeBSD move FreeBSD specific definitions to zfs_context_os.h
- remove duplicate tsd_ definitions
- remove unused AT_TYPE
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9668
This adds basic support for RISC-V, specifically RV64G.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.dolbeau@european-processor-initiative.eu>
Closes#9540
This change modifies some of the infrastructure for enabling the use of
the DTRACE_PROBE* macros, such that we can use tehm in the "spl" module.
Currently, when the DTRACE_PROBE* macros are used, they get expanded to
create new functions, and these dynamically generated functions become
part of the "zfs" module.
Since the "spl" module does not depend on the "zfs" module, the use of
DTRACE_PROBE* in the "spl" module would result in undefined symbols
being used in the "spl" module. Specifically, DTRACE_PROBE* would turn
into a function call, and the function being called would be a symbol
only contained in the "zfs" module; which results in a linker and/or
runtime error.
Thus, this change adds the necessary logic to the "spl" module, to
mirror the tracing functionality available to the "zfs" module. After
this change, we'll have a "trace_zfs.h" header file which defines the
probes available only to the "zfs" module, and a "trace_spl.h" header
file which defines the probes available only to the "spl" module.
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Closes#9525
FreeBSD has its own sha routines that the port uses.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9530
The sys/signal.h header doesn't exist on FreeBSD, nor is
it needed on Linux.
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9510
EBADE, EBADR, and ENOANO do not exist on FreeBSD
The libspl errno.h is similarly platform dependent.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9498
It's mostly a noop on ZoL and it conflicts with platforms that
support dtrace. Remove this header to resolve the conflict.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9497
Contrary to initial testing we cannot rely on these kernels to
invalidate the per-cpu FPU state and restore the FPU registers.
Nor can we guarantee that the kernel won't modify the FPU state
which we saved in the task struck.
Therefore, the kfpu_begin() and kfpu_end() functions have been
updated to save and restore the FPU state using our own dedicated
per-cpu FPU state variables.
This has the additional advantage of allowing us to use the FPU
again in user threads. So we remove the code which was added to
use task queues to ensure some functions ran in kernel threads.
Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #9346Closes#9403
Factor Linux specific pieces out of libspl.
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9336
Move Linux specific tracing headers and source to platform directories
and update the build system.
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Closes#9290
Move platform specific Linux headers under include/os/linux/.
Update the build system accordingly to detect the platform.
This lays some of the initial groundwork to supporting building
for other platforms.
As part of this change it was necessary to create both a user
and kernel space sys/simd.h header which can be included in
either context. No functional change, the source has been
refactored and the relevant #include's updated.
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Matthew Macy <mmacy@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#9198
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net>
Closes#9237
ZFS supports O_RSYNC for read operations and when specified will ensure
the same level of data integrity that O_DSYNC and O_SYNC provides for
writes. O_RSYNC by itself has no effect so it must be combined with
either O_DSYNC or O_SYNC. However, many platforms don't support O_RSYNC
and have mapped O_SYNC to mean O_RSYNC within ZFS. This is incorrect
and causes unnecessary calls to zil_commit. Only platforms which
support O_RSYNC should implement the zil_commit functionality in the
read code path.
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes#8523
Update the SA_COPY_DATA macro to check if architecture supports
efficient unaligned memory accesses at compile time. Otherwise
fallback to using the sa_copy_data() function.
The kernel provided CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is
used to determine availability in kernel space. In user space
the x86_64, x86, powerpc, and sometimes arm architectures will
define the HAVE_EFFICIENT_UNALIGNED_ACCESS macro.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7642Closes#7684
Commit torvalds/linux@95582b0 changes the inode i_atime, i_mtime,
and i_ctime members form timespec's to timespec64's to make them
2038 safe. As part of this change the current_time() function was
also updated to return the timespec64 type.
Resolve this issue by introducing a new inode_timespec_t type which
is defined to match the timespec type used by the inode. It should
be used when working with inode timestamps to ensure matching types.
The timestruc_t type under Illumos was used in a similar fashion but
was specified to always be a timespec_t. Rather than incorrectly
define this type all timespec_t types have been replaced by the new
inode_timespec_t type.
Finally, the kernel and user space 'sys/time.h' headers were aligned
with each other. They define as appropriate for the context several
constants as macros and include static inline implementation of
gethrestime(), gethrestime_sec(), and gethrtime().
Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#7643
1. Add a proc entry to display the pool's state:
$ cat /proc/spl/kstat/zfs/tank/state
ONLINE
This is done without using the spa config locks, so it will
never hang.
2. Fix 'zpool status' and 'zpool list -o health' output to print
"SUSPENDED" instead of "ONLINE" for suspended pools.
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#7331Closes#7563
Minimal changes required to integrate the SPL sources in to the
ZFS repository build infrastructure and packaging.
Build system and packaging:
* Renamed SPL_* autoconf m4 macros to ZFS_*.
* Removed redundant SPL_* autoconf m4 macros.
* Updated the RPM spec files to remove SPL package dependency.
* The zfs package obsoletes the spl package, and the zfs-kmod
package obsoletes the spl-kmod package.
* The zfs-kmod-devel* packages were updated to add compatibility
symlinks under /usr/src/spl-x.y.z until all dependent packages
can be updated. They will be removed in a future release.
* Updated copy-builtin script for in-kernel builds.
* Updated DKMS package to include the spl.ko.
* Updated stale AUTHORS file to include all contributors.
* Updated stale COPYRIGHT and included the SPL as an exception.
* Renamed README.markdown to README.md
* Renamed OPENSOLARIS.LICENSE to LICENSE.
* Renamed DISCLAIMER to NOTICE.
Required code changes:
* Removed redundant HAVE_SPL macro.
* Removed _BOOT from nvpairs since it doesn't apply for Linux.
* Initial header cleanup (removal of empty headers, refactoring).
* Remove SPL repository clone/build from zimport.sh.
* Use of DEFINE_RATELIMIT_STATE and DEFINE_SPINLOCK removed due
to build issues when forcing C99 compilation.
* Replaced legacy ACCESS_ONCE with READ_ONCE.
* Include needed headers for `current` and `EXPORT_SYMBOL`.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
TEST_ZIMPORT_SKIP="yes"
Closes#7556
The functionality provided by this header is not required by any
of the ZFS user space code. Minimal functionality was provided
in commit c28a677 which added include/sys/frame.h.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#6960Closes#6972
Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Prakash Surya <prakash.surya@delphix.com>
Problem
=======
The current implementation of zil_commit() can introduce significant
latency, beyond what is inherent due to the latency of the underlying
storage. The additional latency comes from two main problems:
1. When there's outstanding ZIL blocks being written (i.e. there's
already a "writer thread" in progress), then any new calls to
zil_commit() will block waiting for the currently oustanding ZIL
blocks to complete. The blocks written for each "writer thread" is
coined a "batch", and there can only ever be a single "batch" being
written at a time. When a batch is being written, any new ZIL
transactions will have to wait for the next batch to be written,
which won't occur until the current batch finishes.
As a result, the underlying storage may not be used as efficiently
as possible. While "new" threads enter zil_commit() and are blocked
waiting for the next batch, it's possible that the underlying
storage isn't fully utilized by the current batch of ZIL blocks. In
that case, it'd be better to allow these new threads to generate
(and issue) a new ZIL block, such that it could be serviced by the
underlying storage concurrently with the other ZIL blocks that are
being serviced.
2. Any call to zil_commit() must wait for all ZIL blocks in its "batch"
to complete, prior to zil_commit() returning. The size of any given
batch is proportional to the number of ZIL transaction in the queue
at the time that the batch starts processing the queue; which
doesn't occur until the previous batch completes. Thus, if there's a
lot of transactions in the queue, the batch could be composed of
many ZIL blocks, and each call to zil_commit() will have to wait for
all of these writes to complete (even if the thread calling
zil_commit() only cared about one of the transactions in the batch).
To further complicate the situation, these two issues result in the
following side effect:
3. If a given batch takes longer to complete than normal, this results
in larger batch sizes, which then take longer to complete and
further drive up the latency of zil_commit(). This can occur for a
number of reasons, including (but not limited to): transient changes
in the workload, and storage latency irregularites.
Solution
========
The solution attempted by this change has the following goals:
1. no on-disk changes; maintain current on-disk format.
2. modify the "batch size" to be equal to the "ZIL block size".
3. allow new batches to be generated and issued to disk, while there's
already batches being serviced by the disk.
4. allow zil_commit() to wait for as few ZIL blocks as possible.
5. use as few ZIL blocks as possible, for the same amount of ZIL
transactions, without introducing significant latency to any
individual ZIL transaction. i.e. use fewer, but larger, ZIL blocks.
In theory, with these goals met, the new allgorithm will allow the
following improvements:
1. new ZIL blocks can be generated and issued, while there's already
oustanding ZIL blocks being serviced by the storage.
2. the latency of zil_commit() should be proportional to the underlying
storage latency, rather than the incoming synchronous workload.
Porting Notes
=============
Due to the changes made in commit 119a394ab0, the lifetime of an itx
structure differs than in OpenZFS. Specifically, the itx structure is
kept around until the data associated with the itx is considered to be
safe on disk; this is so that the itx's callback can be called after the
data is committed to stable storage. Since OpenZFS doesn't have this itx
callback mechanism, it's able to destroy the itx structure immediately
after the itx is committed to an lwb (before the lwb is written to
disk).
To support this difference, and to ensure the itx's callbacks can still
be called after the itx's data is on disk, a few changes had to be made:
* A list of itxs was added to the lwb structure. This list contains
all of the itxs that have been committed to the lwb, such that the
callbacks for these itxs can be called from zil_lwb_flush_vdevs_done(),
after the data for the itxs is committed to disk.
* A list of itxs was added on the stack of the zil_process_commit_list()
function; the "nolwb_itxs" list. In some circumstances, an itx may
not be committed to an lwb (e.g. if allocating the "next" ZIL block
on disk fails), so this list is used to keep track of which itxs
fall into this state, such that their callbacks can be called after
the ZIL's writer pipeline is "stalled".
* The logic to actually call the itx's callback was moved into the
zil_itx_destroy() function. Since all consumers of zil_itx_destroy()
were effectively performing the same logic (i.e. if callback is
non-null, call the callback), it seemed like useful code cleanup to
consolidate this logic into a single function.
Additionally, the existing Linux tracepoint infrastructure dealing with
the ZIL's probes and structures had to be updated to reflect these code
changes. Specifically:
* The "zil__cw1" and "zil__cw2" probes were removed, so they had to be
removed from "trace_zil.h" as well.
* Some of the zilog structure's fields were removed, which affected
the tracepoint definitions of the structure.
* New tracepoints had to be added for the following 3 new probes:
* zil__process__commit__itx
* zil__process__normal__itx
* zil__commit__io__error
OpenZFS-issue: https://www.illumos.org/issues/8585
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5d95a3aCloses#6566
This change incorporates three major pieces:
The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.
The second piece of this patch provides the ability to
encrypt, decyrpt, and authenticate protected datasets.
Each object set maintains a Merkel tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.
The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes#494Closes#5769
The pool name can be 256 chars long. Today, in /proc/spl/kstat/zfs/
the name is limited to < 32 characters. This change is to allows
bigger pool names.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: gaurkuma <gauravk.18@gmail.com>
Closes#6481
OpenZFS provides a library called tpool which implements thread
pools for user space applications. Porting this library means
the zpool utility no longer needs to borrow the kernel mutex and
taskq interfaces from libzpool. This code was updated to use
the tpool library which behaves in a very similar fashion.
Porting libtpool was relatively straight forward and minimal
modifications were needed. The core changes were:
* Fully convert the library to use pthreads.
* Updated signal handling.
* lmalloc/lfree converted to calloc/free
* Implemented portable pthread_attr_clone() function.
Finally, update the build system such that libzpool.so is no
longer linked in to zfs(8), zpool(8), etc. All that is required
is libzfs to which the zcommon soures were added (which is the way
it always should have been). Removing the libzpool dependency
resulted in several build issues which needed to be resolved.
* Moved zfeature support to module/zcommon/zfeature_common.c
* Moved ratelimiting to to module/zfs/zfs_ratelimit.c
* Moved get_system_hostid() to lib/libspl/gethostid.c
* Removed use of cmn_err() in zcommon source
* Removed dprintf_setup() call from zpool_main.c and zfs_main.c
* Removed highbit() and lowbit()
* Removed unnecessary library dependencies from Makefiles
* Removed fletcher-4 kstat in user space
* Added sha2 support explicitly to libzfs
* Added highbit64() and lowbit64() to zpool_util.c
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#6442
musl's sys/errno.h is literally:
/#warning redirecting incorrect #include <sys/errno.h> to <errno.h>
/#include <errno.h>
It does the same for sys/{poll,signal}.h. This is rather noisy when
building ZoL against musl. musl is also correct in pointing out that the
correct headers are outside of sys/ according to the single unix
specification:
http://pubs.opengroup.org/onlinepubs/7908799/xsh/errno.h.htmlhttp://pubs.opengroup.org/onlinepubs/7908799/xsh/poll.h.htmlhttp://pubs.opengroup.org/onlinepubs/7908799/xsh/signal.h.html
Lets implement our own sys/* versions of these headers to redirect to
the proper userland ones when building in userspace. That will silence
the warning.
There are also some instances where we include incorrectly from sys/ or
from outside of sys/ in userspace only code. In these instances, lets
just fix the includes directly.
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#5993
In glibc 2.5, makedev(), major(), and minor() are defined in
sys/sysmacros.h. They are also defined in types.h for backward
compatability, but using these definitions triggers a compile warning.
This breaks the ZFS build, as it builds with -Werror.
autoconf email threads indicate these macros may be defined in
sys/mkdev.h in some cases.
This commit adds configure checks to detect where makedev() is defined:
sys/sysmacros.h
sys/mkdev.h
It assumes major() and minor() are defined in the same place.
The libspl types.h then includes
sys/sysmacros.h (preferred) or
sys/mkdev.h (2nd choice)
if one of those defines makedev().
This is done before including the system types.h.
An alternative would be to remove uses of major, minor, and makedev,
instead comparing the st_dev returned from stat64. These configure
checks would then be unnecessary.
This change revealed that __NORETURN was being defined unnecessarily in
libspl/include/sys/sysmacros.h. That definition is removed.
The files in which __NORETURN are used all include types.h, and so all
will get the definition provided by feature_tests.h
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes#5945
libspl tries to detect sparc64 by checking whether __sparc64__
is defined. Unfortunately, this assumption is not correct as
sparc64 does not define __sparc64__ but it defines __sparc__
and __arch64__ instead. This leads to sparc64 being detected
as 32-Bit sparc and the build fails because both _ILP32 and
_LP64 are defined in this case.
To fix the problem, remove the checks for __sparc64__ and
just check __arch64__ if a sparc host was previously
detected with __sparc__.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Closes#5913
Unlike other architectures which sanitize the LDFLAGS from the
environment in arch/<arch>/Makefile. The powerpc Makefile
allows LDFLAGS to be passed through resulting in the following
build failure.
/usr/bin/ld: unrecognized option '-Wl,-z,relro'
LDFLAGS is set in /usr/lib/rpm/redhat/macros by default. Clear
the environment variable when building kmods for powerpc.
Additionally, now that ppc64le exists it's not longer safe to
assume a powerpc system is big endian. Rely on the endianness
provided by the compiler.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#5856
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Haakan T Johansson <f96hajo@chalmers.se>
Closes#5547Closes#5543
* Convert ABD to use the Linux Kernel scatterlist implementation
instead of the hand rolled one from illumos.
* Scatter ABDs are preferentially populated with higher order
compound pages from a single zone. Allocation size is
progressively decreased until it can be satisfied without
performing reclaim or compaction.
* An alternate page allocator is provided for kernels older
than 3.6 and for CONFIG_HIGHMEM systems. This allocator
is designed as a fallback for maximum compatibility.
* Extended abdstats to provide visibility in the the allocator.
* Add cached value for PAGESIZE in userspace.
Contributions-by:
Chunwei Chen <david.chen@osnexus.com>
Gvozden Neskovic <neskovic@gmail.com>
Jinshan Xiong <jinshan.xiong@intel.com>
Isaac Huang <he.huang@intel.com>
David Quigley <david.quigley@intel.com>
Brian Behlendorf <behlendorf1@llnl.gov>
Undefined operation is reported by running ztest (or zloop) compiled with GCC
UndefinedBehaviorSanitizer. Error only happens on top level of dnode indirection
with large enough offset values. Logically, left shift operation would work,
but bit shift semantics in C, and limitation of uint64_t, do not produce desired
result.
Issue #5059, #4883
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Fix misleading error message:
"The /dev/zfs device is missing and must be created.", if /etc/mtab is missing.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Eric Desrochers <eric.desrochers@canonical.com>
Closes#4680Closes#5029
Authored by: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Signed-off-by: Don Brady <don.brady@intel.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/5997
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1437283
Porting Notes:
In addition to the OpenZFS changes this patch realigns the events
with those found in OpenZFS.
Events which would be logged as sysevents on illumos have been
been mapped to the 'sysevent' class for Linux. In addition, several
subclass names have been changed to match what is used in OpenZFS.
In all cases this means a '.' was changed to an '_' in the subclass.
The scripts provided by ZoL have been updated, however users which
provide scripts for any of the following events will need to rename
them based on the new subclass names.
ereport.fs.zfs.config.sync sysevent.fs.zfs.config_sync
ereport.fs.zfs.zpool.destroy sysevent.fs.zfs.pool_destroy
ereport.fs.zfs.zpool.reguid sysevent.fs.zfs.pool_reguid
ereport.fs.zfs.vdev.remove sysevent.fs.zfs.vdev_remove
ereport.fs.zfs.vdev.clear sysevent.fs.zfs.vdev_clear
ereport.fs.zfs.vdev.check sysevent.fs.zfs.vdev_check
ereport.fs.zfs.vdev.spare sysevent.fs.zfs.vdev_spare
ereport.fs.zfs.vdev.autoexpand sysevent.fs.zfs.vdev_autoexpand
ereport.fs.zfs.resilver.start sysevent.fs.zfs.resilver_start
ereport.fs.zfs.resilver.finish sysevent.fs.zfs.resilver_finish
ereport.fs.zfs.scrub.start sysevent.fs.zfs.scrub_start
ereport.fs.zfs.scrub.finish sysevent.fs.zfs.scrub_finish
ereport.fs.zfs.bootfs.vdev.attach sysevent.fs.zfs.bootfs_vdev_attach
A port of the Illumos Crypto Framework to a Linux kernel module (found
in module/icp). This is needed to do the actual encryption work. We cannot
use the Linux kernel's built in crypto api because it is only exported to
GPL-licensed modules. Having the ICP also means the crypto code can run on
any of the other kernels under OpenZFS. I ended up porting over most of the
internals of the framework, which means that porting over other API calls (if
we need them) should be fairly easy. Specifically, I have ported over the API
functions related to encryption, digests, macs, and crypto templates. The ICP
is able to use assembly-accelerated encryption on amd64 machines and AES-NI
instructions on Intel chips that support it. There are place-holder
directories for similar assembly optimizations for other architectures
(although they have not been written).
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4329
GCC for MIPS only defines _LP64 when 64bit,
while no _ILP32 defined when 32bit.
Signed-off-by: YunQiang Su <syq@debian.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4712
hrtime_t timestruc_t and timespec_t should have originally been
included in sys/time.h so lets move them.
longlong_t is not defined by any standard so change it to long long
Signed-off-by: Carlo Landmeter <clandmeter@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4459
Signed-off-by: Dimitri John Ledkov <xnox@ubuntu.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4425
sys/param.h depends on types defined in sys/types.h
(hrtime_t & timestruc_t).
Signed-off-by: Gordan Bobic <gordan@redsleeve.org>
Signed-off-by: Christopher J. Morrone <morrone2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4420
The pfn_t typedef was inherited from Illumos but never directly
used by any libspl consumers. This doesn't cause any issues in
user space but for consistency with the kernel build it has been
removed. See torvalds/linux/commit/34c0fd54.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Issue #4228
For some arm, powerpc, and sparc platforms it was possible that
neither _ILP32 of _LP64 would be defined. Update the isa_defs.h
header to explicitly set these macros and generate a compile error
in the case neither are defined.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes#4048
The original P2ROUNDUP and P2ROUNDUP_TYPED macros contain -x which
triggers PaX's integer overflow detection for unsigned integers.
Replace the macros with an equivalent version that does not trigger
the overflow.
Axioms:
A. (-(x)) === (~((x) - 1)) === (~(x) + 1) under two's complement.
B. ~(x & y) === ((~(x)) | (~(y))) under De Morgan's law.
C. ~(~x) === x under the law of excluded middle.
Proof:
0. (-(-(x) & -(align))) original
1. (~(-(x) & -(align)) + 1) by A
2. (((~(-(x))) | (~(-(align)))) + 1) by B
3. (((~(~((x) - 1))) | (~(~((align) - 1)))) + 1) by A
4. (((((x) - 1)) | (((align) - 1))) + 1) by C
Q.E.D.
Signed-off-by: Jason Zaman <jason@perfinion.com>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3949
Add the required kernel side infrastructure to parse arbitrary
mount options. This enables us to support temporary mount
options in largely the same way it is handled on other platforms.
See the 'Temporary Mount Point Properties' section of zfs(8)
for complete details.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#985Closes#3351
3897 zfs filesystem and snapshot limits
Author: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>
References:
https://www.illumos.org/issues/3897https://github.com/illumos/illumos-gate/commit/a2afb61
Porting Notes:
dsl_dataset_snapshot_check(): reduce stack usage using kmem_alloc().
Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This patch leverages Linux tracepoints from within the ZFS on Linux
code base. It also refactors the debug code to bring it back in sync
with Illumos.
The information exported via tracepoints can be used for a variety of
reasons (e.g. debugging, tuning, general exploration/understanding,
etc). It is advantageous to use Linux tracepoints as the mechanism to
export this kind of information (as opposed to something else) for a
number of reasons:
* A number of external tools can make use of our tracepoints
"automatically" (e.g. perf, systemtap)
* Tracepoints are designed to be extremely cheap when disabled
* It's one of the "accepted" ways to export this kind of
information; many other kernel subsystems use tracepoints too.
Unfortunately, though, there are a few caveats as well:
* Linux tracepoints appear to only be available to GPL licensed
modules due to the way certain kernel functions are exported.
Thus, to actually make use of the tracepoints introduced by this
patch, one might have to patch and re-compile the kernel;
exporting the necessary functions to non-GPL modules.
* Prior to upstream kernel version v3.14-rc6-30-g66cc69e, Linux
tracepoints are not available for unsigned kernel modules
(tracepoints will get disabled due to the module's 'F' taint).
Thus, one either has to sign the zfs kernel module prior to
loading it, or use a kernel versioned v3.14-rc6-30-g66cc69e or
newer.
Assuming the above two requirements are satisfied, lets look at an
example of how this patch can be used and what information it exposes
(all commands run as 'root'):
# list all zfs tracepoints available
$ ls /sys/kernel/debug/tracing/events/zfs
enable filter zfs_arc__delete
zfs_arc__evict zfs_arc__hit zfs_arc__miss
zfs_l2arc__evict zfs_l2arc__hit zfs_l2arc__iodone
zfs_l2arc__miss zfs_l2arc__read zfs_l2arc__write
zfs_new_state__mfu zfs_new_state__mru
# enable all zfs tracepoints, clear the tracepoint ring buffer
$ echo 1 > /sys/kernel/debug/tracing/events/zfs/enable
$ echo 0 > /sys/kernel/debug/tracing/trace
# import zpool called 'tank', inspect tracepoint data (each line was
# truncated, they're too long for a commit message otherwise)
$ zpool import tank
$ cat /sys/kernel/debug/tracing/trace | head -n35
# tracer: nop
#
# entries-in-buffer/entries-written: 1219/1219 #P:8
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss: hdr...
z_rd_int/0-30156 [003] .... 91344.200611: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.201173: zfs_arc__miss: hdr...
z_rd_int/1-30157 [003] .... 91344.201756: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.201795: zfs_arc__miss: hdr...
z_rd_int/2-30158 [003] .... 91344.202099: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.202126: zfs_arc__hit: hdr ...
lt-zpool-30132 [003] .... 91344.202130: zfs_arc__hit: hdr ...
lt-zpool-30132 [003] .... 91344.202134: zfs_arc__hit: hdr ...
lt-zpool-30132 [003] .... 91344.202146: zfs_arc__miss: hdr...
z_rd_int/3-30159 [003] .... 91344.202457: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.202484: zfs_arc__miss: hdr...
z_rd_int/4-30160 [003] .... 91344.202866: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.202891: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.203034: zfs_arc__miss: hdr...
z_rd_iss/1-30149 [001] .... 91344.203749: zfs_new_state__mru...
lt-zpool-30132 [001] .... 91344.203789: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.203878: zfs_arc__miss: hdr...
z_rd_iss/3-30151 [001] .... 91344.204315: zfs_new_state__mru...
lt-zpool-30132 [001] .... 91344.204332: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.204337: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.204352: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.204356: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.204360: zfs_arc__hit: hdr ...
To highlight the kind of detailed information that is being exported
using this infrastructure, I've taken the first tracepoint line from the
output above and reformatted it such that it fits in 80 columns:
lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss:
hdr {
dva 0x1:0x40082
birth 15491
cksum0 0x163edbff3a
flags 0x640
datacnt 1
type 1
size 2048
spa 3133524293419867460
state_type 0
access 0
mru_hits 0
mru_ghost_hits 0
mfu_hits 0
mfu_ghost_hits 0
l2_hits 0
refcount 1
} bp {
dva0 0x1:0x40082
dva1 0x1:0x3000e5
dva2 0x1:0x5a006e
cksum 0x163edbff3a:0x75af30b3dd6:0x1499263ff5f2b:0x288bd118815e00
lsize 2048
} zb {
objset 0
object 0
level -1
blkid 0
}
For the specific tracepoint shown here, 'zfs_arc__miss', data is
exported detailing the arc_buf_hdr_t (hdr), blkptr_t (bp), and
zbookmark_t (zb) that caused the ARC miss (down to the exact DVA!).
This kind of precise and detailed information can be extremely valuable
when trying to answer certain kinds of questions.
For anybody unfamiliar but looking to build on this, I found the XFS
source code along with the following three web links to be extremely
helpful:
* http://lwn.net/Articles/379903/
* http://lwn.net/Articles/381064/
* http://lwn.net/Articles/383362/
I should also node the more "boring" aspects of this patch:
* The ZFS_LINUX_COMPILE_IFELSE autoconf macro was modified to
support a sixth paramter. This parameter is used to populate the
contents of the new conftest.h file. If no sixth parameter is
provided, conftest.h will be empty.
* The ZFS_LINUX_TRY_COMPILE_HEADER autoconf macro was introduced.
This macro is nearly identical to the ZFS_LINUX_TRY_COMPILE macro,
except it has support for a fifth option that is then passed as
the sixth parameter to ZFS_LINUX_COMPILE_IFELSE.
These autoconf changes were needed to test the availability of the Linux
tracepoint macros. Due to the odd nature of the Linux tracepoint macro
API, a separate ".h" must be created (the path and filename is used
internally by the kernel's define_trace.h file).
* The HAVE_DECLARE_EVENT_CLASS autoconf macro was introduced. This
is to determine if we can safely enable the Linux tracepoint
functionality. We need to selectively disable the tracepoint code
due to the kernel exporting certain functions as GPL only. Without
this check, the build process will fail at link time.
In addition, the SET_ERROR macro was modified into a tracepoint as well.
To do this, the 'sdt.h' file was moved into the 'include/sys' directory
and now contains a userspace portion and a kernel space portion. The
dprintf and zfs_dbgmsg* interfaces are now implemented as tracepoint as
well.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Modify the code to use the utsname() kernel function rather than
a global variable. This results is cleaner more portable code
because utsname() is already provided by the kernel and can be
easily emulated in user space via uname(2). This means that it
will behave consistently in both contexts.
This is also has the benefit that it allows the removal of a few
_KERNEL pre-processor conditions. And it also is a pre-requisite
for a proper FUSE port because we need to provide a valid utsname.
Finally, it allows us to remove this functionality from the SPL
and all the related compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2757
Add #ifndef PAGESIZE to avoid redefinition warning on platforms
where this value is already provided.
Signed-off-by: Alec Salazar <alec.j.salazar@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2588
Most of the code base already uses va_list, which is specified by
iso-c. gcc/glibc provides 'typedef __gnuc_va_list va_list'. and
when not using gcc/glibc we can't expect to find __gnuc_va_list.
Signed-off-by: Alec Salazar <alec.j.salazar@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2588
Using the ARM reference simulation (fast model foundation v8) I
cross compiled spl and zfs, to confirm it works on ARMv8 (64 bit
arm architecture, called aarch64 in Linux).
As it is based on previous ARM porting, the resulting patch is
disappointingly small, there was very little to do. The code fixes
the compile issues and has light testing done.
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2260
Add the minimum required ISA types to support the Sparc
architecture.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: marku89 <mar42@kola.li>
Issue #1700
Four new dataset properties have been added to support SELinux. They
are 'context', 'fscontext', 'defcontext' and 'rootcontext' which map
directly to the context options described in mount(8). When one of
these properties is set to something other than 'none'. That string
will be passed verbatim as a mount option for the given context when
the filesystem is mounted.
For example, if you wanted the rootcontext for a filesystem to be set
to 'system_u:object_r:fs_t' you would set the property as follows:
$ zfs set rootcontext="system_u:object_r:fs_t" storage-pool/media
This will ensure the filesystem is automatically mounted with that
rootcontext. It is equivalent to manually specifying the rootcontext
with the -o option like this:
$ zfs mount -o rootcontext=system_u:object_r:fs_t storage-pool/media
By default all four contexts are set to 'none'. Further information
on SELinux contexts is detailed in mount(8) and selinux(8) man pages.
Signed-off-by: Matthew Thode <prometheanfire@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#1504
The vast majority of these changes are in Linux specific code.
They are the result of not having an automated style checker to
validate the code when it was originally written. Others were
caused when the common code was slightly adjusted for Linux.
This patch contains no functional changes. It only refreshes
the code to conform to style guide.
Everyone submitting patches for inclusion upstream should now
run 'make checkstyle' and resolve any warning prior to opening
a pull request. The automated builders have been updated to
fail a build if when 'make checkstyle' detects an issue.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1821
Add acl, noacl and posixacl to option_map, avoiding ENOENT error
case when mount from util-linux-2.24 execs mount.zfs with any of
those flags
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: renelson <bnelson@nelsonbe.com>
Issue #1968
In the current snapshot automount implementation, it is possible for
multiple mounts to attempted concurrently. Only one of the mounts will
succeed and the other will fail. The failed mounts will cause an EREMOTE
to be propagated back to the application.
This commit works around the problem by adding a new exit status,
MOUNT_BUSY to the mount.zfs program which is used when the underlying
mount(2) call returns EBUSY. The zfs code detects this condition and
treats it as if the mount had succeeded.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1819
3582 zfs_delay() should support a variable resolution
3584 DTrace sdt probes for ZFS txg states
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
https://www.illumos.org/issues/3582illumos/illumos-gate@0689f76
Ported by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
When calling mount(), care must be taken to avoid passing in flags
that are used only by the user space utilities. Otherwise we may
stomp on flags that are reserved for other purposes in the kernel.
In particular, openSUSE 12.3 kernels have added a new MS_RICHACL
super-block flag whose value conflicts with our MS_COMMENT flag. This
causes incorrect behavior such as the umask being ignored. The
MS_COMMENT flag essentially serves as a placeholder in the option_map
data structure of zfs_mount.c, but its value is never used. Therefore
we can avoid the conflict by defining it to 0.
The MS_USERS, MS_OWNER, and MS_GROUP flags also conflict with reserved
flags in the kernel. While this is not known to have caused any
problems, it is nevertheless incorrect. For the purposes of the
mount.zfs helper, the "users", "owner", and "group" options just serve
as hints to set additional implied options. Therefore we now define
their associated mount flags in terms of the options that they imply
rather than giving them unique values.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1457