Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3542
This is a new implementation of RAIDZ1/2/3 routines using x86_64
scalar, SSE, and AVX2 instruction sets. Included are 3 parity
generation routines (P, PQ, and PQR) and 7 reconstruction routines,
for all RAIDZ level. On module load, a quick benchmark of supported
routines will select the fastest for each operation and they will
be used at runtime. Original implementation is still present and
can be selected via module parameter.
Patch contains:
- specialized gen/rec routines for all RAIDZ levels,
- new scalar raidz implementation (unrolled),
- two x86_64 SIMD implementations (SSE and AVX2 instructions sets),
- fastest routines selected on module load (benchmark).
- cmd/raidz_test - verify and benchmark all implementations
- added raidz_test to the ZFS Test Suite
New zfs module parameters:
- zfs_vdev_raidz_impl (str): selects the implementation to use. On
module load, the parameter will only accept first 3 options, and
the other implementations can be set once module is finished
loading. Possible values for this option are:
"fastest" - use the fastest math available
"original" - use the original raidz code
"scalar" - new scalar impl
"sse" - new SSE impl if available
"avx2" - new AVX2 impl if available
See contents of `/sys/module/zfs/parameters/zfs_vdev_raidz_impl` to
get the list of supported values. If an implementation is not supported
on the system, it will not be shown. Currently selected option is
enclosed in `[]`.
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4328
The libzfs_graph.c source file should have been removed in 330d06f,
it is entirely unused.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4766
ZFS allows for specific permissions to be delegated to normal users
with the `zfs allow` and `zfs unallow` commands. In addition, non-
privileged users should be able to run all of the following commands:
* zpool [list | iostat | status | get]
* zfs [list | get]
Historically this functionality was not available on Linux. In order
to add it the secpolicy_* functions needed to be implemented and mapped
to the equivalent Linux capability. Only then could the permissions on
the `/dev/zfs` be relaxed and the internal ZFS permission checks used.
Even with this change some limitations remain. Under Linux only the
root user is allowed to modify the namespace (unless it's a private
namespace). This means the mount, mountpoint, canmount, unmount,
and remount delegations cannot be supported with the existing code. It
may be possible to add this functionality in the future.
This functionality was validated with the cli_user and delegation test
cases from the ZFS Test Suite. These tests exhaustively verify each
of the supported permissions which can be delegated and ensures only
an authorized user can perform it.
Two minor bug fixes were required for test-running.py. First, the
Timer() object cannot be safely created in a `try:` block when there
is an unconditional `finally` block which references it. Second,
when running as a normal user also check for scripts using the
both the .ksh and .sh suffixes.
Finally, existing users who are simulating delegations by setting
group permissions on the /dev/zfs device should revert that
customization when updating to a version with this change.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#362Closes#434Closes#4100Closes#4394Closes#4410Closes#4487
New functionality:
- Preserves existing scalar implementation.
- Adds AVX2 optimized Fletcher-4 computation.
- Fastest routines selected on module load (benchmark).
- Test case for Fletcher-4 added to ztest.
New zcommon module parameters:
- zfs_fletcher_4_impl (str): selects the implementation to use.
"fastest" - use the fastest version available
"cycle" - cycle trough all available impl for ztest
"scalar" - use the original version
"avx2" - new AVX2 implementation if available
Performance comparison (Intel i7 CPU, 1MB data buffers):
- Scalar: 4216 MB/s
- AVX2: 14499 MB/s
See contents of `/sys/module/zcommon/parameters/zfs_fletcher_4_impl`
to get list of supported values. If an implementation is not supported
on the system, it will not be shown. Currently selected option is
enclosed in `[]`.
Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4330
GCC for MIPS only defines _LP64 when 64bit,
while no _ILP32 defined when 32bit.
Signed-off-by: YunQiang Su <syq@debian.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4712
Add argument format to print_one_column(), and use it to call
zfs_nicenum_format with, instead of just zfs_nicenum. Don't print "%"
for fragmentation or capacity percent values.
The calls to print_one_colum is made with ZFS_NICENUM_RAW if
cb->cb_literal (zpool list called with -p), and ZFS_NICENUM_1024 if not.
Also zpool_get_prop is modified to don't add "%" or "x" if literal.
Signed-off-by: Christer Ekholm <che@chrekh.se>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov
Closes#4657
This reverts commit 8302528617 and
ebecfcd699 which broke the build.
While these patches do apply cleanly and passed previous test
runs they need to be updated to account for the changes made in
commit 241b541574.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3878
The zfs range lock interface no longer tightly depends on a
znode_t and therefore can be used in ztest. This allows the
previous ztest specific implementation to be removed, and for
additional test coverage of the shared version.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4023
Issue #4024
Userland version of cv_timedwait_hires() always assumes absolute time.
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported by: Denys Rtveliashvili <denys@rtveliashvili.name>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/6739
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/41c6413
Porting Notes:
The ported change has revealed a number of problems in the Linux-specific code,
as it was expecting incorrect return codes from pthread_* functions.
Reviewed and improved the usage of pthread_* function in lib/libzpool/kernel.c.
3993 zpool(1M) and zfs(1M) should support -p for "list" and "get"
4700 "zpool get" doesn't support -H or -o options
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/3993
OpenZFS-issue: https://www.illumos.org/issues/4700
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c58b352
Porting notes:
I removed ZoL's zpool_get_prop_literal() in favor of
zpool_get_prop(..., boolean_t literal) since that's what OpenZFS
uses. The functionality is the same.
Authored by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Ported-by: Denys Rtveliashvili <denys@rtveliashvili.name>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
lzc_destroy_snaps() returns an nvlist in errlist.
zfs_destroy_snaps_nvl() should nvlist_free() it before returning.
OpenZFS-issue: https://www.illumos.org/issues/6873
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ee06391Closes#4614
Authored by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Ported-by: Denys Rtveliashvili <denys@rtveliashvili.name>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Incorrect endianness swap for drr_spill.drr_length in libzfs_sendrecv.c
Instead of drr_write.drr_length, we should be assigning the result of
the byteswap to drr_spill.drr_length.
OpenZFS-issue: https ://www.illumos.org/issues/6879
OpenZFS-commit: https ://github.com/openzfs/openzfs/commit/74c8720
Closes#4613
Correct the arguments passed to `thread_create()`.
Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4593
While OpenSolaris libc and glibc both include XDR support, the musl libc
does not in favor of depending on the BSD-licensed libtirpc library.
Adding support is a simple matter of detecting the library, including
the headers and linking against it. By default libtirpc will be checked
for and if available used. Otherwise, configure will fall back to using
the xdr implementation provided by libc if available. The options
--with-tirpc/--without-tirpc can be used to disable this checking.
In addition, the xdr_control() function has been simplied to only
handle ZFSs specific use case.
Original-patch-by: stf <s@ctrlc.hu>
Original-patch-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Signed-off-by: Carlo Landmeter <clandmeter@gmail.com>
Closes#2254Closes#4559
6659 nvlist_free(NULL) is a no-op
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Marcel Telka <marcel@telka.sk>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/issues/6659https://github.com/illumos/illumos-gate/commit/aab83bb
Ported-by: David Quigley <dpquigl@davequigley.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4566
When importing a pool using the blkid cache only the device
node path was added to the list of known paths for a device.
This results in 'zpool import' always using the sdX names
in preference to the 'path' name stored in the label.
To fix the issue the blkid import path has been updated to
add both the 'path', 'devid', and 'devname' names from the
label to the known paths. A sanity check is done to ensure
these paths do refer to the same device identified by blkid.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#4523Closes#3043
Disable the additional EFI debugging in all builds. Some users
run debug builds in production and the extra log messages can
cause confusion. Beyond that the log messages are rarely useful.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#4523
When ZFS partitions a block device it must wait for udev to create
both a device node and all the device symlinks. This process takes
a variable length of time and depends on factors such how many links
must be created, the complexity of the rules, etc. Complicating
the situation further it is not uncommon for udev to create and
then remove a link multiple times while processing the udev rules.
Given the above, the existing scheme of waiting for an expected
partition to appear by name isn't 100% reliable. At this point
udev may still remove and recreate think link resulting in the
kernel modules being unable to open the device.
In order to address this the zpool_label_disk_wait() function
has been updated to use libudev. Until the registered system
device acknowledges that it in fully initialized the function
will wait. Once fully initialized all device links are checked
and allowed to settle for 50ms. This makes it far more likely
that all the device nodes will exist when the kernel modules
need to open them.
For systems without libudev an alternate zpool_label_disk_wait()
was updated to include a settle time. In addition, the kernel
modules were updated to include retry logic for this ENOENT case.
Due to the improved checks in the utilities it is unlikely this
logic will be invoked. However, if the rare event it is needed
it will prevent a failure.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes#4523Closes#3708Closes#4077Closes#4144Closes#4214Closes#4517
When partitioning a device a name may be specified for each partition.
Internally zfs doesn't use this partition name for anything so it
has always just been set to "zfs".
However this isn't optimal because udev will create symlinks using
this name in /dev/disk/by-partlabel/. If the name isn't unique
then all the links cannot be created.
Therefore a random 64-bit value has been added to the partition
label, i.e "zfs-1234567890abcdef". Additional information could
be encoded here but since partitions may be reused that might
result in confusion and it was decided against.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes#4517
In order to remove the HAVE_PN_UTILS wrappers the pn_alloc() and
pn_free() functions must be implemented. The existing illumos
implementation were used for this purpose.
The `flags` argument which was used in places wrapped by the
HAVE_PN_UTILS condition has beed added back to zfs_remove() and
zfs_link() functions. This removes a small point of divergence
between the ZoL code and upstream.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4522
Current zpool import code skips directory entries which have prefixes
similar to some system files on linux such as "fd", "core" etc. However,
this means one cannot have one's zpools hosted inside files which are named
e.g. core-1 or lp. Furthermore, apart from the string checks there is already
which makes the zpool_open_func work only with regular files and block devices.
To fix this problem remove most of the checks since they are redundant but
leave the checks for the 'hpet' and 'watchdog' names. Furthermore, change
the checks to strcmp which albeit less safe than strncmp allows to have
devices whose names are prefixed by 'hpet' or 'watchdog'.
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4438
Also enable lazytime in mount.zfs
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4482
This is foundational work for ZED.
Updates a leaf vdev's persistent device strings on Linux platform
* only applies for a dedicated leaf vdev (aka whole disk)
* updated during pool create|add|attach|import
* used for matching device matching during auto-{online,expand,replace}
* stored in a leaf disk config label (i.e. alongside 'path' NVP)
* can opt-out using env var ZFS_VDEV_DEVID_OPT_OUT=YES
Some examples:
path: '/dev/sdb1'
devid: 'scsi-350000394a8ca4fbc-part1'
phys_path: 'pci-0000:04:00.0-sas-0x50000394a8ca4fbf-lun-0'
path: '/dev/mapper/mpatha'
devid: 'dm-uuid-mpath-35000c5006304de3f'
Signed-off-by: Don Brady <don.brady@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2856Closes#3978Closes#4416
Effectively provide our own version of assert()/verify() for use
in user space. This minimizes our dependencies and aligns the
user space assertion handling with what's used in the kernel.
Signed-off-by: Carlo Landmeter <clandmeter@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4449
hrtime_t timestruc_t and timespec_t should have originally been
included in sys/time.h so lets move them.
longlong_t is not defined by any standard so change it to long long
Signed-off-by: Carlo Landmeter <clandmeter@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4459
This issue was caused by calling `thread_init()` and `thread_fini()`
multiple times resulting in `kthread_key` being invalid. To resolve
the issue the explicit calls to `thread_init()` and `thread_fini()`
required by the `zpool` command have been moved in to the command.
Consumers such as `zdb` and `zhack` perform the same initialized
through `kernel_init()` and `kernel_fini()`.
Resolving this issue allows multiple additional test cases to
be enabled.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#4331
I noticed during code review of zfsonlinux/zfs#4385 that the author of a
commit had peppered the various Makefile.am files with `$(TIRPC_LIBS)`
when putting it into `lib/libspl/Makefile.am` should have sufficed. Upon
further examination, it seems that he had copied what we do with
`$(ZLIB)`. We also have a bit of that with `-ldl` too. Unfortunately,
what we do is wrong, so lets fix it to set a good example for future
contributors.
In addition, we have multiple `-lz` and `-luuid` passed to the compiler
because each `AC_CHECK_LIB` adds it to `$LIBS`. That is somewhat
annoying to see, so we switch to `AC_SEARCH_LIBS` to avoid it. This is
consistent with the recommendation to use `AC_SEARCH_LIBS` over
`AC_CHECK_LIB` by autotools upstream:
https://www.gnu.org/software/autoconf/manual/autoconf-2.66/html_node/Libraries.html
In an ideal world, this would translate into improvements in ELF's
`DT_NEEDED` entries, but that is not the case because of a couple of
bugs in libtool.
The first bug causes libtool to overlink by using static link
dependencies for dynamic linking:
https://wiki.mageia.org/en/Overlinking_issues_in_packaging#libtool_issues
The workaround for this should be to pass `-Wl,--as-needed` in
`LDFLAGS`. That leads us to the second bug, where libtool passes
`LDFLAGS` after the libraries are specified and `ld` will only honor
`--as-needed` on libraries specified before it:
https://sigquit.wordpress.com/2011/02/16/why-asneeded-doesnt-work-as-expected-for-your-libraries-on-your-autotools-project/
There are a few possible workarounds for the second bug. One is to
either patch the compiler spec file to specify `-Wl,--as-needed` or pass
`-Wl,--as-needed` via `CC` like `CC='gcc -Wl,--as-needed'` so that it is
specified early. Another is to patch ltmain.sh like Gentoo does:
https://gitweb.gentoo.org/repo/gentoo.git/tree/eclass/ELT-patches/as-needed
Without one of those workarounds, this cleanup provides no benefit in
terms of `DT_NEEDED` entry generation. It should still be an improvement
because it nicely simplifies the code while encouraging good habits when
patching autotools scripts.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4426
Signed-off-by: Dimitri John Ledkov <xnox@ubuntu.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4425
Add the ZFS Test Suite and test-runner framework from illumos.
This is a continuation of the work done by Turbo Fredriksson to
port the ZFS Test Suite to Linux. While this work was originally
conceived as a stand alone project integrating it directly with
the ZoL source tree has several advantages:
* Allows the ZFS Test Suite to be packaged in zfs-test package.
* Facilitates easy integration with the CI testing.
* Users can locally run the ZFS Test Suite to validate ZFS.
This testing should ONLY be done on a dedicated test system
because the ZFS Test Suite in its current form is destructive.
* Allows the ZFS Test Suite to be run directly in the ZoL source
tree enabled developers to iterate quickly during development.
* Developers can easily add/modify tests in the framework as
features are added or functionality is changed. The tests
will then always be in sync with the implementation.
Full documentation for how to run the ZFS Test Suite is available
in the tests/README.md file.
Warning: This test suite is designed to be run on a dedicated test
system. It will make modifications to the system including, but
not limited to, the following.
* Adding new users
* Adding new groups
* Modifying the following /proc files:
* /proc/sys/kernel/core_pattern
* /proc/sys/kernel/core_uses_pid
* Creating directories under /
Notes:
* Not all of the test cases are expected to pass and by default
these test cases are disabled. The failures are primarily due
to assumption made for illumos which are invalid under Linux.
* When updating these test cases it should be done in as generic
a way as possible so the patch can be submitted back upstream.
Most existing library functions have been updated to be Linux
aware, and the following functions and variables have been added.
* Functions:
* is_linux - Used to wrap a Linux specific section.
* block_device_wait - Waits for block devices to be added to /dev/.
* Variables: Linux Illumos
* ZVOL_DEVDIR "/dev/zvol" "/dev/zvol/dsk"
* ZVOL_RDEVDIR "/dev/zvol" "/dev/zvol/rdsk"
* DEV_DSKDIR "/dev" "/dev/dsk"
* DEV_RDSKDIR "/dev" "/dev/rdsk"
* NEWFS_DEFAULT_FS "ext2" "ufs"
* Many of the disabled test cases fail because 'zfs/zpool destroy'
returns EBUSY. This is largely causes by the asynchronous nature
of device handling on Linux and is expected, the impacted test
cases will need to be updated to handle this.
* There are several test cases which have been disabled because
they can trigger a deadlock. A primary example of this is to
recursively create zpools within zpools. These tests have been
disabled until the root issue can be addressed.
* Illumos specific utilities such as (mkfile) should be added to
the tests/zfs-tests/cmd/ directory. Custom programs required by
the test scripts can also be added here.
* SELinux should be either is permissive mode or disabled when
running the tests. The test cases should be updated to conform
to a standard policy.
* Redundant test functionality has been removed (zfault.sh).
* Existing test scripts (zconfig.sh) should be migrated to use
the framework for consistency and ease of testing.
* The DISKS environment variable currently only supports loopback
devices because of how the ZFS Test Suite expects partitions to
be named (p1, p2, etc). Support must be added to generate the
correct partition name based on the device location and name.
* The ZFS Test Suite is part of the illumos code base at:
https://github.com/illumos/illumos-gate/tree/master/usr/src/test
Original-patch-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes#6Closes#1534
sys/param.h depends on types defined in sys/types.h
(hrtime_t & timestruc_t).
Signed-off-by: Gordan Bobic <gordan@redsleeve.org>
Signed-off-by: Christopher J. Morrone <morrone2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4420
zfsonlinux issue #2217 - zvol minor operations: check snapdev
property before traversing snapshots of a dataset
zfsonlinux issue #3681 - lock order inversion between zvol_open()
and dsl_pool_sync()...zvol_rename_minors()
Create a per-pool zvol taskq for asynchronous zvol tasks.
There are a few key design decisions to be aware of.
* Each taskq must be single threaded to ensure tasks are always
processed in the order in which they were dispatched.
* There is a taskq per-pool in order to keep the pools independent.
This way if one pool is suspended it will not impact another.
* The preferred location to dispatch a zvol minor task is a sync
task. In this context there is easy access to the spa_t and
minimal error handling is required because the sync task must
succeed.
Support for asynchronous zvol minor operations address issue #3681.
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2217Closes#3678Closes#3681
Added by-partlabel and by-partuuid to the default device search
path. Made made device names in by-label more preferable.
Signed-off-by: Thijs Cramer <thijs.cramer@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3892
Historically libblkid support was detected as part of configure
and optionally enabled. This was done because at the time support
for detecting ZFS pool vdevs had just be added to libblkid and
those updated packages were not yet part of many distributions.
This is no longer the case and any reasonably current distribution
will ship a version of libblkid which can detect ZFS pool vdevs.
This patch makes libblkid mandatory at build time and libblkid
the preferred method of scanning for ZFS pools. For distributions
which include a modern version of libblkid there is no change in
behavior. Explicitly scanning the default search paths is still
supported and can be enabled with the '-s' command line option.
Additionally making libblkid mandatory means that the 'zpool create'
command can reliably detect if a specified device has an existing
non-ZFS filesystem (ext4, xfs) and print a warning.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2448
print_vdev_stats() subtracts the old bandwidth/ops stats from the new stats
to calculate the bandwidth/ops numbers in "zpool iostat". However when the
TXG numbers change between stats, zpool_refresh_stats() will incorrectly assign
a NULL to the old stats. This causes print_vdev_stats() to use zeroes for
the old bandwidth/ops numbers, resulting in an inaccurate calculation.
This fix allows the calculation to happen even when TXGs change.
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4387
The following options have been added to the zpool add, iostat,
list, status, and split subcommands. The default behavior was
not modified, from zfs(8).
-g Display vdev GUIDs instead of the normal short
device names. These GUIDs can be used in-place of
device names for the zpool detach/off‐
line/remove/replace commands.
-L Display real paths for vdevs resolving all symbolic
links. This can be used to lookup the current block
device name regardless of the /dev/disk/ path used
to open it.
-p Display full paths for vdevs instead of only the
last component of the path. This can be used in
conjunction with the -L flag.
This behavior may also be enabled using the following environment
variables.
ZPOOL_VDEV_NAME_GUID
ZPOOL_VDEV_NAME_FOLLOW_LINKS
ZPOOL_VDEV_NAME_PATH
This change is based on worked originally started by Richard Yao
to add a -g option. Then extended by @ilovezfs to add a -L option
for openzfsonosx. Those changes have been merged, re-factored,
a -p option added and extended to all relevant zpool subcommands.
Original-patch-by: Richard Yao <ryao@gentoo.org>
Extended-by: ilovezfs <ilovezfs@icloud.com>
Extended-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: ilovezfs <ilovezfs@icloud.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2011Closes#4341
When extracting tokens from the string strtok(2) is allowed to modify
the passed buffer. Therefore the zfs_strcmp_pathname() function must
make a copy of the passed string before passing it to strtok(3).
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@intel.com>
Closes#4312
LLVM's static analyzer showed that we could subtract using an
uninitialized value on an error from vn_rdwr().
The correct behavior is to return -1 on an error, so lets do that
instead.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4104
1778 Assertion failed: rn->rn_nozpool == B_FALSE
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
References:
https://www.illumos.org/issues/1778https://github.com/illumos/illumos-gate/commit/bd0f709
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
5518 Memory leaks in libzfs import implementation
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serghei Samsi <sscdvp@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5518https://github.com/illumos/illumos-gate/commit/078266a
Porting notes:
- One hunk of this change was already applied independently in
commit 4def05f.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
6815179 zpool import with a large number of LUNs is too slow
6844191 zpool import, scanning of disks should be multi-threaded
References:
https://github.com/illumos/illumos-gate/commit/4f67d75
Porting notes:
- This change was originally never ported to Linux due to it
dependence on the thread pool interface. This patch solves
that issue by switching the code to use the existing taskq
implementation which provides the same basic functionality.
However, in order for this to work properly thread_init()
and thread_fini() must be called around to taskq consumer
to perform the needed thread initialization.
- The check_one_slice, nozpool_all_slices, and check_slices
functions have been disabled for Linux. They are difficult,
but possible, to implement for Linux due to how partitions
are get names. Since this is only an optimization this code
can be added at a latter date.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The pfn_t typedef was inherited from Illumos but never directly
used by any libspl consumers. This doesn't cause any issues in
user space but for consistency with the kernel build it has been
removed. See torvalds/linux/commit/34c0fd54.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Issue #4228
4953 zfs rename <snapshot> need not involve libshare
4954 "zfs create" need not involve libshare if we are not sharing
4955 libshare's get_zfs_dataset need not sort the datasets
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
https://www.illumos.org/issues/4953https://www.illumos.org/issues/4954https://www.illumos.org/issues/4955https://github.com/illumos/illumos-gate/commit/33cde0d
Porting notes:
- Dropped qsort libshare_zfs.c hunk, no equivalent ZoL code.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4219
6298 zfs_create_008_neg and zpool_create_023_neg need to be updated
for large block support
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/issues/6298https://github.com/illumos/illumos-gate/commit/e9316f7
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4217
3557 dumpvp_size is not updated correctly when a dump zvol's size is changed
3558 setting the volsize on a dump device does not return back ENOSPC
3559 setting a volsize larger than the space available sometimes succeeds
3560 dumpadm should be able to remove a dump device
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Albert Lee <trisk@nexenta.com>
References:
https://www.illumos.org/issues/3559https://github.com/illumos/illumos-gate/commit/c61ea56
Porting notes:
- Internal zvol.c changes not applied due to implementation differences.
The external interface and behavior was already consistent with the
latest upstream code.
- Retired 2.6.28 HAVE_CHECK_DISK_SIZE_CHANGE configure check. All
supported kernels (2.6.32 and newer) provide this interface.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4217
6280 libzfs: unshare_one() could fail with EZFS_SHARENFSFAILED
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References:
https://www.illumos.org/issues/6280https://github.com/illumos/illumos-gate/commit/d1672ef
Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
6358 A faulted pool with only unavailable vdevs triggers assertion
failure in libzfs
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed by: Serban Maduta <serban.maduta@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://illumos.org/issues/6358https://github.com/illumos/illumos-gate/commit/b289d04
Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
kernel.c: In function 'vn_rdwr':
kernel.c:736:8: warning: unused variable 'status' [-Wunused-variable]
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
4891 want zdb option to dump all metadata
Reviewed by: Sonu Pillai <sonu.pillai@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Garrett D'Amore <garrett@damore.org>
We'd like a way for zdb to dump metadata in a machine-readable
format, so that we can bring that back from a customer site for
in-house diagnosis. Think of it as a crash dump for zpools,
which can be used for post-mortem analysis of a malfunctioning
pool
References:
https://www.illumos.org/issues/4891https://github.com/illumos/illumos-gate/commit/df15e41
Porting notes:
- [cmd/zdb/zdb.c]
- a5778ea zdb: Introduce -V for verbatim import
- In main() getopt 'opt' variable removed and the code was
brought back in line with illumos.
- [lib/libzpool/kernel.c]
- 1e33ac1 Fix Solaris thread dependency by using pthreads
- f0e324f Update utsname support
- 4d58b69 Fix vn_open/vn_rdwr error handling
- In vn_open() allocate 'dumppath' on heap instead of stack
- Properly handle 'dump_fd == -1' error path
- Free 'realpath' after added vn_dumpdir_code block
Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
5745 zfs set allows only one dataset property to be set at a time
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Reviewed by: Richard PALO <richard@NetBSD.org>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Approved by: Rich Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/5745https://github.com/illumos/illumos-gate/commit/3092556
Porting notes:
- Fix the missing braces around initializer, zfs_cmd_t zc = {"\0"};
- Remove extra format argument in zfs_do_set()
- Declare at the top:
- zfs_prop_t prop;
- nvpair_t *elem;
- nvpair_t *next;
- int i;
- Additionally initialize:
- int added_resv = 0;
- zfs_prop_t prop = 0;
- Assign 0 install of NULL for uint64_t types.
- zc->zc_nvlist_conf = '\0';
- zc->zc_nvlist_src = '\0';
- zc->zc_nvlist_dst = '\0';
Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3574
For some arm, powerpc, and sparc platforms it was possible that
neither _ILP32 of _LP64 would be defined. Update the isa_defs.h
header to explicitly set these macros and generate a compile error
in the case neither are defined.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes#4048
Continuations should be indented four spaces.
Signed-off-by: ilovezfs <ilovezfs@icloud.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4062
Adding additional entries to the efi conversion array will help prevent
the overwriting of the GPTs of disks with in-use file systems in more
cases. Most notably, this adds partition type 8300 "Linux filesystem"
(0FC63DAF-8483-4772-8E79-3D69D8477DE4), which is often used for ext4 and
btrfs, among others.
This commit itself does nothing to address the underlying problematic
behavior that check_slice() isn't called on partitions of an
unrecognized type, even when they contain a currently mounted file
system.
The additional entries were derived from these two resources:
https://en.wikipedia.org/wiki/GUID_Partition_Tablehttp://sourceforge.net/p/gptfdisk/code/ci/master/tree/parttypes.cc
Signed-off-by: ilovezfs <ilovezfs@icloud.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4016
The original P2ROUNDUP and P2ROUNDUP_TYPED macros contain -x which
triggers PaX's integer overflow detection for unsigned integers.
Replace the macros with an equivalent version that does not trigger
the overflow.
Axioms:
A. (-(x)) === (~((x) - 1)) === (~(x) + 1) under two's complement.
B. ~(x & y) === ((~(x)) | (~(y))) under De Morgan's law.
C. ~(~x) === x under the law of excluded middle.
Proof:
0. (-(-(x) & -(align))) original
1. (~(-(x) & -(align)) + 1) by A
2. (((~(-(x))) | (~(-(align)))) + 1) by B
3. (((~(~((x) - 1))) | (~(~((align) - 1)))) + 1) by A
4. (((((x) - 1)) | (((align) - 1))) + 1) by C
Q.E.D.
Signed-off-by: Jason Zaman <jason@perfinion.com>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3949
As of gcc 5.1.1 20150618 (Red Hat 5.1.1-4) the -Werror=maybe-uninitialized
check detects that 'snapname' in recv_incremental_replication() may not be
initialized. Explicitly initialize the variable to resolved the warning.
libzfs_sendrecv.c: In function ‘recv_incremental_replication’:
libzfs_sendrecv.c:2019:2: error: ‘snapname’ may be used uninitialized in
(void) snprintf(buf, sizeof (buf), "%s@%s", fsname, snapname);
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
EDOM may occur if a user tries to set `recordsize` too large without
use "zfs set". This can be demonstrated with:
> zpool create testpool -O recordsize=32M /dev/...
Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3911
ZFS incorrectly uses directory-based extended attributes even when
xattr=sa is specified as a dataset property or mount option. Support to
honor temporary mount options including "xattr" was added in commit
0282c4137e. There are two issues with the
mount option handling:
* Libzfs has historically included "xattr" in its list of default mount
options. This overrides the dataset property, so the dataset is always
configured to use directory-based xattrs even when the xattr dataset
property is set to off or sa. Address this by removing "xattr" from
the set of default mount options in libzfs.
* There was no way to enable system attribute-based extended attributes
using temporary mount options. Add the mount options "saxattr" and
"dirxattr" which enable the xattr behavior their names suggest. This
approach has the advantages of mirroring the valid xattr dataset
property values and following existing conventions for mount option
names.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3787
Add the required kernel side infrastructure to parse arbitrary
mount options. This enables us to support temporary mount
options in largely the same way it is handled on other platforms.
See the 'Temporary Mount Point Properties' section of zfs(8)
for complete details.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#985Closes#3351
Since ZoL allows large blocks to be used by volumes, unlike upstream
illumos, the feature flag must be checked prior to volume creation.
This is critical because unlike filesystems, volumes will create a
object which uses large blocks as part of the create. Therefore, it
cannot be safely checked in zfs_check_settable() after the dataset
can been created.
In addition this patch updates the relevant error messages to use
zfs_nicenum() to print the maximum blocksize.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3591
The zpool create subcomand should not return an error on debug builds of
the userland tools when given zvols.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3595
Under Linux filesystem threads responsible for handling I/O are
normally created with the maximum priority. Non-I/O filesystem
processes run with the default priority. ZFS should adopt the
same priority scheme under Linux to maintain good performance
and so that it will complete fairly when other Linux filesystems
are active. The priorities have been updated to the following:
$ ps -eLo rtprio,cls,pid,pri,nice,cmd | egrep 'z_|spl_|zvol|arc|dbu|meta'
- TS 10743 19 -20 [spl_kmem_cache]
- TS 10744 19 -20 [spl_system_task]
- TS 10745 19 -20 [spl_dynamic_tas]
- TS 10764 19 0 [dbu_evict]
- TS 10765 19 0 [arc_prune]
- TS 10766 19 0 [arc_reclaim]
- TS 10767 19 0 [arc_user_evicts]
- TS 10768 19 0 [l2arc_feed]
- TS 10769 39 0 [z_unmount]
- TS 10770 39 -20 [zvol]
- TS 11011 39 -20 [z_null_iss]
- TS 11012 39 -20 [z_null_int]
- TS 11013 39 -20 [z_rd_iss]
- TS 11014 39 -20 [z_rd_int_0]
- TS 11022 38 -19 [z_wr_iss]
- TS 11023 39 -20 [z_wr_iss_h]
- TS 11024 39 -20 [z_wr_int_0]
- TS 11032 39 -20 [z_wr_int_h]
- TS 11033 39 -20 [z_fr_iss_0]
- TS 11041 39 -20 [z_fr_int]
- TS 11042 39 -20 [z_cl_iss]
- TS 11043 39 -20 [z_cl_int]
- TS 11044 39 -20 [z_ioctl_iss]
- TS 11045 39 -20 [z_ioctl_int]
- TS 11046 39 -20 [metaslab_group_]
- TS 11050 19 0 [z_iput]
- TS 11121 38 -19 [z_wr_iss]
Note that under Linux the meaning of a processes priority is inverted
with respect to illumos. High values on Linux indicate a _low_ priority
while high value on illumos indicate a _high_ priority.
In order to preserve the logical meaning of the minclsyspri and
maxclsyspri macros when they are used by the illumos wrapper functions
their values have been inverted. This way when changes are merged
from upstream illumos we won't need to remember to invert the macro.
It could also lead to confusion.
This patch depends on https://github.com/zfsonlinux/spl/pull/466.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#3607
5376 arc_kmem_reap_now() should not result in clearing arc_no_grow
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5376https://github.com/illumos/illumos-gate/commit/2ec99e3
Porting Notes:
The good news is that many of the recent changes made upstream to the
ARC tackled issues previously observed by ZoL with similar solutions.
The bad news is those solution weren't identical to the ones we applied.
This patch is designed to split the difference and apply as much of the
upstream work as possible.
* The arc_available_memory() function was removed previous in ZoL but
due to the upstream changes it makes sense to add it back. This function
has been customized for Linux so that it can be used to determine a low
memory. This provides the same basic functionality as the illumos version
allowing us to minimize changes through the rest of the code base. The
exact mechanism used to detect a low memory state remains unchanged so
this change isn't a significant as it might first appear.
* This patch includes the long standing fix for arc_shrink() which was
originally proposed in #2167. Since there were related changes to this
function it made sense to include that work.
* The arc_init() function has been re-factored. As before it sets sane
default values for the ARC but then calls arc_tuning_update() to apply
user specific tuning made via module options. The arc_tuning_update()
function is then called periodically by the arc_reclaim_thread() to
apply changes to the tunings made during normal operation.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3616Closes#2167
Build products from an out of tree build should be written
relative to the build directory. Sources should be referred
to by their locations in the source directory.
This is accomplished by adding the 'src' and 'obj' variables
for the module Makefile.am, using relative paths to reference
source files, and by setting VPATH when source files are not
co-located with the Makefile. This enables the following:
$ mkdir build
$ cd build
$ ../configure \
--with-spl=$HOME/src/git/spl/ \
--with-spl-obj=$HOME/src/git/spl/build
$ make -s
This change also has the advantage of resolving the following
warning which is generated by modern versions of automake.
Makefile.am:00: warning: source file 'xxx' is in a subdirectory,
Makefile.am:00: but option 'subdir-objects' is disabled
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1082
We fail to make partitions on xvd (Xen Virtual Block) devices. This also
causes debug builds of zpool create to return an error when given xen
virtual block devices. These devices should be given the same treatment
as vd (KVM Virtual Block) devices, so we adjust the relevant code paths.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3576
5163 arc should reap range_seg_cache
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5163https://github.com/illumos/illumos-gate/commit/83803b5
Porting Notes:
Added umem_cache_reap_now() wrapped to suppress unused variable
warning for user space build in arc_kmem_reap_now().
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Over the years the default values for the taskqs used on Linux have
differed slightly from illumos. In the vast majority of cases this
was done to avoid creating an obnoxious number of idle threads which
would pollute the process listing.
With the addition of support for dynamic taskqs all multi-threaded
queues should be created as dynamic taskqs. This allows us to get
the best of both worlds.
* The illumos default values for the I/O pipeline can be restored.
These values are known to work well for most workloads. The only
exception is the zio write interrupt taskq which is changed to
ZTI_P(12, 8). At least under Linux more threads has been shown
to improve performance, see commit 7e55f4e.
* Reduces the number of idle threads on the system when it's not
under heavy load. The maximum number of threads will only be
created when they are required.
* Remove the vdev_file_taskq and rely on the system_taskq instead
which is now dynamic and may have up to 64-threads. Again this
brings us back inline with upstream.
* Tasks dispatched with taskq_dispatch_ent() are allowed to use
dynamic taskqs. The Linux taskq implementation supports this.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#3507
Added for upstream compatibility, they are of the form:
* IMPLY(a, b) - if (a) then (b)
* EQUIV(a, b) - if (a) then (b) *AND* if (b) then (a)
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
SPL commit behlendorf/spl@9cef1b5 adds the taskq_wait_outstanding()
interface. See the commit log for the full justification for this
addition. This patch adds the required user space counterpart.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Porting notes and other significant code changes:
The illumos 5368 patch (ARC should cache more metadata), which
was never picked up by ZoL, is mostly reverted by this patch.
Since ZoL relies on the kernel asynchronously calling the shrinker to
actually reap memory, the shrinker wakes up arc_reclaim_waiters_cv every
time it runs.
The arc_adapt_thread() function no longer calls arc_do_user_evicts()
since the newly-added arc_user_evicts_thread() calls it periodically.
Notable conflicting ZoL commits which conflicted with this patch or
whose effects are either duplicated or un-done by this patch:
302f753 - Integrate ARC more tightly with Linux
39e055c - Adjust arc_p based on "bytes" in arc_shrink
f521ce1 - Allow "arc_p" to drop to zero or grow to "arc_c"
77765b5 - Remove "arc_meta_used" from arc_adjust calculation
94520ca - Prune metadata from ghost lists in arc_adjust_meta
Trace support for multilist_insert() and multilist_remove() has been
added and produces the following output:
fio-12498 [077] .... 112936.448324: zfs_multilist__insert: ml { offset 240 numsublists 80 sublistidx 63 }
fio-12498 [077] .... 112936.448347: zfs_multilist__remove: ml { offset 240 numsublists 80 sublistidx 29 }
The following arcstats have been removed:
recycle_miss - Used by arcstat.py and arc_summary.py, both of which
have been updated appropriately.
l2_writes_hdr_miss
The following arcstats have been added:
evict_not_enough - Number of times arc_evict_state() was unable to
evict enough buffers to reach its target amount.
evict_l2_skip - Number of times arc_evict_hdr() skipped eviction
because it was being written to the l2arc.
l2_writes_lock_retry - Replaces l2_writes_hdr_miss. Number of times
l2arc_write_done() failed to acquire hash_lock (and re-tries).
arc_meta_min - Shows the value of the zfs_arc_meta_min module
parameter (see below).
The "index" column of the "dbuf" kstat has been removed since it doesn't
have a direct analog in the new multilist scheme. Additional multilist-
related stats could be added in the future but would likely require
extensions to the mulilist API.
The following module parameters have been added:
zfs_arc_evict_batch_limit - Number of ARC headers to free per sub-list
before moving on to the next sub-list.
zfs_arc_meta_min - Enforce a floor on the amount of metadata in
the ARC.
zfs_arc_num_sublists_per_state - Number of multilist sub-lists per
ARC state.
zfs_arc_overflow_shift - Controls amount by which the ARC must exceed
the target size to be considered "overflowing".
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov
All fprintf() error messages are moved out of the libzfs_init()
library function where they never belonged in the first place. A
libzfs_error_init() function is added to provide useful error
messages for the most common causes of failure.
Additionally, in libzfs_run_process() the 'rc' variable was renamed
to 'error' for consistency with the rest of the code base.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
While module loading itself is synchronous the creation of the /dev/zfs
device is not. This is because /dev/zfs is typically created by a udev
rule after the module is registered and presented to user space through
sysfs. This small window between module loading and device creation
can result in spurious failures of libzfs_init().
This patch closes that race by extending libzfs_init() so it can detect
that the modules are loaded and only if required wait for the /dev/zfs
device to be created. This allows scripts to reliably use the following
shell construct without the need for additional error handling.
$ /sbin/modprobe zfs && /sbin/zpool import -a
To minimize the potential time waiting in libzfs_init() a strategy
similar to adaptive mutexes is employed. The function will busy-wait
for up to 10ms based on the expectation that the modules were just
loaded and therefore the /dev/zfs will be created imminently. If it
takes longer than this it will fall back to polling for up to 10 seconds.
This behavior can be customized to some degree by setting the following
new environment variables. This functionality is provided for backwards
compatibility with existing scripts which depend on the module auto-load
behavior. By default module auto-loading is now disabled.
* ZFS_MODULE_LOADING="YES|yes|ON|on" - Attempt to load modules.
* ZFS_MODULE_TIMEOUT="<seconds>" - Seconds to wait for /dev/zfs
The zfs-import-* systemd service files have been updated to call
'/sbin/modprobe zfs' so they no longer rely on the legacy auto-loading
behavior.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#2556
Prefixing an octal value with a leading zero is the standard way
to disambiguate it. This change only impacts the `zfs diff` output
and is therefore very limited in scope.
Signed-off-by: Hajo M<C3><B6>ller <dasjoe@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3417
5765 add support for estimating send stream size with lzc_send_space when source is a bookmark
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Albert Lee <trisk@nexenta.com>
References:
https://www.illumos.org/issues/5765https://github.com/illumos/illumos-gate/commit/643da460
Porting notes:
* Unused variable 'recordsize' in dmu_send_estimate() dropped
Ported-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3397
5027 zfs large block support
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5027https://github.com/illumos/illumos-gate/commit/b515258
Porting Notes:
* Included in this patch is a tiny ISP2() cleanup in zio_init() from
Illumos 5255.
* Unlike the upstream Illumos commit this patch does not impose an
arbitrary 128K block size limit on volumes. Volumes, like filesystems,
are limited by the zfs_max_recordsize=1M module option.
* By default the maximum record size is limited to 1M by the module
option zfs_max_recordsize. This value may be safely increased up to
16M which is the largest block size supported by the on-disk format.
At the moment, 1M blocks clearly offer a significant performance
improvement but the benefits of going beyond this for the majority
of workloads are less clear.
* The illumos version of this patch increased DMU_MAX_ACCESS to 32M.
This was determined not to be large enough when using 16M blocks
because the zfs_make_xattrdir() function will fail (EFBIG) when
assigning a TX. This was immediately observed under Linux because
all newly created files must have a security xattr created and
that was failing. Therefore, we've set DMU_MAX_ACCESS to 64M.
* On 32-bit platforms a hard limit of 1M is set for blocks due
to the limited virtual address space. We should be able to relax
this one the ABD patches are merged.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#354
The umem_alloc_aligned() function should not assume that a 'void *'
type is 64-bit. It will not be on 32-bit platforms. Rather than
complicating the ASSERT to handle this it is simply removed.
Additionally, the '%lu' format specifier should not be assumed to
imply a 64-bit value. Fix this by using the 'llu' format specifier
which will always be atleast 64-bit and explicitly casing the
variable to an u_longlong_t. This issue is handled the same way
in many other parts of the code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
5244 zio pipeline callers should explicitly invoke next stage
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Approved by: Gordon Ross <gwr@nexenta.com>
References:
https://www.illumos.org/issues/5244https://github.com/illumos/illumos-gate/commit/738f37b
Porting Notes:
1. The unported "2932 support crash dumps to raidz, etc. pools"
caused a merge conflict due to a copyright difference in
module/zfs/vdev_raidz.c.
2. The unported "4128 disks in zpools never go away when pulled"
and additional Linux-specific changes caused merge conflicts in
module/zfs/vdev_disk.c.
Ported-by: Richard Yao <richard.yao@clusterhq.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2828
3897 zfs filesystem and snapshot limits
Author: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>
References:
https://www.illumos.org/issues/3897https://github.com/illumos/illumos-gate/commit/a2afb61
Porting Notes:
dsl_dataset_snapshot_check(): reduce stack usage using kmem_alloc().
Ported-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
5134 if ZFS_DEBUG or debug= is set, libzpool should enable debug prints
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/projects/illumos-gate/issues/5134https://github.com/illumos/illumos-gate/commit/7fa49ea
Porting notes:
Added dprintf_setup() to main in zfs_main.c and zpool_main.c.
Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2669
As of this commit, volumes do not possess the version property in
any existing OpenZFS implementation. The zpool upgrade code, however,
uses zfs_prop_get_int() to fetch the version property of all children in
a pool. The semantics of the function, however, demand that it only be
used for known valid properties so it returns a garbage value for volumes.
This patch causes the version of a volume to appear to callers using
the zfs_prop_get_int() API to be that of the default ZPL version for
the implementation. In the future, should volumes gain the property,
its actual value will be used.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes#3313
Make the 'zpool import' command honor the overlay property to allow
filesystems to be mounted on a non-empty directory. As it stands now
this property is only checked by the 'zfs mount' command. Move the
check into 'zfs_mount()` in libzpool so the property is honored for all
callers.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3227
When using 'zpool import' to scan for available pools prefer vdev names
which reference vdevs with more valid labels. There should be two labels
at the start of the device and two labels at the end of the device. If
labels are missing then the device has been damaged or is in some other
way incomplete. Preferring names with fully intact labels helps weed out
bad paths and improves the likelihood of being able to import the pool.
This behavior only applies when scanning /dev/ for valid pools. If a
cache file exists the pools described by the cache file will be used.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Closes#3145Closes#2844Closes#3107
I noticed when reviewing documentation that it is possible for user
space to use fctnl(fd, F_SETPIPE_SZ, (unsigned long) size) to change
the kernel pipe buffer size on Linux to increase the pipe size up to
the value specified in /proc/sys/fs/pipe-max-size. There are users using
mbuffer to improve zfs recv performance when piping over the network, so
it seems advantageous to integrate such functionality directly into the
zfs recv tool. This avoids the addition of two buffers and two copies
(one for the buffer mbuffer adds and another for the additional pipe),
so it should be more efficient.
This could have been made configurable and/or this could have changed
the value back to the original after we were done with the file
descriptor, but I do not see a strong case for doing either, so I
went with a simple implementation.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1161
The function zfs_nicenum that converts number to human-readable output
uses a index to a string of letters. This patch limits the index to
the length of the string.
Signed-off-by: Christer Ekholm <che@chrekh.se>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3122
There have been multiple reports of 'zdb' tripping the VERIFY in
mutex_destroy() because pthread_mutex_destroy() returns EBUSY.
Exactly how this can happen still needs to be explained, but this
doesn't strictly need to be fatal for non-debug builds. Therefore,
this patch converts the VERIFY to an ASSERT until the root cause
is determined and resolved.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2027
In order to accelerate zfs receive operations in the face of many
property-containing snapshots, commit 0574855 changed the header nvlist
("fss") of a send stream to exclude snapshots which aren't part of the
stream. This, however, would cause zfs receive -F to erroneously remove
snapshots; it would remove any snapshot which wasn't listed in the header
nvlist.
This patch restores the full list of snapshots in fss[<id>[snaps]] but
still suppresses the properties of non-sent snapshots and also removes a
consistency check in which an error is raised if a listed snapshot does
not have any properties in fss[<id>[snapprops]].
The 0574855 commit also introduced a bug in which zfs send -p of a
complete stream (zfs send -p pool/fs@snap) would exclude the snapshot
properties in fss[<id>[snapprops]]. This patch detects the last snapshot
in a series when no "from" snapshot has been specified and includes its
properties.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2907
If spl_hostid is set via module parameter, it's likely different from
gethostid(). Therefore, the userspace tool should read it first before
falling back to gethostid().
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3034
This was caught by Clang. Clearly the intent of this code was
to explicitly ignore the return value.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3054
When importing pools with zpool import -aN there is inconsistent
behavior between '-d /dev/disk/by-id' (or another path) and
'-c /etc/zfs/zpool.cache'.
The difference in behavior is caused by zpool_find_import_cached()
returning an empty nvlist_t when there are no pools to import but
zpool_find_import_impl() returns NULL for the same situation. The
behavior of zpool_find_import_cached() is arguably more correct
because it allows returning NULL to be used for an error case and
not an empty set.
This change resolves the issue by updating get_configs() such that
it returns an empty set instead of NULL when no config is found.
The updated behavior will now always return 0 for this case.
$ zpool import -aN; echo $?
no pools available to import
0
$ zpool import -aN -d /var/tmp/; echo $?
no pools available to import
0
$ zpool import -aN -c /etc/zfs/zpool.cache; echo $?
no pools available to import
0
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2080
In order to avoid deadlocking in the IO pipeline it is critical that
pageout be avoided during direct memory reclaim. This ensures that
the pipeline threads can always make forward progress and never end
up blocking on a DMU transaction. For this very reason Linux now
provides the PF_FSTRANS flag which may be set in the process context.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This patch leverages Linux tracepoints from within the ZFS on Linux
code base. It also refactors the debug code to bring it back in sync
with Illumos.
The information exported via tracepoints can be used for a variety of
reasons (e.g. debugging, tuning, general exploration/understanding,
etc). It is advantageous to use Linux tracepoints as the mechanism to
export this kind of information (as opposed to something else) for a
number of reasons:
* A number of external tools can make use of our tracepoints
"automatically" (e.g. perf, systemtap)
* Tracepoints are designed to be extremely cheap when disabled
* It's one of the "accepted" ways to export this kind of
information; many other kernel subsystems use tracepoints too.
Unfortunately, though, there are a few caveats as well:
* Linux tracepoints appear to only be available to GPL licensed
modules due to the way certain kernel functions are exported.
Thus, to actually make use of the tracepoints introduced by this
patch, one might have to patch and re-compile the kernel;
exporting the necessary functions to non-GPL modules.
* Prior to upstream kernel version v3.14-rc6-30-g66cc69e, Linux
tracepoints are not available for unsigned kernel modules
(tracepoints will get disabled due to the module's 'F' taint).
Thus, one either has to sign the zfs kernel module prior to
loading it, or use a kernel versioned v3.14-rc6-30-g66cc69e or
newer.
Assuming the above two requirements are satisfied, lets look at an
example of how this patch can be used and what information it exposes
(all commands run as 'root'):
# list all zfs tracepoints available
$ ls /sys/kernel/debug/tracing/events/zfs
enable filter zfs_arc__delete
zfs_arc__evict zfs_arc__hit zfs_arc__miss
zfs_l2arc__evict zfs_l2arc__hit zfs_l2arc__iodone
zfs_l2arc__miss zfs_l2arc__read zfs_l2arc__write
zfs_new_state__mfu zfs_new_state__mru
# enable all zfs tracepoints, clear the tracepoint ring buffer
$ echo 1 > /sys/kernel/debug/tracing/events/zfs/enable
$ echo 0 > /sys/kernel/debug/tracing/trace
# import zpool called 'tank', inspect tracepoint data (each line was
# truncated, they're too long for a commit message otherwise)
$ zpool import tank
$ cat /sys/kernel/debug/tracing/trace | head -n35
# tracer: nop
#
# entries-in-buffer/entries-written: 1219/1219 #P:8
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss: hdr...
z_rd_int/0-30156 [003] .... 91344.200611: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.201173: zfs_arc__miss: hdr...
z_rd_int/1-30157 [003] .... 91344.201756: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.201795: zfs_arc__miss: hdr...
z_rd_int/2-30158 [003] .... 91344.202099: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.202126: zfs_arc__hit: hdr ...
lt-zpool-30132 [003] .... 91344.202130: zfs_arc__hit: hdr ...
lt-zpool-30132 [003] .... 91344.202134: zfs_arc__hit: hdr ...
lt-zpool-30132 [003] .... 91344.202146: zfs_arc__miss: hdr...
z_rd_int/3-30159 [003] .... 91344.202457: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.202484: zfs_arc__miss: hdr...
z_rd_int/4-30160 [003] .... 91344.202866: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.202891: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.203034: zfs_arc__miss: hdr...
z_rd_iss/1-30149 [001] .... 91344.203749: zfs_new_state__mru...
lt-zpool-30132 [001] .... 91344.203789: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.203878: zfs_arc__miss: hdr...
z_rd_iss/3-30151 [001] .... 91344.204315: zfs_new_state__mru...
lt-zpool-30132 [001] .... 91344.204332: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.204337: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.204352: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.204356: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.204360: zfs_arc__hit: hdr ...
To highlight the kind of detailed information that is being exported
using this infrastructure, I've taken the first tracepoint line from the
output above and reformatted it such that it fits in 80 columns:
lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss:
hdr {
dva 0x1:0x40082
birth 15491
cksum0 0x163edbff3a
flags 0x640
datacnt 1
type 1
size 2048
spa 3133524293419867460
state_type 0
access 0
mru_hits 0
mru_ghost_hits 0
mfu_hits 0
mfu_ghost_hits 0
l2_hits 0
refcount 1
} bp {
dva0 0x1:0x40082
dva1 0x1:0x3000e5
dva2 0x1:0x5a006e
cksum 0x163edbff3a:0x75af30b3dd6:0x1499263ff5f2b:0x288bd118815e00
lsize 2048
} zb {
objset 0
object 0
level -1
blkid 0
}
For the specific tracepoint shown here, 'zfs_arc__miss', data is
exported detailing the arc_buf_hdr_t (hdr), blkptr_t (bp), and
zbookmark_t (zb) that caused the ARC miss (down to the exact DVA!).
This kind of precise and detailed information can be extremely valuable
when trying to answer certain kinds of questions.
For anybody unfamiliar but looking to build on this, I found the XFS
source code along with the following three web links to be extremely
helpful:
* http://lwn.net/Articles/379903/
* http://lwn.net/Articles/381064/
* http://lwn.net/Articles/383362/
I should also node the more "boring" aspects of this patch:
* The ZFS_LINUX_COMPILE_IFELSE autoconf macro was modified to
support a sixth paramter. This parameter is used to populate the
contents of the new conftest.h file. If no sixth parameter is
provided, conftest.h will be empty.
* The ZFS_LINUX_TRY_COMPILE_HEADER autoconf macro was introduced.
This macro is nearly identical to the ZFS_LINUX_TRY_COMPILE macro,
except it has support for a fifth option that is then passed as
the sixth parameter to ZFS_LINUX_COMPILE_IFELSE.
These autoconf changes were needed to test the availability of the Linux
tracepoint macros. Due to the odd nature of the Linux tracepoint macro
API, a separate ".h" must be created (the path and filename is used
internally by the kernel's define_trace.h file).
* The HAVE_DECLARE_EVENT_CLASS autoconf macro was introduced. This
is to determine if we can safely enable the Linux tracepoint
functionality. We need to selectively disable the tracepoint code
due to the kernel exporting certain functions as GPL only. Without
this check, the build process will fail at link time.
In addition, the SET_ERROR macro was modified into a tracepoint as well.
To do this, the 'sdt.h' file was moved into the 'include/sys' directory
and now contains a userspace portion and a kernel space portion. The
dprintf and zfs_dbgmsg* interfaces are now implemented as tracepoint as
well.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Modify the code to use the utsname() kernel function rather than
a global variable. This results is cleaner more portable code
because utsname() is already provided by the kernel and can be
easily emulated in user space via uname(2). This means that it
will behave consistently in both contexts.
This is also has the benefit that it allows the removal of a few
_KERNEL pre-processor conditions. And it also is a pre-requisite
for a proper FUSE port because we need to provide a valid utsname.
Finally, it allows us to remove this functionality from the SPL
and all the related compatibility code.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2757
... as opposed to sending properties of all snapshots of the relevant
filesystem. The previous behavior results in properties being set on
all snapshots on the receiving side, which is quite slow.
Behavior of zfs send -R is not changed.
References:
http://thread.gmane.org/gmane.comp.file-systems.openzfs.devel/346
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2729
Issue #2210
FreeBSD PR kern/172259: Fixes zfs receive errors caused by snapshot
replication being processed in a random order instead of creation
order.
Eliminates needless filesystem renames caused by removed parent
snapshots which subsequently causes many more errors.
PR: kern/172259
Submitted by: Steven Hartland
Reviewed by: pjd (mentor)
Approved by: pjd (mentor)
MFC after: 2 weeks
References:
https://github.com/freebsd/freebsd/commit/4995789
Porting notes:
Minor whitespace fixes were made to conform with style requirements:
lib/libzfs/libzfs_sendrecv.c: 2269: indent by spaces instead of tabs
lib/libzfs/libzfs_sendrecv.c: 2270: indent by spaces instead of tabs
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2729
Creating virtual machines that have their rootfs on ZFS on hosts that
have their rootfs on ZFS causes SPA namespace collisions when the
standard name rpool is used. The solution is either to give each guest
pool a name unique to the host, which is not always desireable, or boot
a VM environment containing an ISO image to install it, which is
cumbersome.
26b42f3f9d introduced `zpool import -t
...` to simplify situations where a host must access a guest's pool when
there is a SPA namespace conflict. We build upon that to introduce
`zpool import -t tname ...`. That allows us to create a pool whose
in-core name is tname, but whose on-disk name is the normal name
specified.
This simplifies the creation of machine images that use a rootfs on ZFS.
That benefits not only real world deployments, but also ZFSOnLinux
development by decreasing the time needed to perform rootfs on ZFS
experiments.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2417
To aid in detecting and debugging stack overflow issues make the
user space stack limit configurable via a new ZFS_STACK_SIZE
environment variable. The value assigned to ZFS_STACK_SIZE will
be used as the default stack size in bytes.
Because this is mainly useful as a debugging aid in conjunction
with ztest the stack limit is disabled by default. See the ztest(1)
man page for additional details on using the ZFS_STACK_SIZE
environment variable.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#2743
Issue #2293
5116 zpool history -i goes into infinite loop
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: Boris Protopopov <boris.protopopov@me.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5116https://github.com/illumos/illumos-gate/commit/3339867
Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2715
This is intended to cause ztest to fail when rw_destroy() is called on a
rwlock that has owners.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2330
The 'zpool list -v' command displays lots of info but excludes the
capacity of each disk. This should be added.
5147 zpool list -v should show individual disk capacity
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Matthew Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5147https://github.com/illumos/illumos-gate/commit/7a09f97
Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2688
Remove an obsolete comment that refers to code removed by commit
79c6e4c4. The code and comment related to space consumed by guard
pages in user-space stacks, which we no longer take into account.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2722
When com.delphix:spacemap_histogram is disabled, the value of
fragmentation was printing as 18446744073709551615 (UINT64_MAX),
when it should print as '-'.
The issue was caused by a small mistake during the merge of
"4980 metaslabs should have a fragmentation metric."
upstream: https://github.com/illumos/illumos-gate/commit/2e4c998
ZoL: https://github.com/zfsonlinux/zfs/commit/f3a7f66
The problem is in zpool_get_prop_literal, where the handling of the
pool property ZPOOL_PROP_FRAGMENTATION was added to wrong the
section. In particular, ZPOOL_PROP_FRAGMENTATION should not be in
the section where zpool_get_state(zhp) == POOL_STATE_UNAVAIL, but
lower down after it's already been determined that the pool is in
fact available, which is where upstream illumos correctly has had
it.
Thanks to lundman for helping to track down this bug.
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2664
Providing a pkg-config file makes is easy for 3rd party applications
to link against the libzfs libraries. It also allows the libzfs
developers to modify the list of required libraries and cflags
without breaking existing applications.
The following example illustrates how pkg-config can be used:
cc `pkg-config --cflags --libs libzfs` -o myapp myapp.c
/*
* myapp.c
*/
void main()
{
libzfs_handle_t *hdl;
hdl = libzfs_init();
if (hdl)
libzfs_fini(hdl);
}
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #585
The HAVE_IOCTL_* configure checks were originally added for
compatibility with an ancient version of glibc. This support
and additional complexity is no longer needed and is therefore
being removed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Closes#585
Change efi_rescan() to loop 10 times instead of 5 on EBUSY and
to sleep at the end of each loop. This helps with some instances
where the kernel does not reload the partition table fast enough
for ZFS to detect.
Signed-off-by: Andrew Hamilton <ahamilto@tjhsst.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2493
4976 zfs should only avoid writing to a failing non-redundant top-level vdev
4978 ztest fails in get_metaslab_refcount()
4979 extend free space histogram to device and pool
4980 metaslabs should have a fragmentation metric
4981 remove fragmented ops vector from block allocator
4982 space_map object should proactively upgrade when feature is enabled
4983 need to collect metaslab information via mdb
4984 device selection should use fragmentation metric
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
https://www.illumos.org/issues/4976https://www.illumos.org/issues/4978https://www.illumos.org/issues/4979https://www.illumos.org/issues/4980https://www.illumos.org/issues/4981https://www.illumos.org/issues/4982https://www.illumos.org/issues/4983https://www.illumos.org/issues/4984https://github.com/illumos/illumos-gate/commit/2e4c998
Notes:
The "zdb -M" option has been re-tasked to display the new metaslab
fragmentation metric and the new "zdb -I" option is used to control
the maximum number of in-flight I/Os.
The new fragmentation metric is derived from the space map histogram
which has been rolled up to the vdev and pool level and is presented
to the user via "zpool list".
Add a number of module parameters related to the new metaslab weighting
logic.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2595
Add #ifndef PAGESIZE to avoid redefinition warning on platforms
where this value is already provided.
Signed-off-by: Alec Salazar <alec.j.salazar@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2588
Most of the code base already uses va_list, which is specified by
iso-c. gcc/glibc provides 'typedef __gnuc_va_list va_list'. and
when not using gcc/glibc we can't expect to find __gnuc_va_list.
Signed-off-by: Alec Salazar <alec.j.salazar@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2588
4914 zfs on-disk bookmark structure should be named *_phys_t
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/issues/4914https://github.com/illumos/illumos-gate/commit/7802d7b
Porting notes:
There were a number of zfsonlinux-specific uses of zbookmark_t which
needed to be updated. This should reduce the likelihood of further
problems like issue #2094 from occurring.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2558
4390 i/o errors when deleting filesystem/zvol can lead to space map corruption
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/4390https://github.com/illumos/illumos-gate/commit/7fd05ac
Porting notes:
Previous stack-reduction efforts in traverse_visitb() caused a fair
number of un-mergable pieces of code. This patch should reduce its
stack footprint a bit more.
The new local bptree_entry_phys_t in bptree_add() is dynamically-allocated
using kmem_zalloc() for the purpose of stack reduction.
The new global zfs_free_leak_on_eio has been defined as an integer
rather than a boolean_t as was the case with the related zfs_recover
global. Also, zfs_free_leak_on_eio's definition has been inserted into
zfs_debug.c for consistency with the existing definition of zfs_recover.
Illumos placed it in spa_misc.c.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2545
4757 ZFS embedded-data block pointers ("zero block compression")
4913 zfs release should not be subject to space checks
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/4757https://www.illumos.org/issues/4913https://github.com/illumos/illumos-gate/commit/5d7b4d4
Porting notes:
For compatibility with the fastpath code the zio_done() function
needed to be updated. Because embedded-data block pointers do
not require DVAs to be allocated the associated vdevs will not
be marked and therefore should not be unmarked.
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2544
4374 dn_free_ranges should use range_tree_t
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/4374https://github.com/illumos/illumos-gate/commit/bf16b11
Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2531
4171 clean up spa_feature_*() interfaces
4172 implement extensible_dataset feature for use by other zpool features
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Garrett D'Amore <garrett@damore.org>a
References:
https://www.illumos.org/issues/4171https://www.illumos.org/issues/4172https://github.com/illumos/illumos-gate/commit/2acef22
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2528
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer size sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increase. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101https://www.illumos.org/issues/4102https://www.illumos.org/issues/4103https://www.illumos.org/issues/4105https://www.illumos.org/issues/4106https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2488
Updates 962d524212.
The referenced fix to get_numeric_property() caused numeric property
lookups to consider the type of the parent (head) dataset when checking
validity but there are some cases in the caller expects to see the
property's default value even when the lookup is invalid.
One case in which this is true is change_one() which is part of the
renaming infrastructure. It may look up "zoned" on a snapshot of a volume
which is not valid but it expects to see the default value of false.
There may be other, yet unidentified cases in which zfs_prop_get_int()
is used on technically invalid properties but which expect the property's
default value to be returned.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Turbo Fredriksson <turbo@bayour.com>
Closes#2320
When adding or replacing a vdev with a different sector size the
error message should be more useful. In addition to describing
the problem provide a hint that the '-o ashift' option can be
used to override the optimal default value.
Since using a non-optimal value may incur a significant performance
penalty we should issue this error. But there a numerous reasons
why a administrator may wish to do this anyway.
Signed-off-by: Niklas Edmundsson <ZNikke@github>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2421
Clang's static analyzer reported a memory leak in zpool_clear_label().
Upon review, it turns out to be right. This should be a very short lived
leak because no daemons use this functionality, but that does not
preclude the possibility of third party daemons that do use it. Lets fix
it to be a good Samaritan.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2330
Resolve gcc 4.9.0 20140507 warnings about uninitialized 'ptr' when
using -Wmaybe-uninitialized. The first two cases appears appear
to be legitimate but not the second two. In general this is a
good practice so they are all initialized.
Signed-off-by: Marcel Huber <marcelhuberfoo@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2345
When fetching property values of snapshots, a check against the head
dataset type must be performed. Previously, this additional check was
performed only when fetching "version", "normalize", "utf8only" or "case".
This caused the ZPL properties "acltype", "exec", "devices", "nbmand",
"setuid" and "xattr" to be erroneously displayed with meaningless values
for snapshots of volumes. It also did not allow for the display of
"volsize" of a snapshot of a volume.
This patch adds the headcheck flag paramater to zfs_prop_valid_for_type()
and zprop_valid_for_type() to indicate the check is being done
against a head dataset's type in order that properties valid only for
snapshots are handled correctly. This allows the the head check in
get_numeric_property() to be performed when fetching a property for
a snapshot.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2265
This implements a subset of the LWP rwlock interface by wrapping the
equivalent POSIX thread interface. It is a superset of the features
needed by ztest.
The missing bits are {,_}rw_read_held() and {,_}rw_write_held().
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1970
zfs_is_mountable() fills in the mountpoint buffer, so, as in
upstream, it needs to have been called before the mountpoint
buffer can be used in error messages.
In particular,
return (zfs_error_fmt(hdl, EZFS_MOUNTFAILED,
dgettext(TEXT_DOMAIN, "cannot mount '%s'"),
mountpoint));
should not come before the call to zfs_is_mountable().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: ilovezfs <ilovezfs@icloud.com>
Closes#2284
Using the ARM reference simulation (fast model foundation v8) I
cross compiled spl and zfs, to confirm it works on ARMv8 (64 bit
arm architecture, called aarch64 in Linux).
As it is based on previous ARM porting, the resulting patch is
disappointingly small, there was very little to do. The code fixes
the compile issues and has light testing done.
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2260
Neither atime nor relatime should be considered to be "temporary mount
point properties". Their semantics are enforced completely within ZFS
and also they're (correctly) not documented as being temporary mount
point properties.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2257
If mkdirp() is called with a NULL ptr for the path arg, it will return
-1 with errno unchanged. This is unexpected since on error it should
return -1 and set errno to one of the error values listed for mkdir(2).
This commit sets errno = ENOENT for this NULL ptr case. This is in
accordance with the errors specified by mkdir(2):
ENOENT
A component of the path prefix does not exist or is a null pathname.
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2248
The format strings in efi_get_info() are intended to extract both the
main device and partition number. However, this is only done correctly
for hd, sd and vd devices. The format strings for ram, dm-, md and loop
devices misparse the input. This causes the partition device to be
incorrectly labelled as the main device with the partition being
labelled 0.
Reported-by: ilovezfs <ilovezfs@icloud.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2175
This is a continuation of fb5c53ea65:
When /etc/mtab is updated on Linux it's done atomically with
rename(2). A new mtab is written, the existing mtab is unlinked,
and the new mtab is renamed to /etc/mtab. This means that we
must close the old file and open the new file to get the updated
contents. Using rewind(3) will just move the file pointer back
to the start of the file, freopen(3) will close and open the file.
In this commit, a few more rewind(3) calls were replaced with freopen(3)
to allow updated mtab entries to be picked up immediately.
Signed-off-by: John M. Layman <jml@frijid.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2215
Issue #1611
Several of the zfs utilities allow you to pass a vdev's guid rather
than the device name. However, the utilities are not consistent in
how they parse that guid. For example, 'zinject' expects the guid
to be passed as a hex value while 'zpool replace' wants it as a
decimal. The user is forced to just know what format to use.
This patch improve things by making the parsing more tolerant.
When strtol(3) is called using 0 for the base, rather than say
10 or 16, it will then accept hex, decimal, or octal input based
on the prefix. From the man page.
If base is zero or 16, the string may then include a "0x"
prefix, and the number will be read in base 16; otherwise,
a zero base is taken as 10 (decimal) unless the next character
is '0', in which case it is taken as 8 (octal).
NOTE: There may be additional conversions not caught be this patch.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2
zpool_events_next() can be called in blocking mode by specifying a
non-zero value for the "block" parameter. However, the design of
the ZFS Event Daemon (zed) requires additional functionality from
zpool_events_next(). Instead of adding additional arguments to the
function, it makes more sense to use flags that can be bitwise-or'd
together.
This commit replaces the zpool_events_next() int "block" parameter with
an unsigned bitwise "flags" parameter. It also defines ZEVENT_NONE
to specify the default behavior. Since non-blocking mode can be
specified with the existing ZEVENT_NONBLOCK flag, the default behavior
becomes blocking mode. This, in effect, inverts the previous use
of the "block" parameter. Existing callers of zpool_events_next()
have been modified to check for the ZEVENT_NONBLOCK flag.
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2
Due to the very poorly chosen argument name 'cleanup_fd' it was
completely unclear that this file descriptor is used to track the
current cursor location. When the file descriptor is created by
opening ZFS_DEV a private cursor is created in the kernel for the
returned file descriptor. Subsequent calls to zpool_events_next()
and zpool_events_seek() then require the file descriptor as an
argument to reposition the cursor. When the file descriptor is
closed the kernel state tracking the cursor is destroyed.
This patch contains no functional change, it just changes a
few variable names and clarifies the documentation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2
The ZFS_IOC_EVENTS_SEEK ioctl was added to allow user space callers
to seek around the zevent file descriptor by EID. When a specific
EID is passed and it exists the cursor will be positioned there.
If the EID is no longer cached by the kernel ENOENT is returned.
The caller may also pass ZEVENT_SEEK_START or ZEVENT_SEEK_END to seek
to those respective locations.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Issue #2
Changing properties with "zfs inherit" should cause the datasets
to be remounted. This ensures that the modified property values
will be propagated in to the filesystem namespace where they can
be enforced. This change is modeled after an identical fix made
to zfs_prop_set().
Signed-off-by: Gunnar Beutner <gunnar@beutner.name>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2201
Valgrind suggests that the address we are returning is not properly
aligned, so lets add an assertion.
==87740== Address 0x1012a22a is 554 bytes inside a block of size 4,096
alloc'd
==87740== at 0x4C2BBA0: memalign (in
/usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==87740== by 0x4C2BCC7: posix_memalign (in
/usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
==87740== by 0x52FA845: zio_buf_alloc (umem.h:101)
==87740== by 0x52F6226: zil_alloc_lwb (zil.c:463)
==87740== by 0x52F8559: zil_commit (zil.c:566)
==87740== by 0x40611D: ztest_freeze (ztest.c:5909)
==87740== by 0x4066A7: ztest_init (ztest.c:6048)
==87740== by 0x407AF4: main (ztest.c:6226)
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2174
From time to time it may be necessary to inform the pool administrator
about an errata which impacts their pool. These errata will by shown
to the administrator through the 'zpool status' and 'zpool import'
output as appropriate. The errata must clearly describe the issue
detected, how the pool is impacted, and what action should be taken
to resolve the situation. Additional information for each errata will
be provided at http://zfsonlinux.org/msg/ZFS-8000-ER.
To accomplish the above this patch adds the required infrastructure to
allow the kernel modules to notify the utilities that an errata has
been detected. This is done through the ZPOOL_CONFIG_ERRATA uint64_t
which has been added to the pool configuration nvlist.
To add a new errata the following changes must be made:
* A new errata identifier must be assigned by adding a new enum value
to the zpool_errata_t type. New enums must be added to the end to
preserve the existing ordering.
* Code must be added to detect the issue. This does not strictly
need to be done at pool import time but doing so will make the
errata visible in 'zpool import' as well as 'zpool status'. Once
detected the spa->spa_errata member should be set to the new enum.
* If possible code should be added to clear the spa->spa_errata member
once the errata has been resolved.
* The show_import() and status_callback() functions must be updated
to include an informational message describing the errata. This
should include an action message describing what an administrator
should do to address the errata.
* The documentation at http://zfsonlinux.org/msg/ZFS-8000-ER must be
updated to describe the errata. This space can be used to provide
as much additional information as needed to fully describe the errata.
A link to this documentation will be automatically generated in the
output of 'zpool import' and 'zpool status'.
Original-idea-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <ryao@gentoo.or
Issue #2094
Add the "relatime" property. When set to "on", a file's atime will only
be updated if the existing atime at least a day old or if the existing
ctime or mtime has been updated since the last access. This behavior
is compatible with the Linux "relatime" mount option.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2064Closes#1917
Under Linux its possible to mount the same filesystem multiple
times in the namespace. This can be done either with bind mounts
or simply with multiple mount points. Unfortunately, the mnttab
cache code is implemented using an AVL tree which does not support
duplicate entries. To avoid this issue this patch updates the
code to check for a duplicate entry before adding a new one.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Michael Martin <mgmartin.mgm@gmail.com>
Closes#2041
Add the minimum required ISA types to support the Sparc
architecture.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: marku89 <mar42@kola.li>
Issue #1700
Removes the unconditional sharetab update when running any zfs
command. This means the sharetab might become out of date if
users are manually adding/removing shares with exportfs. But
we shouldn't punish all callers to zfs in order to handle that
unlikely case. In the unlikely event we observe issues because
of this it can always be added back to just the share/unshare
call paths where we need an up to date sharetab.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Issue #845
Four new dataset properties have been added to support SELinux. They
are 'context', 'fscontext', 'defcontext' and 'rootcontext' which map
directly to the context options described in mount(8). When one of
these properties is set to something other than 'none'. That string
will be passed verbatim as a mount option for the given context when
the filesystem is mounted.
For example, if you wanted the rootcontext for a filesystem to be set
to 'system_u:object_r:fs_t' you would set the property as follows:
$ zfs set rootcontext="system_u:object_r:fs_t" storage-pool/media
This will ensure the filesystem is automatically mounted with that
rootcontext. It is equivalent to manually specifying the rootcontext
with the -o option like this:
$ zfs mount -o rootcontext=system_u:object_r:fs_t storage-pool/media
By default all four contexts are set to 'none'. Further information
on SELinux contexts is detailed in mount(8) and selinux(8) man pages.
Signed-off-by: Matthew Thode <prometheanfire@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#1504
The vast majority of these changes are in Linux specific code.
They are the result of not having an automated style checker to
validate the code when it was originally written. Others were
caused when the common code was slightly adjusted for Linux.
This patch contains no functional changes. It only refreshes
the code to conform to style guide.
Everyone submitting patches for inclusion upstream should now
run 'make checkstyle' and resolve any warning prior to opening
a pull request. The automated builders have been updated to
fail a build if when 'make checkstyle' detects an issue.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1821
Add acl, noacl and posixacl to option_map, avoiding ENOENT error
case when mount from util-linux-2.24 execs mount.zfs with any of
those flags
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: renelson <bnelson@nelsonbe.com>
Issue #1968
Early versions of ZFS coordinated the creation and destruction
of device minors from userspace. This was inherently racy and
in late 2009 these ioctl()s were removed leaving everything up
to the kernel. This significantly simplified the code.
However, we never picked up these changes in ZoL since we'd
already significantly adjusted this code for Linux. This patch
aims to rectify that by finally removing ZFC_IOC_*_MINOR ioctl()s
and moving all the functionality down in to the kernel. Since
this cleanup will change the kernel/user ABI it's being done
in the same tag as the previous libzfs_core ABI changes. This
will minimize, but not eliminate, the disruption to end users.
Once merged ZoL, Illumos, and FreeBSD will basically be back
in sync in regards to handling ZVOLs in the common code. While
each platform must have its own custom zvol.c implemenation the
interfaces provided are consistent.
NOTES:
1) This patch introduces one subtle change in behavior which
could not be easily avoided. Prior to this change callers
of 'zfs create -V ...' were guaranteed that upon exit the
/dev/zvol/ block device link would be created or an error
returned. That's no longer the case. The utilities will no
longer block waiting for the symlink to be created. Callers
are now responsible for blocking, this is why a 'udev_wait'
call was added to the 'label' function in scripts/common.sh.
2) The read-only behavior of a ZVOL now solely depends on if
the ZVOL_RDONLY bit is set in zv->zv_flags. The redundant
policy setting in the gendisk structure was removed. This
both simplifies the code and allows us to safely leverage
set_disk_ro() to issue a KOBJ_CHANGE uevent. See the
comment in the code for futher details on this.
3) Because __zvol_create_minor() and zvol_alloc() may now be
called in a sync task they must use KM_PUSHPAGE.
References:
illumos/illumos-gate@681d9761e8
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#1969
Under Linux this restriction does not apply because we have access
to all the required devices.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1631
In the current snapshot automount implementation, it is possible for
multiple mounts to attempted concurrently. Only one of the mounts will
succeed and the other will fail. The failed mounts will cause an EREMOTE
to be propagated back to the application.
This commit works around the problem by adding a new exit status,
MOUNT_BUSY to the mount.zfs program which is used when the underlying
mount(2) call returns EBUSY. The zfs code detects this condition and
treats it as if the mount had succeeded.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1819
3996 want a libzfs_core API to rollback to latest snapshot
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andy Stormont <andyjstormont@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/3996illumos/illumos-gate@a7027df17f
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
3909 "zfs send -D" does not work
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>
References:
https://www.illumos.org/issues/3909illumos/illumos-gate@36f7455d36
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
3894 zfs should not allow snapshot of inconsistent dataset
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References:
https://www.illumos.org/issues/3894illumos/illumos-gate@ca48f36f20
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
3829 fix for 3740 changed behavior of zfs destroy/hold/release ioctl
Reviewed by: Matt Amdur <matt.amdur@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/3829illumos/illumos-gate@bb6e70758d
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
3818 zpool status -x should report pools with removed l2arc devices
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>
References:
https://www.illumos.org/issues/3818illumos/illumos-gate@7f2416ef64
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
3740 Poor ZFS send / receive performance due to snapshot
hold / release processing
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>
References:
https://www.illumos.org/issues/3740illumos/illumos-gate@a7a845e4bf
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
Porting notes:
1. 13fe019870 introduced a merge conflict
in dsl_dataset_user_release_tmp where some variables were moved
outside of the preprocessor directive.
2. dea9dfefdd747534b3846845629d2200f0616dad made the previous merge
conflict worse by switching KM_SLEEP to KM_PUSHPAGE. This is notable
because this commit refactors the code, adding a new KM_SLEEP
allocation. It is not clear to me whether this should be converted
to KM_PUSHPAGE.
3. We had a merge conflict in libzfs_sendrecv.c because of copyright
notices.
4. Several small C99 compatibility fixed were made.
3745 zpool create should treat -O mountpoint and -m the same
3811 zpool create -o altroot=/xyz -O mountpoint=/mnt ignores
the mountpoint option
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>
References:
https://www.illumos.org/issues/3745https://www.illumos.org/issues/3811illumos/illumos-gate@8b71377531
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
3741 zfs needs better comments
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>
References:
https://www.illumos.org/issues/3741illumos/illumos-gate@3e30c24aee
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
3699 zfs hold or release of a non-existent snapshot does not output error
3739 cannot set zfs quota or reservation on pool version < 22
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Eric Shrock <eric.schrock@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
https://www.illumos.org/issues/3699https://www.illumos.org/issues/3739illumos/illumos-gate@013023d4ed
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
3582 zfs_delay() should support a variable resolution
3584 DTrace sdt probes for ZFS txg states
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
https://www.illumos.org/issues/3582illumos/illumos-gate@0689f76
Ported by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
3537 want pool io kstats
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Sa?o Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References:
http://www.illumos.org/issues/3537illumos/illumos-gate@c3a6601
Ported by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Porting Notes:
1. The patch was restructured to take advantage of the existing
spa statistics infrastructure. To accomplish this the kstat
was moved in to spa->io_stats and the init/destroy code moved
to spa_stats.c.
2. The I/O kstat was simply named <pool> which conflicted with the
pool directory we had already created. Therefore it was renamed
to <pool>/io
3. An update handler was added to allow the kstat to be zeroed.
This change introduces zpool_get_prop_literal. It's an expanded version
of zpool_get_prop taking one additional boolean parameter. With this
parameter set to B_FALSE it will behave identically to zpool_get_prop.
Setting it to B_TRUE will return full precision numbers for the
following properties:
ZPOOL_PROP_SIZE
ZPOOL_PROP_ALLOCATED
ZPOOL_PROP_FREE
ZPOOL_PROP_FREEING
ZPOOL_PROP_EXPANDSZ
ZPOOL_PROP_ASHIFT
Also introduced is a wrapper function for zpool_get_prop making it
use zpool_get_prop_literal in the background.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1813
Currently there is no mechanism to inspect which dbufs are being
cached by the system. There are some coarse counters in arcstats
by they only give a rough idea of what's being cached. This patch
aims to improve the current situation by adding a new dbufs kstat.
When read this new kstat will walk all cached dbufs linked in to
the dbuf_hash. For each dbuf it will dump detailed information
about the buffer. It will also dump additional information about
the referenced arc buffer and its related dnode. This provides a
more complete view in to exactly what is being cached.
With this generic infrastructure in place utilities can be written
to post-process the data to understand exactly how the caching is
working. For example, the data could be processed to show a list
of all cached dnodes and how much space they're consuming. Or a
similar list could be generated based on dnode type. Many other
ways to interpret the data exist based on what kinds of questions
you're trying to answer.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
This change is an attempt to add visibility into the arc_read calls
occurring on a system, in real time. To do this, a list was added to the
in memory SPA data structure for a pool, with each element on the list
corresponding to a call to arc_read. These entries are then exported
through the kstat interface, which can then be interpreted in userspace.
For each arc_read call, the following information is exported:
* A unique identifier (uint64_t)
* The time the entry was added to the list (hrtime_t)
(*not* wall clock time; relative to the other entries on the list)
* The objset ID (uint64_t)
* The object number (uint64_t)
* The indirection level (uint64_t)
* The block ID (uint64_t)
* The name of the function originating the arc_read call (char[24])
* The arc_flags from the arc_read call (uint32_t)
* The PID of the reading thread (pid_t)
* The command or name of thread originating read (char[16])
From this exported information one can see, in real time, exactly what
is being read, what function is generating the read, and whether or not
the read was found to be already cached.
There is still some work to be done, but this should serve as a good
starting point.
Specifically, dbuf_read's are not accounted for in the currently
exported information. Thus, a follow up patch should probably be added
to export these calls that never call into arc_read (they only hit the
dbuf hash table). In addition, it might be nice to create a utility
similar to "arcstat.py" to digest the exported information and display
it in a more readable format. Or perhaps, log the information and allow
for it to be "replayed" at a later time.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When creating a new pool, or adding/replacing a disk in an existing
pool, partition tables will be automatically created on the devices.
Under normal circumstances it will take less than a second for udev
to create the expected device files under /dev/. However, it has
been observed that if the system is doing heavy IO concurrently udev
may take far longer. If you also throw in some cheap dodgy hardware
it may take even longer.
To prevent zpool commands from failing due to this the default wait
time for udev is being increased to 30 seconds. This will have no
impact on normal usage, the increase timeout should only be noticed
if your udev rules are incorrectly configured.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1646
The GNU libtool documentation states to start with a version of 0:0:0,
rather than 1:1:0. Illumos uses the name libzfs_core.so.1, so to be
consistent, we should go with 1:0:0.
http://www.gnu.org/software/libtool/manual/libtool.html#Updating-version-info
The GNU libtool documentation also provides guidence on how the version
information should be incremented. Doing this does a SONAME bump of the
libzfs and libzpool libraries. This is particularly important on Gentoo
because a SONAME bump enables portage to retain the older libraries
until any packages that link to them are rebuilt. The main example of
this is GRUB2's grub2-mkconfig, which will break unless it is rebuilt
against the new libraries.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1751
Libraries that depend on other libraries should list them in ELF's
DT_NEEDED field so that programs linking to them do not need to specify
those libraries unless they depend on them as well. This is not the case
in the current code and the consequence is that anything that needs a
library must know its dependencies. This is fragile and caused GRUB2's
configure script to break when a dependency was added on libblkid in
libzfs.
This resolves that problem by using LIBADD/LDADD to specify libraries in
Makefile.am instead of LDFLAGS. This ensures that proper DT_NEEDED
entries are generated and prevents GRUB2's configure script from
breaking in the presence of a libblkid dependency. This also removes
unneeded dependencies from various files.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1751
libblkid support is dormant because the autotools check is broken and
liblkid identifies ZFS vdevs as "zfs_member", not "zfs". We fix that
with a few changes:
First, we fix the libblkid autotools check to do a few things:
1. Make a 64MB file, which is the minimum size ZFS permits.
2. Make 4 fake uberblock entries to make libblkid's check succeed.
3. Return 0 upon success to make autotools use the success case.
4. Include stdlib.h to avoid implicit declration of free().
5. Check for "zfs_member", not "zfs"
6. Make --with-blkid disable autotools check (avoids Gentoo sandbox violation)
7. Pass '-lblkid' correctly using LIBS not LDFLAGS.
Second, we change the libblkid support to scan for "zfs_member", not
"zfs".
This makes --with-blkid work on Gentoo.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1751
2882 implement libzfs_core
2883 changing "canmount" property to "on" should not always remount dataset
2900 "zfs snapshot" should be able to create multiple, arbitrary snapshots at once
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Chris Siden <christopher.siden@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Dan Kruchinin <dan.kruchinin@gmail.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>
References:
https://www.illumos.org/issues/2882https://www.illumos.org/issues/2883https://www.illumos.org/issues/2900illumos/illumos-gate@4445fffbbb
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1293
Porting notes:
WARNING: This patch changes the user/kernel ABI. That means that
the zfs/zpool utilities built from master are NOT compatible with
the 0.6.2 kernel modules. Ensure you load the matching kernel
modules from master after updating the utilities. Otherwise the
zfs/zpool commands will be unable to interact with your pool and
you will see errors similar to the following:
$ zpool list
failed to read pool configuration: bad address
no pools available
$ zfs list
no datasets available
Add zvol minor device creation to the new zfs_snapshot_nvl function.
Remove the logging of the "release" operation in
dsl_dataset_user_release_sync(). The logging caused a null dereference
because ds->ds_dir is zeroed in dsl_dataset_destroy_sync() and the
logging functions try to get the ds name via the dsl_dataset_name()
function. I've got no idea why this particular code would have worked
in Illumos. This code has subsequently been completely reworked in
Illumos commit 3b2aab1 (3464 zfs synctask code needs restructuring).
Squash some "may be used uninitialized" warning/erorrs.
Fix some printf format warnings for %lld and %llu.
Apply a few spa_writeable() changes that were made to Illumos in
illumos/illumos-gate.git@cd1c8b8 as part of the 3112, 3113, 3114 and
3115 fixes.
Add a missing call to fnvlist_free(nvl) in log_internal() that was added
in Illumos to fix issue 3085 but couldn't be ported to ZoL at the time
(zfsonlinux/zfs@9e11c73) because it depended on future work.
reading the file, but instead use libzfs_mnttab_find() which does
the nessesary freopen() for us.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1498
When /etc/mtab is updated on Linux it's done atomically with
rename(2). A new mtab is written, the existing mtab is unlinked,
and the new mtab is renamed to /etc/mtab. This means that we
must close the old file and open the new file to get the updated
contents. Using rewind(3) will just move the file pointer back
to the start of the file, freopen(3) will close and open the file.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1611
3098 zfs userspace/groupspace fail without saying why when run as non-root
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/3098illumos/illumos-gate@70f56fa693
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1596
The zpool_read_label() function was subtly broken due to a
difference of behavior in fstat64(2) on Solaris vs Linux.
Under Solaris when a block device is stat'ed the st_size
field will contain the size of the device in bytes. Under
Linux this is only true for regular file and symlinks. A
compatibility function called fstat64_blk(2) was added
which can be used when the Solaris behavior is required.
This flaw was never noticed because the only time we would
need to use the device size is when the first two labels
are damaged. I noticed this issue while adding the
zpool_clear_label() function which is similar in design
and does require us to write all the labels.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The FreeBSD implementation of zfs adds the 'zpool labelclear'
command. Since this functionality is helpful and straight
forward to add it is being included in ZoL.
References:
freebsd/freebsd@119a041dc9
Ported-by: Dmitry Khasanov <pik4ez@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1126
This patch restores the zpool_clear_label() function from
OpenSolaris. This was removed by commit d603ed6 because
it wasn't clear we had a use for it in ZoL. However, this
functionality is a prerequisite for adding the 'zpool labelclear'
command from FreeBSD.
As part of bringing this change in the zpool_clear_label()
function was changed to use fstat64_blk(2) for compatibility
with Linux.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1126
The zpool command should not silently fail when the /etc/mtab
file does not exist. This can occur in an initramfs environment
when the /etc/mtab file hasn't yet been generated.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1541
3006 VERIFY[S,U,P] and ASSERT[S,U,P] frequently check if first
argument is zero
Reviewed by Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by George Wilson <george.wilson@delphix.com>
Approved by Eric Schrock <eric.schrock@delphix.com>
References:
illumos/illumos-gate@fb09f5aad4https://illumos.org/issues/3006
Requires:
zfsonlinux/spl@1c6d149feb
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1509
When calling mount(), care must be taken to avoid passing in flags
that are used only by the user space utilities. Otherwise we may
stomp on flags that are reserved for other purposes in the kernel.
In particular, openSUSE 12.3 kernels have added a new MS_RICHACL
super-block flag whose value conflicts with our MS_COMMENT flag. This
causes incorrect behavior such as the umask being ignored. The
MS_COMMENT flag essentially serves as a placeholder in the option_map
data structure of zfs_mount.c, but its value is never used. Therefore
we can avoid the conflict by defining it to 0.
The MS_USERS, MS_OWNER, and MS_GROUP flags also conflict with reserved
flags in the kernel. While this is not known to have caused any
problems, it is nevertheless incorrect. For the purposes of the
mount.zfs helper, the "users", "owner", and "group" options just serve
as hints to set additional implied options. Therefore we now define
their associated mount flags in terms of the options that they imply
rather than giving them unique values.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1457
The vn_rdwr() function performs I/O by calling the vfs_write() or
vfs_read() functions. These functions reside just below the system
call layer and the expectation is they have almost the entire 8k of
stack space to work with. In fact, certain layered configurations
such as ext+lvm+md+multipath require the majority of this stack to
avoid stack overflows.
To avoid this posibility the vn_rdwr() call in dump_bytes() has been
moved to the ZIO_TYPE_FREE, taskq. This ensures that all I/O will be
performed with the majority of the stack space available. This ends
up being very similiar to as if the I/O were issued via sys_write()
or sys_read().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1399Closes#1423
Recent changes have caused older versions of gcc to mistakenly
flag 'old_umask' in vn_open() as an unitialized variable. To
silence the warning initialize it.
kernel.c: In function 'vn_open':
kernel.c:525:6: error: 'old_umask' may be used uninitialized
in this function
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
NOTES: This patch has been reworked from the original in the
following ways to accomidate Linux ZFS implementation
*) Usage of the cyclic interface was replaced by the delayed taskq
interface. This avoids the need to implement new compatibility
code and allows us to rely on the existing taskq implementation.
*) An extern for zfs_txg_synctime_ms was added to sys/dsl_pool.h
because declaring externs in source files as was done in the
original patch is just plain wrong.
*) Instead of panicing the system when the deadman triggers a
zevent describing the blocked vdev and the first pending I/O
is posted. If the panic behavior is desired Linux provides
other generic methods to panic the system when threads are
observed to hang.
*) For reference, to delay zios by 30 seconds for testing you can
use zinject as follows: 'zinject -d <vdev> -D30 <pool>'
References:
illumos/illumos-gate@283b84606bhttps://www.illumos.org/issues/3246
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1396
3422 zpool create/syseventd race yield non-importable pool
3425 first write to a new zvol can fail with EFBIG
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
illumos/illumos-gate@bda8819455https://www.illumos.org/issues/3422https://www.illumos.org/issues/3425
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1390
Do basic sanity checks in smb_available() to verify that the 'net'
command and the usershare directory exists.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1124
The new snapdev dataset property may be set to control the
visibility of zvol snapshot devices. By default this value
is set to 'hidden' which will prevent zvol snapshots from
appearing under /dev/zvol/ and /dev/<dataset>/. When set to
'visible' all zvol snapshots for the dataset will be visible.
This functionality was largely added because when automatic
snapshoting is enabled large numbers of read-only zvol snapshots
will be created. When creating these devices the kernel will
attempt to read their partition tables, and blkid will attempt
to identify any filesystems on those partitions. This leads
to a variety of issues:
1) The zvol partition tables will be read in the context of
the `modprobe zfs` for automatically imported pools. This
is undesirable and should be done asynchronously, but for
now reducing the number of visible devices helps.
2) Udev expects to be able to complete its work for a new
block devices fairly quickly. When many zvol devices are
added at the same time this is no longer be true. It can
lead to udev timeouts and missing /dev/zvol links.
3) Simply having lots of devices in /dev/ can be aukward from
a management standpoint. Hidding the devices your unlikely
to ever use helps with this. Any snapshot device which is
needed can be made visible by changing the snapdev property.
NOTE: This patch changes the default behavior for zvols which
was effectively 'snapdev=visible'.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1235Closes#945
Issue #956
Issue #756
The machelf.h header is never included by anything in the zfs
build process. It is all effectively dead code which can be
safely removed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1265
binutils 2.23.1 fails in situations that generate function relocations
on PowerPC and possibly other architectures. This causes linking of
libzpool to fail because it depends on libnvpair. We add a dependency on
libnvpair to lib/libzpool/Makefile.am to correct that.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1267
In the interest of maintaining only one udev helper to give vdevs
user friendly names, the zpool_id and zpool_layout infrastructure
is being retired. They are superseded by vdev_id which incorporates
all the previous functionality.
Documentation for the new vdev_id(8) helper and its configuration
file, vdev_id.conf(5), can be found in their respective man pages.
Several useful example files are installed under /etc/zfs/.
/etc/zfs/vdev_id.conf.alias.example
/etc/zfs/vdev_id.conf.multipath.example
/etc/zfs/vdev_id.conf.sas_direct.example
/etc/zfs/vdev_id.conf.sas_switch.example
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#981
Commit 4b2f65b253 increased the user
space stack by 4x to resolve certain stack overflows. As such it
no longer makes sense to worry about a single extra page which
might or might not be part of the process stack. There is now
ample headroom for normal usage.
By eliminating this configure check we are also resolving the
following segfault which intentionally occurs at configure time
and may be logged in dmesg.
conftest[22156]: segfault at 7fbf18a47e48 ip 00000000004007fe
sp 00007fbf18a4be50 error 6 in conftest[400000+1000]
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
3035 LZ4 compression support in ZFS and GRUB
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Christopher Siden <csiden@delphix.com>
References:
illumos/illumos-gate@a6f561b4aehttps://www.illumos.org/issues/3035http://wiki.illumos.org/display/illumos/LZ4+Compression+In+ZFS
This patch has been slightly modified from the upstream Illumos
version to be compatible with Linux. Due to the very limited
stack space in the kernel a lz4 workspace kmem cache is used.
Since we are using gcc we are also able to take advantage of the
gcc optimized __builtin_ctz functions.
Support for GRUB has been dropped from this patch. That code
is available but those changes will need to made to the upstream
GRUB package.
Lastly, several hunks of dead code were dropped for clarity. They
include the functions real_LZ4_uncompress(), LZ4_compressBound()
and the Visual Studio specific hunks wrapped in _MSC_VER.
Ported-by: Eric Dillmann <eric@jave.fr>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1217
The differ() function used strerror_r() instead of strerror() because
it allowed the error message to be directly copied in to a buffer.
This causes two issues under Linux.
* There are two versions of strerror_r() available an XSI-compliant
version which returns an 'int' error code. And a GNU-specific
version which return a 'char *' to the resulting error string.
int strerror_r(int errnum, char *buf, size_t buflen); /* XSI */
char *strerror_r(int errnum, char *buf, size_t buflen); /* GNU */
* The most recent versions of strerror_r() are annotated with the
warn_unused_result attribute. This causes the following warning
since the upstream implementation casts the result to void.
warning: ignoring return value of 'strerror_r', declared with
attribute warn_unused_result [-Wunused-result]
The cleanest way to resolve both of these problems is just to use
strerror() and make a copy of the result in to the buffer. This
resolves both issues and this is the only instance of strerror_r()
in the code base.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1231
In the stream_bytes() library function used by `zfs diff`, explicitly
cast each byte in the input string to an unsigned character so that the
Linux fprintf() correctly escapes to octal and does not mangle the output.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1172
2762 zpool command should have better support for feature flags
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>
References:
illumos/illumos-gate@57221772c3https://www.illumos.org/issues/2762
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
3090 vdev_reopen() during reguid causes vdev to be treated as corrupt
3102 vdev_uberblock_load() and vdev_validate() may read the wrong label
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>
References:
illumos/illumos-gate@dfbb943217
illumos changeset: 13777:b1e53580146d
https://www.illumos.org/issues/3090https://www.illumos.org/issues/3102
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#939
2619 asynchronous destruction of ZFS file systems
2747 SPA versioning with zfs feature flags
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Dan Kruchinin <dan.kruchinin@gmail.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>
References:
illumos/illumos-gate@53089ab7c8illumos/illumos-gate@ad135b5d64
illumos changeset: 13700:2889e2596bd6
https://www.illumos.org/issues/2619https://www.illumos.org/issues/2747
NOTE: The grub specific changes were not ported. This change
must be made to the Linux grub packages.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fix setting/getting users/groups in quota properties through
numeric identifier. This support was accidentally disabled
in the original port by applying the HAVE_IDMAP wrapper macro
too broadly.
Fix obtained by moving #ifdef HAVE_IDMAP to exclude only
the part of code that really needs IDMAP. Now zfs (get|set)
(user|group)quota@1000 works as expected.
Signed-off-by: Massimo Maggi <massimo@mmmm.it>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1147
Incorrect syntax should never cause a segfault. In this case
listing multiple comma delimited options after '-o' triggered
the problem. For example:
zpool create -o ashift=12,listsnaps=on
This patch resolves the issue by wrapping the calls which use
hdr with a NULL test.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1118
Add the initial support for the 'smbshare' option using the
existing libshare infrastructure. Because this implementation
relies on usershares samba version 3.0.23 is required.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#493
Commit 57a4edd allows the bootfs property to be set on any pool.
However, many of the zpool commands still prevent you from using
EFI labeled devices for the root pool. For example:
# zpool attach rpool /dev/sda /dev/sdb
cannot label 'sdb': EFI labeled devices are not supported on
root pools. on root devices.
For non-Solaris builds such as Linux disable this error.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1077
When a zvol with snapshots is renamed the device files under
/dev/zvol/ are not renamed. This patch resolves the problem
by destroying and recreating the minors with the new name so
the links can be recreated bu udev.
Original-patch-by: Suman Chakravartula <schakrava@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#408
Create a kstat file which contains useful statistics about the
last N txgs processed. This can be helpful when analyzing pool
performance. The new KSTAT_TYPE_TXG type was added for this
purpose and it tracks the following statistics per-txg.
txg - Unique txg number
state - State (O)pen/(Q)uiescing/(S)yncing/(C)ommitted
birth; - Creation time
nread - Bytes read
nwritten; - Bytes written
reads - IOPs read
writes - IOPs write
open_time; - Length in nanoseconds the txg was open
quiesce_time - Length in nanoseconds the txg was quiescing
sync_time; - Length in nanoseconds the txg was syncing
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
During 'zpool import' all ZPOOL_CONFIG_PATH names are supposed
to be updated by fix_paths(). This was not happening for spare
and cache devices because the proper names were getting filtered
out of the pool_list_t->names. Interestingly, the names were
being filtered because the spare and cache devices do not
contain the pool name in their vdev label.
The fix is to exclude the device path from the list only if:
1) has a valid ZPOOL_CONFIG_POOL_NAME key in the label, and
2) that pool name does not match the specified pool name.
Since the label is valid and because it does properly store the
vdev guid it will be correctly assembled without the pool name.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#725
The 'zpool replace' command would fail when given a short name
because unlike on other platforms the short name cannot be
deterministically expanded to a single path. Multiple path
prefixes must be checked and in addition the partition suffix
for whole disks is determined by the prefix.
To handle this complexity a zfs_strcmp_pathname() function was
added which takes either a short or fully qualified device name.
Short names will be expanded using the prefixes in the default
import search path, or the ZPOOL_IMPORT_PATH environment variable
if it's defined. All posible expansions are then compared against
the comparison path. Care is taken to strip redundant slashes to
ensure legitimate matches are not missed.
In the context of this work the existing zfs_resolve_shortname()
function was extended to consider the ZPOOL_IMPORT_PATH when set.
The zfs_append_partition() interface was also simplified to take
only a single buffer.
The vast majority of these changes rework existing Linux specific
code which was originally written to accomidate udev. However,
there is some minimal cleanup which removes Illumos specific code.
This was done to improve readability but the basic flow and intent
of the upstream code was maintained.
These changes are the logical conclusion of the previos work to
adjust the 'zpool import' search behavior, see commit 44867b6a.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#544Closes#976
Both the SPL and the ZFS libspl export most of the atomic_* functions,
except atomic_sub_* functions which are only exported by the SPL, not by
libspl. This patch remedies that by implementing atomic_sub_* functions
in libspl.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1013
illumos/illumos-gate@2e2c135528
Illumos changeset: 13780:6da32a929222
3100 zvol rename fails with EBUSY when dirty
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Adam H. Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Eric Schrock <eric.schrock@delphix.com>
Ported-by: Etienne Dechamps <etienne.dechamps@ovh.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#995
Currently, thread_create(), when called in userspace, creates a
joinable (i.e. not detached thread). This is the pthread default.
Unfortunately, this does not reproduce kthreads behavior (kthreads
are always detached). In addition, this contradicts the original
Solaris code which creates userspace threads in detached mode.
These joinable threads are never joined, which leads to a leakage of
pthread thread objects ("zombie threads"). This in turn results in
excessive ressource consumption, and possible ressource exhaustion in
extreme cases (e.g. long ztest runs).
This patch fixes the issue by creating userspace threads in detached
mode. The only exception is ztest worker threads which are meant to be
joinable.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #989
Introduced by commit 44867b6d6e.
We should of course check to ensure best isn't NULL before
attempting to dereference it.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#974
The goal of this change is to make 'zpool import' prefer to use
the peristent /dev/mapper or /dev/disk/by-* paths. These are far
preferable to the devices in /dev/ whos names are not persistent
and are determined by the order in which a device is detected.
This patch improves things by changing the default search path from
just to the top level /dev/ directory to (in order):
/dev/disk/by-vdev - Custom rules, use first if they exist
/dev/disk/zpool - Custom rules, use first if they exist
/dev/mapper - Use multipath devices before components
/dev/disk/by-uuid - Single unique entry and persistent
/dev/disk/by-id - May be multiple entries and persistent
/dev/disk/by-path - Encodes physical location and persistent
/dev/disk/by-label - Custom persistent labels
/dev - UNSAFE device names will change
The default search path can be overriden by setting the
ZPOOL_IMPORT_PATH environment variable. This must be a colon
delimited list of paths which are searched for vdevs. If the
'zpool import -d' option is specified only those listed paths
will be searched.
Finally, when multiple paths to the same device are found. If one
of the paths is an exact match for the path used last time to import
the pool it will be used. When there are no exact matches the
prefered path will be determined by the provided search order.
This means you can still import a pool and force specific names by
providing the -d <path> option. And the prefered names will persist
as long as those paths exist on your system.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#965
Delay executing exportfs command until its results are actually
required.
Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Gunnar Beutner <gunnar@beutner.name>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
In 1e33ac1e26, the maximum stack size for
userspace tools was set to 8k to mimic the available kernel stack size.
Unfortunately, due to differences in how the stack is used in userspace
vs kernel space, spurious stack overflows could occur in userspace
tools due to the limited stack size. This is especially true in ztest
when debugging is enabled.
This patch multiplies the userspace stack size by 4, which fixes the
stack overflow issues. This comes at the price of not being able to
catch stack size issues in userspace, but the previous solution proved
unreliable anyway.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes#934.
Commit 858219c makes more sense down below in the 'if (verbose)'
section of the code. Initially, buf and path will never point
to the same location. Once 'path = buf' is set on a raidz vdev,
the code may drop into the verbose section depending on the
verbose flag. In here, using a tmpbuf makes sense since now
'buf == path'.
This issue does not occur in the upstream Solaris code because
their implementations of snprintf() allow for buf and path to
be the same address.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#57
Remove all of the generated autotools products from the repository
and update the .gitignore files accordingly.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#718
1796 "ZFS HOLD" should not be used when doing "ZFS SEND" from a read-only pool
2871 support for __ZFS_POOL_RESTRICT used by ZFS test suite
2903 zfs destroy -d does not work
2957 zfs destroy -R/r sometimes fails when removing defer-destroyed snapshot
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>
References:
https://www.illumos.org/issues/1796https://www.illumos.org/issues/2871https://www.illumos.org/issues/2903https://www.illumos.org/issues/2957
Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <George.Wilson@delphix.com>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/2635
Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#717
This regression was accidentally introduced by commit
330d06f90d due to ZoL
specific code. The fix is to simply ensure the passed
nvlist is initialized and freed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#876
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/1693
Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#678
Currently, zvols have a discard granularity set to 0, which suggests to
the upper layer that discard requests of arbirarily small size and
alignment can be made efficiently.
In practice however, ZFS does not handle unaligned discard requests
efficiently: indeed, it is unable to free a part of a block. It will
write zeros to the specified range instead, which is both useless and
inefficient (see dnode_free_range).
With this patch, zvol block devices expose volblocksize as their discard
granularity, so the upper layer is aware that it's not supposed to send
discard requests smaller than volblocksize.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#862
1644 add ZFS "clones" property
1645 add ZFS "written" and "written@..." properties
1646 "zfs send" should estimate size of stream
1647 "zfs destroy" should determine space reclaimed by
destroying multiple snapshots
1708 adjust size of zpool history data
References:
https://www.illumos.org/issues/1644https://www.illumos.org/issues/1645https://www.illumos.org/issues/1646https://www.illumos.org/issues/1647https://www.illumos.org/issues/1708
This commit modifies the user to kernel space ioctl ABI. Extra
care should be taken when updating to ensure both the kernel
modules and utilities are updated. This change has reordered
all of the new ioctl()s to the end of the list. This should
help minimize this issue in the future.
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Albert Lee <trisk@opensolaris.org>
Approved by: Garrett D'Amore <garret@nexenta.com>
Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#826Closes#664
When libzfs checks if the module is loaded or not, it currently reads
/proc/modules and searches for a line matching the module name.
Unfortunately, if the module is included in the kernel itself (built-in
module), then /proc/modules won't list it, so libzfs will wrongly conclude
that the module is not loaded, thus making all ZFS userspace tools unusable.
Fortunately, all loaded modules appear as directories in /sys/module, even
built-in ones. Thus we can use /sys/module in lieu of /proc/modules to fix
the issue.
As a bonus, the code for checking becomes much simpler.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #851
The end_writeback() function was changed by moving the call to
inode_sync_wait() earlier in to evict(). This effecitvely changes
the ordering of the sync but it does not impact the details of
the zfs implementation.
However, as part of this change end_writeback() was renamed to
clear_inode() to reflect the new semantics. This change does
impact us and clear_inode() now maps to end_writeback() for
kernels prior to 3.5.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#784
The vmtruncate_range() support has been removed from the kernel in
favor of using the fallocate method in the file_operations table.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #784
The export_operations member ->encode_fh() has been updated to
take both the child and parent inodes. This interface used to
take the child dentry and a bool describing if the parent is needed.
NOTE: While updating this code I noticed that we do not currently
cleanly handle the case where we're passed a connectable parent.
This code should be audited to make sure we're doing the right thing.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #784
Currently, zpool online -e (dynamic vdev expansion) doesn't work on
whole disks because we're invoking ioctl(BLKRRPART) from userspace
while ZFS still has a partition open on the disk, which results in
EBUSY.
This patch moves the BLKRRPART invocation from the zpool utility to the
module. Specifically, this is done just before opening the device in
vdev_disk_open() which is called inside vdev_reopen(). This requires
jumping through some hoops to get to the disk device from the partition
device, and to make sure we can still open the partition after the
BLKRRPART call.
Note that this new code path is triggered on dynamic vdev expansion
only; other actions, like creating a new pool, are unchanged and still
call BLKRRPART from userspace.
This change also depends on API changes which are available in 2.6.37
and latter kernels. The build system has been updated to detect this,
but there is no compatibility mode for older kernels. This means that
online expansion will NOT be available in older kernels. However, it
will still be possible to expand the vdev offline.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#808
This code was was inherited from Solaris which was careful to define
the expected VTOC for various supported architectures. While this
check may have made sense there it's something we should be able to
safely drop under Linux.
However, I'm not quite ready to do that yet. So for the moment I'm
just doing the very safe thing of adding PowerPC as a supported type.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Commit e5dc681a changed EFI_NUMPAR from 9 to 128. This means that the
on-disk EFI label has efi_nparts = 128 instead of 9. The index of the
reserved partition, however, is still 8. This breaks
efi_use_whole_disk(), which uses efi_nparts-1 as the index of the
reserved partition.
This commit fixes efi_use_whole_disk() when the index of the reserved
partition is not efi_nparts-1. It rewrites the algorithm and makes it
more robust by using the order of the partitions instead of their
numbering. It assumes that the last non-empty partition is the reserved
partition, and that the non-empty partition before that is the data
partition.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #808
Currently, zpool_vdev_online() calls zpool_relabel_disk() with a short
partition device name, which is obviously wrong because (1)
zpool_relabel_disk() expects a full, absolute path to use with open()
and (2) efi_write() must be called on an opened disk device, not a
partition device.
With this patch, zpool_relabel_disk() gets called with a full disk
device path. The path is determined using the same algorithm as
zpool_find_vdev().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #808
The error handling code around zpool_relabel_disk() is either inexistent
or wrong. The function call itself is not checked, and
zpool_relabel_disk() is generating error messages from an unitialized
buffer.
Before:
# zpool online -e homez sdb; echo $?
`: cannot relabel 'sdb1': unable to open device: 2
0
After:
# zpool online -e homez sdb; echo $?
cannot expand sdb: cannot relabel 'sdb1': unable to open device: 2
1
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #808
1949 crash during reguid causes stale config
1953 allow and unallow missing from zpool history since removal of pyzfs
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Garrett D'Amore <garrett.damore@gmail.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Steve Gonczi <gonczi@comcast.net>
Approved by: Eric Schrock <eric.schrock@delphix.com>
References:
https://www.illumos.org/issues/1949https://www.illumos.org/issues/1953
Ported by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#665
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Alexander Eremin <alexander.eremin@nexenta.com>
Reviewed by: Alexander Stetsenko <ams@nexenta.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/1748
This commit modifies the user to kernel space ioctl ABI. Extra
care should be taken when updating to ensure both the kernel
modules and utilities are updated. If only the user space
component is updated both the 'zpool events' command and the
'zpool reguid' command will not work until the kernel modules
are updated.
Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#665
FreeBSD #xxx: Dramatically optimize listing snapshots when user
requests only snapshot names and wants to sort them by name, ie.
when executes:
# zfs list -t snapshot -o name -s name
Because only name is needed we don't have to read all snapshot
properties.
Below you can find how long does it take to list 34509 snapshots
from a single disk pool before and after this change with cold and
warm cache:
before:
# time zfs list -t snapshot -o name -s name > /dev/null
cold cache: 525s
warm cache: 218s
after:
# time zfs list -t snapshot -o name -s name > /dev/null
cold cache: 1.7s
warm cache: 1.1s
NOTE: This patch only appears in FreeBSD. If/when Illumos picks up
the change we may want to drop this patch and adopt their version.
However, for now this addresses a real issue.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #450
When failing to remove a zvol device link because it's busy, wait
a bit and retry in a loop instead of giving up immediately. This
technique is similar to the loop in zpool_label_disk_wait(), with
the same goal: waiting for the asynchronous udev processes to finish
their work.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#692
torvalds/linux@adc0e91ab1 introduced
introduced d_make_root() as a replacement for d_alloc_root(). Further
commits appear to have removed d_alloc_root() from the Linux source
tree. This causes the following failure:
error: implicit declaration of function 'd_alloc_root'
[-Werror=implicit-function-declaration]
To correct this we update the code to use the current d_make_root()
interface for readability. Then we introduce an autotools check
to determine if d_make_root() is available. If it isn't then we
define some compatibility logic which used the older d_alloc_root()
interface.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#776
When a device is already open O_EXCL by another process the
`zpool import` will correctly fail. However, the default failure
message isn't very helpful. It may in fact be harmful if you
take its advise and destroy your pool.
cannot import 'tank': pool is busy
Destroy and re-create the pool from
a backup source.
Improve the error message in the EBUSY case to simply print a
message indicating that the devices are current in use. The user
will need to manually identify which process has the device open
exclusively and why.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When creating pools short device names may be used when those
devices appear in certain well known locations under /dev/.
This change adds /dev/mapper/ to that list.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
vdev_id parses the file /etc/zfs/vdev_id.conf to map a physical path
in a storage topology to a channel name. The channel name is combined
with a disk enclosure slot number to create an alias that reflects the
physical location of the drive. This is particularly helpful when it
comes to tasks like replacing failed drives. Slot numbers may also be
re-mapped in case the default numbering is unsatisfactory. The drive
aliases will be created as symbolic links in /dev/disk/by-vdev.
The only currently supported topologies are sas_direct and sas_switch:
o sas_direct - a channel is uniquely identified by a PCI slot and a
HBA port
o sas_switch - a channel is uniquely identified by a SAS switch port
A multipath mode is supported in which dm-mpath devices are handled by
examining the first running component disk, as reported by 'multipath
-l'. In multipath mode the configuration file should contain a
channel definition with the same name for each path to a given
enclosure.
vdev_id can replace the existing zpool_id script on systems where the
storage topology conforms to sas_direct or sas_switch. The script
could be extended to support other topologies as well. The advantage
of vdev_id is that it is driven by a single static input file that can
be shared across multiple nodes having a common storage toplogy.
zpool_id, on the other hand, requires a unique /etc/zfs/zdev.conf per
node and a separate slot-mapping file. However, zpool_id provides the
flexibility of using any device names that show up in
/dev/disk/by-path, so it may still be needed on some systems.
vdev_id's functionality subsumes that of the sas_switch_id script, and
it is unlikely that anyone is using it, so sas_switch_id is removed.
Finally, /dev/disk/by-vdev is added to the list of directories that
'zpool import' will scan.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#713
The mode argument of iops->create()/mkdir()/mknod() was changed from
an 'int' to a 'umode_t'. To prevent a compiler warning an autoconf
check was added to detect the API change and then correctly set a
zpl_umode_t typedef. There is no functional change.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#701
Allow rigorous (and expensive) tx validation to be enabled/disabled
indepentantly from the standard zfs debugging. When enabled these
checks ensure that all txs are constructed properly and that a dbuf
is never dirtied without taking the correct tx hold.
This checking is particularly helpful when adding new dmu consumers
like Lustre. However, for established consumers such as the zpl
with no known outstanding tx construction problems this is just
overhead.
--enable-debug-dmu-tx - Enable/disable validation of each tx as
--disable-debug-dmu-tx it is constructed. By default validation
is disabled due to performance concerns.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add support for the .zfs control directory. This was accomplished
by leveraging as much of the existing ZFS infrastructure as posible
and updating it for Linux as required. The bulk of the core
functionality is now all there with the following limitations.
*) The .zfs/snapshot directory automount support requires a 2.6.37
or newer kernel. The exception is RHEL6.2 which has backported
the d_automount patches.
*) Creating/destroying/renaming snapshots with mkdir/rmdir/mv
in the .zfs/snapshot directory works as expected. However,
this functionality is only available to root until zfs
delegations are finished.
* mkdir - create a snapshot
* rmdir - destroy a snapshot
* mv - rename a snapshot
The following issues are known defeciences, but we expect them to
be addressed by future commits.
*) Add automount support for kernels older the 2.6.37. This should
be possible using follow_link() which is what Linux did before.
*) Accessing the .zfs/snapshot directory via NFS is not yet possible.
The majority of the ground work for this is complete. However,
finishing this work will require resolving some lingering
integration issues with the Linux NFS kernel server.
*) The .zfs/shares directory exists but no futher smb functionality
has yet been implemented.
Contributions-by: Rohan Puri <rohan.puri15@gmail.com>
Contributiobs-by: Andrew Barnes <barnes333@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#173
Some devices have exhibited sensitivity to the ending alignment of
partitions. In particular, even if the first partition begins at 1
MiB, we have seen many sd driver task abort errors with certain SSDs
if the first partition doesn't end on a 1 MiB boundary. This occurs
when the vdev label is read during pool creation or importation and
causes a delay of about 30 seconds per device. It can also be
simulated with dd when the pool isn't imported:
dd if=/dev/sda1 of=/dev/null bs=262144 count=1
For the record, this problem was observed with SMARTMOD
SG9XCA2E200GE01 200GB SSDs. Unfortunately I don't have a good
explanation for this behavior. It seems to have something to do with
highly fragmented single-sector requests being issued to the device,
which it may not support. With end-aligned partitions at least
page-sized requests were queued and issued to the driver according
to blktrace. In any case, aligning the partition end is a fairly
innocuous work-around, wasting at most 1 MiB of space.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#574
Allow a source rpm to be rebuilt with debugging enabled. This
avoids the need to have to manually modify the spec file. By
default debugging is still largely disabled. To enable specific
debugging features use the following options with rpmbuild.
'--with debug' - Enables ASSERTs
# For example:
$ rpmbuild --rebuild --with debug zfs-modules-0.6.0-rc6.src.rpm
Additionally, ZFS_CONFIG has been added to zfs_config.h for
packages which build against these headers. This is critical
to ensure both zfs and the dependant package are using the same
prototype and structure definitions.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
New SSDs are now available which use an internal 8k block size.
To make sure ZFS can get the maximum performance out of these
devices we're increasing the maximum ashift to 13 (8KB).
This value is still small enough that we can fit 16 uberblocks
in the vdev ring label. However, I don't want to increase this
any futher or it will limit the ability the safely roll back a
pool to recover it.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#565
Resolves nfs-utils-1.0.x compatibility issue which requires
that the fsid be set in the export options.
exportfs: Warning: /tank/dir requires fsid= for NFS export
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#570
DISCARD (REQ_DISCARD, BLKDISCARD) is useful for thin provisioning.
It allows ZVOL clients to discard (unmap, trim) block ranges from
a ZVOL, thus optimizing disk space usage by allowing a ZVOL to
shrink instead of just grow.
We can't use zfs_space() or zfs_freesp() here, since these functions
only work on regular files, not volumes. Fortunately we can use the
low-level function dmu_free_long_range() which does exactly what we
want.
Currently the discard operation is not added to the log. That's not
a big deal since losing discard requests cannot result in data
corruption. It would however result in disk space usage higher than
it should be. Thus adding log support to zvol_discard() is probably
a good idea for a future improvement.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Currently only the (FALLOC_FL_PUNCH_HOLE) flag combination is
supported, since it's the only one that matches the behavior of
zfs_space(). This makes it pretty much useless in its current
form, but it's a start.
To support other flag combinations we would need to modify
zfs_space() to make it more flexible, or emulate the desired
functionality in zpl_fallocate().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #334
The Linux block device queue subsystem exposes a number of configurable
settings described in Linux block/blk-settings.c. The defaults for these
settings are tuned for hard drives, and are not optimized for ZVOLs. Proper
configuration of these options would allow upper layers (I/O scheduler) to
take better decisions about write merging and ordering.
Detailed rationale:
- max_hw_sectors is set to unlimited (UINT_MAX). zvol_write() is able to
handle writes of any size, so there's no reason to impose a limit. Let the
upper layer decide.
- max_segments and max_segment_size are set to unlimited. zvol_write() will
copy the requests' contents into a dbuf anyway, so the number and size of
the segments are irrelevant. Let the upper layer decide.
- physical_block_size and io_opt are set to the ZVOL's block size. This
has the potential to somewhat alleviate issue #361 for ZVOLs, by warning
the upper layers that writes smaller than the volume's block size will be
slow.
- The NONROT flag is set to indicate this isn't a rotational device.
Although the backing zpool might be composed of rotational devices, the
resulting ZVOL often doesn't exhibit the same behavior due to the COW
mechanisms used by ZFS. Setting this flag will prevent upper layers from
making useless decisions (such as reordering writes) based on incorrect
assumptions about the behavior of the ZVOL.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
zvol_write() assumes that the write request must be written to stable storage
if rq_is_sync() is true. Unfortunately, this assumption is incorrect. Indeed,
"sync" does *not* mean what we think it means in the context of the Linux
block layer. This is well explained in linux/fs.h:
WRITE: A normal async write. Device will be plugged.
WRITE_SYNC: Synchronous write. Identical to WRITE, but passes down
the hint that someone will be waiting on this IO
shortly.
WRITE_FLUSH: Like WRITE_SYNC but with preceding cache flush.
WRITE_FUA: Like WRITE_SYNC but data is guaranteed to be on
non-volatile media on completion.
In other words, SYNC does not *mean* that the write must be on stable storage
on completion. It just means that someone is waiting on us to complete the
write request. Thus triggering a ZIL commit for each SYNC write request on a
ZVOL is unnecessary and harmful for performance. To make matters worse, ZVOL
users have no way to express that they actually want data to be written to
stable storage, which means the ZIL is broken for ZVOLs.
The request for stable storage is expressed by the FUA flag, so we must
commit the ZIL after the write if the FUA flag is set. In addition, we must
commit the ZIL before the write if the FLUSH flag is set.
Also, we must inform the block layer that we actually support FLUSH and FUA.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Autoconf will fail to detect the ZoL libnvpair on systems that do not
implicitly link library runtime dependencies, which is anything that
has the GCC 4.5 DCO update.
Build libuutil before libnvpair, and put it on the the LDADD line of
the libnvpair automake template.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #560
The second argument of sops->show_options() was changed from a
'struct vfsmount *' to a 'struct dentry *'. Add an autoconf check
to detect the API change and then conditionally define the expected
interface. In either case we are only interested in the zfs_sb_t.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#549
Historically the internal zfs debug infrastructure has been
scattered throughout the code. Since we expect to start making
more use of this code this patch performs some cleanup.
* Consolidate the zfs debug infrastructure in the zfs_debug.[ch]
files. This includes moving the zfs_flags and zfs_recover
variables, plus moving the zfs_panic_recover() function.
* Remove the existing unused functionality in zfs_debug.c and
replace it with code which correctly utilized the spl logging
infrastructure.
* Remove the __dprintf() function from zfs_ioctl.c. This is
dead code, the dprintf() functionality in the kernel relies
on the spl log support.
* Remove dprintf() from hdr_recl(). This wasn't particularly
useful and was missing the required format specifier anyway.
* Subsequent patches should unify the dprintf() and zfs_dbgmsg()
functions.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Since the zpios and potentially other ZFS tests use the
DMU_OST_OTHER type to label their datasets, the zpool and
zfs commands should gracefully handle this type when it is
encountered. This patch modifies the commands' behavior
to ignore any datasets with a dds_type of DMU_OST_OTHER.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#536
Commit zfsonlinux/zfs@57a4eddc4d
allows the bootfs property to be set on any pool, but does not
accommodate subsequent vdev changes. For example:
# zpool replace rpool /dev/sda /dev/sdb
operation not supported on this type of pool
property 'bootfs' is not supported on EFI labeled devices
For non-Solaris builds, disable the check that emits this error.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
These libraries, which are an artifact of the ZoL development
process, conflict with packages that are already in distribution:
* libspl: SPL Programming Language
* libavl: AVL for Linux
* libefi: GRUB
And these libraries are potential conflicts:
* libshare: the Linux Mount Manager
* libunicode: Perl and Python
Recompose these five ZoL components into the four libraries that are
conventionally provided by Solaris and FreeBSD systems:
+ libnvpair
+ libuutil
+ libzpool
+ libzfs
This change resolves the name conflict, makes ZoL more compatible
with existing software that uses autotools to detect ZFS, and allows
pkg-zfs to better reflect the official Debian kFreeBSD packaging.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #430
Linux supports mounting over non-empty directories by default.
In Solaris this is not the case and -O option is required for
zfs mount to mount a zfs filesystem over a non-empty directory.
For compatibility, I've added support for -O option to mount
zfs filesystems over non-empty directories if the user wants
to, just like in Solaris.
I've defined MS_OVERLAY to record it in the flags variable if
the -O option is supplied. The flags variable passes through
a few functions and its checked before performing the empty
directory check in zfs_mount function. If -O is given, the
check is not performed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#473
The Linux 3.1 kernel has introduced the concept of per-filesystem
shrinkers which are directly assoicated with a super block. Prior
to this change there was one shared global shrinker.
The zfs code relied on being able to call the global shrinker when
the arc_meta_limit was exceeded. This would cause the VFS to drop
references on a fraction of the dentries in the dcache. The ARC
could then safely reclaim the memory used by these entries and
honor the arc_meta_limit. Unfortunately, when per-filesystem
shrinkers were added the old interfaces were made unavailable.
This change adds support to use the new per-filesystem shrinker
interface so we can continue to honor the arc_meta_limit. The
major benefit of the new interface is that we can now target
only the zfs filesystem for dentry and inode pruning. Thus we
can minimize any impact on the caching of other filesystems.
In the context of making this change several other important
issues related to managing the ARC were addressed, they include:
* The dnlc_reduce_cache() function which was called by the ARC
to drop dentries for the Posix layer was replaced with a generic
zfs_prune_t callback. The ZPL layer now registers a callback to
drop these dentries removing a layering violation which dates
back to the Solaris code. This callback can also be used by
other ARC consumers such as Lustre.
arc_add_prune_callback()
arc_remove_prune_callback()
* The arc_reduce_dnlc_percent module option has been changed to
arc_meta_prune for clarity. The dnlc functions are specific to
Solaris's VFS and have already been largely eliminated already.
The replacement tunable now represents the number of bytes the
prune callback will request when invoked.
* Less aggressively invoke the prune callback. We used to call
this whenever we exceeded the arc_meta_limit however that's not
strictly correct since it results in over zeleous reclaim of
dentries and inodes. It is now only called once the arc_meta_limit
is exceeded and every effort has been made to evict other data from
the ARC cache.
* More promptly manage exceeding the arc_meta_limit. When reading
meta data in to the cache if a buffer was unable to be recycled
notify the arc_reclaim thread to invoke the required prune.
* Added arcstat_prune kstat which is incremented when the ARC
is forced to request that a consumer prune its cache. Remember
this will only occur when the ARC has no other choice. If it
can evict buffers safely without invoking the prune callback
it will.
* This change is also expected to resolve the unexpect collapses
of the ARC cache. This would occur because when exceeded just the
arc_meta_limit reclaim presure would be excerted on the arc_c
value via arc_shrink(). This effectively shrunk the entire cache
when really we just needed to reclaim meta data.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#466Closes#292
Directly changing inode->i_nlink is deprecated in Linux 3.2 by commit
SHA: bfe8684869601dacfcb2cd69ef8cfd9045f62170
Use the new set_nlink() kernel function instead.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #462
Added the necessary build infrastructure for building packages
compatible with the Arch Linux distribution. As such, one can now run:
$ ./configure
$ make pkg # Alternatively, one can run 'make arch' as well
on the Arch Linux machine to create two binary packages compatible with
the pacman package manager, one for the zfs userland utilities and
another for the zfs kernel modules. The new packages can then be
installed by running:
# pacman -U $package.pkg.tar.xz
In addition, source-only packages suitable for an Arch Linux chroot
environment or remote builder can also be build using the 'sarch' make
rule.
NOTE: Since the source dist tarball is created on the fly from the head
of the build tree, it's MD5 hash signature will be continually influx.
As a result, the md5sum variable was intentionally omitted from the
PKGBUILD files, and the '--skipinteg' makepkg option is used. This may
or may not have any serious security implications, as the source tarball
is not being downloaded from an outside source.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#491
It has been observed that some of the hottest locks are those
of the zio taskqs. Contention on these locks can limit the
rate at which zios are dispatched which limits performance.
This upstream change from Illumos uses new interface to the
taskqs which allow them to utilize a prealloc'ed taskq_ent_t.
This removes the need to perform an allocation at dispatch
time while holding the contended lock. This has the effect
of improving system performance.
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Alexey Zaytsev <alexey.zaytsev@nexenta.com>
Reviewed by: Jason Brian King <jason.brian.king@gmail.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References to Illumos issue:
https://www.illumos.org/issues/734
Ported-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#482
Some of the functions' purpose wasn't immediately obvious without
additional explanations. This commit adds these missing comments.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
While setting/getting userquota and groupquota properties, the input
was not treated as a possible username or groupname if it had a
leading digit. While useradd in linux recommends the regexp
[a-z_][a-z0-9_-]*[$]? , it is not enforced. This causes problem for
usernames with leading digits in them. We need to be able to support
getting and setting properties for this unconventional but possible
input category
I've updated the code to validate the username or groupname directly
via the API. Also, note that I moved this validation to the beginning
before the check for SID names with @. This also supports usernames
with @ character in them which are valid. Only when input with @ is
not a valid username, it is interpreted as a potential SID name.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#428
While we initially allowed you to set your ashift as large as 17
(SPA_MAXBLOCKSIZE) that is actually unsafe. What wasn't considered
at the time is that each uberblock written to the vdev label ring
buffer will be of this size. Now the buffer is statically sized
to 128k and we need to be able to fit several uberblocks in it.
With a large ashift that becomes a problem.
Therefore I'm reducing the maximum configurable ashift value to 12.
This is large enough for the 4k sector drives and small enough that
we can still keep the most recent 32 uberblock in the vdev label
ring buffer.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#425
Update the code to use the bdi_setup_and_register() helper to
simplify the bdi integration code. The updated code now just
registers the bdi during mount and destroys it during unmount.
The only complication is that for 2.6.32 - 2.6.33 kernels the
helper wasn't available so in these cases the zfs code must
provide it. Luckily the bdi_setup_and_register() function
is trivial.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#367
GPT's created by libefi set the HeaderSize attribute in the GPT
header to 512 -- size of the GPT header INCLUDING the 420 padding
bytes at the end. Most other tools set the size to 92 -- size of
the actual header itself excluding the padding. Most tools check
the recorded HeaderSize when verifying CRC, but gptfdisk hardcodes
92 and thus reports CRC verification problems for full-disk vdevs
created IE with `zpool create pool sdc`.
This commit changes libefi's behavior for GPT creation and also
fixes several edge cases where libefi's behavior was similar
(though in an incompatible manner) to gptfdisk. Libefi assumed
HeaderSize was always 512 even if the GPT recorded a different
value. Sanity checks of the GPT headersize read from disk were
added before applying checksum calculation -- this will prevent
segfault in cases of bogus on-disk values.
Zpools created with the resuling libefi are verified as correct
both by parted and gptfdisk. Also pool have been tested to
import correctly on ZFS on Linux as well as Solaris Express 11
livecd.
Signed-off-by: Zachary Bedell <zac@thebedells.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#344
For a long time now the kernel has been moving away from using the
pdflush daemon to write 'old' dirty pages to disk. The primary reason
for this is because the pdflush daemon is single threaded and can be
a limiting factor for performance. Since pdflush sequentially walks
the dirty inode list for each super block any delay in processing can
slow down dirty page writeback for all filesystems.
The replacement for pdflush is called bdi (backing device info). The
bdi system involves creating a per-filesystem control structure each
with its own private sets of queues to manage writeback. The advantage
is greater parallelism which improves performance and prevents a single
filesystem from slowing writeback to the others.
For a long time both systems co-existed in the kernel so it wasn't
strictly required to implement the bdi scheme. However, as of
Linux 2.6.36 kernels the pdflush functionality has been retired.
Since ZFS already bypasses the page cache for most I/O this is only
an issue for mmap(2) writes which must go through the page cache.
Even then adding this missing support for newer kernels was overlooked
because there are other mechanisms which can trigger writeback.
However, there is one critical case where not implementing the bdi
functionality can cause problems. If an application handles a page
fault it can enter the balance_dirty_pages() callpath. This will
result in the application hanging until the number of dirty pages in
the system drops below the dirty ratio.
Without a registered backing_device_info for the filesystem the
dirty pages will not get written out. Thus the application will hang.
As mentioned above this was less of an issue with older kernels because
pdflush would eventually write out the dirty pages.
This change adds a backing_device_info structure to the zfs_sb_t
which is already allocated per-super block. It is then registered
when the filesystem mounted and unregistered on unmount. It will
not be registered for mounted snapshots which are read-only. This
change will result in flush-<pool> thread being dynamically created
and destroyed per-mounted filesystem for writeback.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#174
This should simplify the code a bit by re-using existing code
to fork and exec a process.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #190
Simply closing the stdout and/or stderr file descriptors for
the child process can have bad side effects if for example
the child writes to stdout/stderr after open()ing a file.
The open() call might have returned the same file descriptor
one would usually expect for stdout/stderr (1 and 2), thereby
causing mis-directed writes.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #190
At the moment we call exportfs -v every time we check whether an
NFS share is active. This happens every time you run a zfs or
zpool command, making them extremely slow when you have a lot of
exports. The time taken is approx O(n2) of the number of shares.
This commit stores the output from exportfs -v in a temporary file
and use this to speed up subsequent accesses.
This mechanism is still too slow - if you have tens of thousands
of NFS shares it will still be painful running ANY zfs/zpool
command.
Signed-off-by: Gunnar Beutner <gunnar@beutner.name>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#341
Add a "REFRATIO" property, which is the compression ratio based on
data referenced. For snapshots, this is the same as COMPRESSRATIO,
but for filesystems/volumes, the COMPRESSRATIO is based on the
data "USED" (ie, includes blocks in children, but not blocks
shared with the origin).
This is needed to figure out how much space a filesystem would
use if it were not compressed (ignoring snapshots).
Reviewed by: George Wilson <George.Wilson@delphix.com>
Reviewed by: Adam Leventhal <Adam.Leventhal@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Mark Musante <Mark.Musante@oracle.com>
Reviewed by: Garrett D'Amore <garrett@nexenta.com>
Approved by: Garrett D'Amore <garrett@nexenta.com>
References to Illumos issue and patch:
- https://www.illumos.org/issues/1092
- https://github.com/illumos/illumos-gate/commit/187d6ac08a
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
Unlike most other Linux distributions archlinux installs its
init scripts in /etc/rc.d insead of /etc/init.d. This commit
provides an archlinux rc.d script for zfs and extends the
build infrastructure to ensure it get's installed in the
correct place.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#322
The zfs libraries were never properly versioned. Since the API has
remained static for quite some time this we never an issue. However,
going forward they should be versioned. This commit versions all
of the libraries to 1.0.0. From here on out this version must be
updated to reflect changes to the library.
Drop usage of dlopen/dlsym for libshare. There is no need to do
this because the zfs packages provide libshare. Unlike on Solaris
we are guaranteed it will be available.
This avoids possible problems with hardcoding the libshare path in
the code (e.g. when users specify a different install path via
configure options). It additionally simplifies the code which is
good for maintainability.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The sharenfs and sharesmb properties depend on the libshare library
to export datasets via NFS and SMB. This commit implements the base
libshare functionality as well as support for managing NFS shares.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The lib/libspl/include/libgen.h header file was being mistakenly
left out of the 'make dist' tarball. It just happens this doesn't
cause a build failure when creating packages because the system
libgen/h is included instead. This simply results in the following
warning due to the missing forward declaration of mkdirp().
../../lib/libzfs/libzfs_mount.c:417:3: warning: implicit declaration
of function 'mkdirp' [-Wimplicit-function-declaration]
The .get_sb callback has been replaced by a .mount callback
in the file_system_type structure. When using the new
interface the caller must now use the mount_nodev() helper.
Unfortunately, the new interface no longer passes the vfsmount
down to the zfs layers. This poses a problem for the existing
implementation because we currently save this pointer in the
super block for latter use. It provides our only entry point
in to the namespace layer for manipulating certain mount options.
This needed to be done originally to allow commands like
'zfs set atime=off tank' to work properly. It also allowed me
to keep more of the original Solaris code unmodified. Under
Solaris there is a 1-to-1 mapping between a mount point and a
file system so this is a fairly natural thing to do. However,
under Linux they many be multiple entries in the namespace
which reference the same filesystem. Thus keeping a back
reference from the filesystem to the namespace is complicated.
Rather than introduce some ugly hack to get the vfsmount and
continue as before. I'm leveraging this API change to update
the ZFS code to do things in a more natural way for Linux.
This has the upside that is resolves the compatibility issue
for the long term and fixes several other minor bugs which
have been reported.
This commit updates the code to remove this vfsmount back
reference entirely. All modifications to filesystem mount
options are now passed in to the kernel via a '-o remount'.
This is the expected Linux mechanism and allows the namespace
to properly handle any options which apply to it before passing
them on to the file system itself.
Aside from fixing the compatibility issue, removing the
vfsmount has had the benefit of simplifying the code. This
change which fairly involved has turned out nicely.
Closes#246Closes#217Closes#187Closes#248Closes#231
The security_inode_init_security() function now takes an additional
qstr argument which must be passed in from the dentry if available.
Passing a NULL is safe when no qstr is available the relevant
security checks will just be skipped.
Closes#246Closes#217Closes#187
The inode eviction should unmap the pages associated with the inode.
These pages should also be flushed to disk to avoid the data loss.
Therefore, use truncate_setsize() in evict_inode() to release the
pagecache.
The API truncate_setsize() was added in 2.6.35 kernel. To ensure
compatibility with the old kernel, the patch defines its own
truncate_setsize function.
Signed-off-by: Prasad Joshi <pjoshi@stec-inc.com>
Closes#255