Compare commits

2282 Commits

Author SHA1 Message Date
Tony Hutter 1222e921c9 Tag zfs-0.8.2
META file and changelog updated.

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
2019-09-25 11:27:51 -07:00
Kody A Kantor c37fa0d5a8 Disabled resilver_defer feature leads to looping resilvers
When a disk is replaced with another on a pool where the resilver_defer
feature is present but not enabled, the resilver activity restarts during
each spa_sync. This patch checks that the resilver_defer feature is
enabled before requesting a deferred resilver.

This was originally fixed in illumos-joyent as OS-7982.

Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Signed-off-by: Kody A Kantor <kody@kkantor.com>
External-issue: illumos-joyent OS-7982
Closes #9299
Closes #9338
2019-09-25 11:27:51 -07:00
Andriy Gapon 12a78fbb4f Fix dsl_scan_ds_clone_swapped logic
The logic was incorrect with respect to swapping dataset IDs both in the
on-disk ZAP object and the in-memory queue.

In both cases, if ds1 was already present, then it would be first
replaced with ds2 and then ds2 would be replaced back with ds1.
Also, both cases did not properly handle a situation where both ds1 and
ds2 are already queued.  A duplicate insertion would be attempted and
its failure would result in a panic.
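
The cases described above (only ds1 queued, only ds2 queued, or both queued) can be sketched with a small userspace model. This is an illustration under assumptions, not the actual dsl_scan code: `queue_t` and its helpers are hypothetical stand-ins for the in-memory scan queue.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define	QUEUE_MAX	16

/* Hypothetical stand-in for the scan queue: a small set of dataset IDs. */
typedef struct {
	uint64_t ids[QUEUE_MAX];
	size_t count;
} queue_t;

static bool
queue_contains(const queue_t *q, uint64_t id)
{
	for (size_t i = 0; i < q->count; i++)
		if (q->ids[i] == id)
			return (true);
	return (false);
}

/* Insert an ID; returns false on a duplicate or a full queue. */
static bool
queue_insert(queue_t *q, uint64_t id)
{
	if (queue_contains(q, id) || q->count == QUEUE_MAX)
		return (false);
	q->ids[q->count++] = id;
	return (true);
}

static void
queue_remove(queue_t *q, uint64_t id)
{
	for (size_t i = 0; i < q->count; i++) {
		if (q->ids[i] == id) {
			q->ids[i] = q->ids[--q->count];
			return;
		}
	}
}

/*
 * Swap ds1 and ds2 in the queue. Membership is sampled up front so the
 * second step never observes the result of the first one.
 */
static void
queue_clone_swapped(queue_t *q, uint64_t ds1, uint64_t ds2)
{
	assert(q != NULL);

	bool had1 = queue_contains(q, ds1);
	bool had2 = queue_contains(q, ds2);

	if (had1 && had2)
		return;		/* both queued: swapping is a no-op */
	if (had1) {
		queue_remove(q, ds1);
		(void) queue_insert(q, ds2);
	} else if (had2) {
		queue_remove(q, ds2);
		(void) queue_insert(q, ds1);
	}
}
```

Sampling both memberships before modifying anything avoids the replace-then-replace-back behavior and never attempts a duplicate insertion.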

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Andriy Gapon <avg@FreeBSD.org>
Closes #9140
Closes #9163
2019-09-25 11:27:51 -07:00
loli10K 63d8f57fe7 Scrubbing root pools may deadlock on kernels without elevator_change() (#9321)
Originally the zfs_vdev_elevator module option was added as a
convenience so the requested elevator would be automatically set
on the underlying block devices. At the time this was simple
because the kernel provided an API function which did exactly this.

This API was then removed in the Linux 4.12 kernel, which prompted
us to add compatibility code to set the elevator via a usermodehelper.

Unfortunately, changing the elevator via usermodehelper requires reading
some userland binaries, most notably modprobe(8) or sh(1), from a zfs
dataset on systems with root-on-zfs. This can deadlock the system if
used during the following call path because, if the data is not already
cached in the ARC, it may need to read directly from disk while
holding the spa config lock as a writer:

  zfs_ioc_pool_scan()
    -> spa_scan()
      -> spa_scan()
        -> vdev_reopen()
          -> vdev_elevator_switch()
            -> call_usermodehelper()

While the usermodehelper waits for sh(1), modprobe(8) is blocked in the
ZIO pipeline trying to read from disk:

  INFO: task modprobe:2650 blocked for more than 10 seconds.
       Tainted: P           OE     5.2.14
  modprobe        D    0  2650    206 0x00000000
  Call Trace:
   ? __schedule+0x244/0x5f0
   schedule+0x2f/0xa0
   cv_wait_common+0x156/0x290 [spl]
   ? do_wait_intr_irq+0xb0/0xb0
   spa_config_enter+0x13b/0x1e0 [zfs]
   zio_vdev_io_start+0x51d/0x590 [zfs]
   ? tsd_get_by_thread+0x3b/0x80 [spl]
   zio_nowait+0x142/0x2f0 [zfs]
   arc_read+0xb2d/0x19d0 [zfs]
   ...
   zpl_iter_read+0xfa/0x170 [zfs]
   new_sync_read+0x124/0x1b0
   vfs_read+0x91/0x140
   ksys_read+0x59/0xd0
   do_syscall_64+0x4f/0x130
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

This commit changes how we use the usermodehelper functionality from
synchronous (UMH_WAIT_PROC) to asynchronous (UMH_NO_WAIT) which prevents
scrubs, and other vdev_elevator_switch() consumers, from triggering the
aforementioned issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Issue #8664
Closes #9321
2019-09-25 11:27:51 -07:00
Chengfei ZHu 9fa8b5b55b QAT related bug fixes
1. Fix issue #9276: Kernel BUG with QAT during decompression.
   A given QAT request is now uninterruptible, but a Ctrl-C
   interrupt still works in the user-space process.

2. Copy the digest result to the buffer only when doing encryption,
   and vice versa for decryption.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chengfei Zhu <chengfeix.zhu@intel.com>
Closes #9276
Closes #9303
2019-09-25 11:27:51 -07:00
Brian Behlendorf e17445d1f7 kmodtool: depmod path
Determine the location of depmod on the system, either /sbin/depmod or
/usr/sbin/depmod.  Then use that path when generating the specfile.

Additionally, update the Requires lines to reference the package which
provides depmod rather than the binary itself.  For CentOS/RHEL 7+8
and all supported Fedora releases this is the kmod package, and for
CentOS/RHEL 6 it is the module-init-tools package.

Reviewed-by: Minh Diep <mdiep@whamcloud.com>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8724
Closes #9310
2019-09-25 11:27:51 -07:00
Brian Behlendorf 97d4986214 Fix /etc/hostid on root pool deadlock
Accidentally introduced by dc04a8c which now takes the SCL_VDEV lock
as a reader in zfs_blkptr_verify().  A deadlock can occur if the
/etc/hostid file resides on a dataset in the same pool.  This is
because reading the /etc/hostid file may occur while the caller is
holding the SCL_VDEV lock as a writer.  For example, this occurs when
performing a `zpool attach`, as shown in the abbreviated stack below.

To resolve the issue we cache the system's hostid when initializing
the spa_t, or when modifying the multihost property.  The cached
value is then relied upon for subsequent accesses.

Call Trace:
    spa_config_enter+0x1e8/0x350 [zfs]
    zfs_blkptr_verify+0x33c/0x4f0 [zfs] <--- trying read lock
    zio_read+0x6c/0x140 [zfs]
    ...
    vfs_read+0xfc/0x1e0
    kernel_read+0x50/0x90
    ...
    spa_get_hostid+0x1c/0x38 [zfs]
    spa_config_generate+0x1a0/0x610 [zfs]
    vdev_label_init+0xa0/0xc80 [zfs]
    vdev_create+0x98/0xe0 [zfs]
    spa_vdev_attach+0x14c/0xb40 [zfs] <--- grabbed write lock
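
The caching approach described above can be sketched in userspace as follows. This is a minimal model under assumptions: `pool_state_t`, `read_hostid_from_file()`, and the example hostid value are hypothetical and do not reflect the real spa_t layout or kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Counts how often the hypothetical slow path runs. */
static unsigned int hostid_file_reads;

/* Hypothetical slow path: stands in for reading /etc/hostid from disk. */
static unsigned long
read_hostid_from_file(void)
{
	hostid_file_reads++;
	return (0x00bab10cUL);	/* arbitrary example value */
}

/* Hypothetical pool state; the real spa_t has many more fields. */
typedef struct {
	unsigned long cached_hostid;
} pool_state_t;

/* Read the hostid once, at a point where no config locks are held. */
static void
pool_state_init(pool_state_t *p)
{
	assert(p != NULL);
	p->cached_hostid = read_hostid_from_file();
}

/* Later callers use the cached value and never touch the file. */
static unsigned long
pool_state_get_hostid(const pool_state_t *p)
{
	assert(p != NULL);
	return (p->cached_hostid);
}
```

Because the getter performs no I/O, it is safe to call while a writer lock is held.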

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9256
Closes #9285
2019-09-25 11:27:51 -07:00
Olaf Faaland 0ae5f0c8d2 BuildRequires libtirpc-devel needed for RHEL 8
Building against RHEL 8 requires libtirpc-devel, as with fedora 28.
Add rhel8 and centos8 options to the test, to account for that.

BuildRequires originally added for fedora 28 via commit
1a62a305be

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #9289
2019-09-25 11:27:51 -07:00
loli10K 146d7d8846 Fix zpool subcommands error message with some unsupported options
Both 'detach' and 'online' zpool subcommands, when provided with an
unsupported option, forget to print it in the error message:

   # zpool online -t rpool vda3
   invalid option ''
   usage:
      online [-e] <pool> <device> ...

This change fixes the error message in order to include the actual
option that is not supported.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #9270
2019-09-25 11:27:51 -07:00
loli10K 9f261b1be6 Fix zfs-dkms .deb package warning in prerm script
Debian zfs-dkms package generated by alien doesn't call the prerm script
(rpm's %preun) with an integer as first parameter, which results in the
following warning when the package is uninstalled:

   "zfs-dkms.prerm: line 3: [: remove: integer expression expected"

Modify the if-condition to avoid the warning.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #9271
2019-09-25 11:27:51 -07:00
Pavel Zakharov 5acba22ec0 zvol_wait script should ignore partially received zvols
Partially received zvols won't have links in /dev/zvol.

Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Closes #9260
2019-09-25 11:27:51 -07:00
Pavel Zakharov 38528476bf New service that waits on zvol links to be created
The zfs-volume-wait.service scans existing zvols and waits for their
links under /dev to be created. Any service that depends on zvol
links to be there should add a dependency on zfs-volumes.target.
By default, this target is not enabled.

Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Antonio Russo <antonio.e.russo@gmail.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pavel Zakharov <pzakharov@delphix.com>
Closes #8975
2019-09-25 11:27:51 -07:00
Andriy Gapon beb21db3c6 Always refuse receiving non-resume stream when resume state exists
This fixes a hole in the situation where the resume state is left from
receiving a new dataset and, so, the state is set on the dataset itself
(as opposed to %recv child).

Additionally, distinguish incremental and resume streams in error
messages.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andriy Gapon <avg@FreeBSD.org>
Closes #9252
2019-09-25 11:27:51 -07:00
loli10K 13e5e396a3 Fix Intel QAT / ZFS compatibility on v4.7.1+ kernels
This change uses the compat code introduced in 9cc1844a.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #9268
Closes #9269
2019-09-25 11:27:51 -07:00
Georgy Yakovlev 3cf4ecb03f etc/init.d/zfs-functions.in: remove arch warning
Remove the x86_64 warning; it is no longer the case that this is the
only supported architecture.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Georgy Yakovlev <gyakovlev@gentoo.org>
Closes: #9177
2019-09-25 11:27:51 -07:00
Pavel Zakharov 0e765c4eb8 zfs_handle used after being closed/freed in change_one callback
This is a typical case of use after free. We would call zfs_close(zhp)
which would free the handle, and then call zfs_iter_children() on that
handle later.  This change ensures that the zfs_handle is only closed
when we are ready to return.

Running `zfs inherit -r sharenfs pool` was failing with an error
code without any error messages. After some debugging I've pinpointed
the issue to be memory corruption, which would cause zfs to try to
issue an ioctl to the wrong device and receive ENOTTY.
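
The pattern of closing the handle only when ready to return can be sketched as below. The `handle_t` type and its helpers are hypothetical simplifications, not the libzfs zfs_handle_t API.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical handle type; not the real libzfs zfs_handle_t. */
typedef struct handle {
	int nchildren;
} handle_t;

static handle_t *
handle_open(int nchildren)
{
	handle_t *hp = malloc(sizeof (*hp));

	if (hp != NULL)
		hp->nchildren = nchildren;
	return (hp);
}

static void
handle_close(handle_t *hp)
{
	free(hp);
}

/* Stand-in for iterating over children; reads from the handle. */
static int
handle_iter_children(const handle_t *hp)
{
	return (hp->nchildren);
}

/*
 * Callback that consumes the handle.  Every path funnels through the
 * single "out" label, so the handle is never used after being closed.
 */
static int
change_one(handle_t *hp, int *visited)
{
	int err = 0;

	assert(hp != NULL && visited != NULL);
	if (hp->nchildren < 0) {
		err = -1;
		goto out;
	}
	/* The handle is still open here, so iterating is safe. */
	*visited += handle_iter_children(hp);
out:
	handle_close(hp);	/* single close, just before returning */
	return (err);
}
```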

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Issue #7967
Closes #9165
2019-09-25 11:27:51 -07:00
Chunwei Chen c7a4255f12 Fix zil replay panic when TX_REMOVE followed by TX_CREATE
If TX_REMOVE is followed by TX_CREATE on the same object id, we need to
make sure the object removal is completely finished before creation. The
current implementation relies on dnode_hold_impl with
DNODE_MUST_BE_ALLOCATED returning ENOENT. While this check seemed to work
fine before, in the current version it does not guarantee that the object
removal is completed.

We fix this by checking whether DNODE_MUST_BE_FREE returns successfully
instead. Also add a test and remove dead code in dnode_hold_impl.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #7151
Closes #8910
Closes #9123
Closes #9145
2019-09-25 11:27:51 -07:00
Andriy Gapon 931bef81c8 zfs_ioc_snapshot: check user-prop permissions on snapshotted datasets
Previously, the permissions were checked on the pool which was obviously
incorrect.

After this change, zfs_check_userprops() only validates the properties
without any permission checks.  The permissions are checked individually
for each snapshotted dataset.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Andriy Gapon <avg@FreeBSD.org>
Closes #9179
Closes #9180
2019-09-25 11:27:50 -07:00
Richard Allen ea34735203 Fix Plymouth passphrase prompt in initramfs script
Entering the ZFS encryption passphrase under Plymouth wasn't working
because in the ZFS initrd script, Plymouth was calling zfs via
"--command", which wasn't passing through the filesystem argument to
zfs load-key properly (it was passing through the single quotes around
the filesystem name intended to handle spaces literally,
which zfs load-key couldn't understand).

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Garrett Fields <ghfields@gmail.com>
Signed-off-by: Richard Allen <belperite@gmail.com>
Issue #9193
Closes #9202
2019-09-25 11:27:50 -07:00
Tom Caputi 95319fc569 Fix deadlock in 'zfs rollback'
Currently, the 'zfs rollback' code can end up deadlocked due to
the way the kernel handles unreferenced inodes on a suspended fs.
Essentially, the zfs_resume_fs() code path may cause zfs to spawn
new threads as it reinstantiates the suspended fs's zil. When a
new thread is spawned, the kernel may attempt to free memory for
that thread by freeing some unreferenced inodes. If it happens to
select inodes that are a part of the suspended fs, a deadlock
will occur because freeing inodes requires holding the fs's
z_teardown_inactive_lock which is still held from the suspend.

This patch corrects this issue by adding an additional reference
to all inodes that are still present when a suspend is initiated.
This prevents them from being freed by the kernel for any reason.

Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #9203
2019-09-25 11:27:50 -07:00
Ryan Moeller 33374f21f0 Make slog test setup more robust
The slog tests fail when attempting to create pools using file vdevs
that already exist from previous test runs. Remove these files in the
setup for the test.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes #9194
2019-09-25 11:27:50 -07:00
yshui 512a50f38d zfs-mount-generator: dependencies should be space-separated
Reviewed-by: Antonio Russo <antonio.e.russo@gmail.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Closes #9174
2019-09-25 11:27:50 -07:00
Tony Hutter 023ab67a64 Linux 5.3: Fix switch() fall through compiler errors
Fix some switch() fall-through compiler errors:

    abd.c:1504:9: error: this statement may fall through

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #9170
2019-09-25 11:27:50 -07:00
Dominic Pearson 65469f6e30 Linux 5.3 compat: Makefile subdir-m no longer supported
Uses obj-m instead, due to kernel changes.

See LKML: Masahiro Yamada, Tue, 6 Aug 2019 19:03:23 +0900

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Dominic Pearson <dsp@technoanimal.net>
Closes #9169
2019-09-25 11:27:50 -07:00
Chunwei Chen 569f5d5d05 Fix out-of-order ZIL txtype lost on hardlinked files
We should only call zil_remove_async when an object is removed. However,
in the current implementation, it is called whenever TX_REMOVE is called. In
the case of a hardlinked file, every unlink generates TX_REMOVE, causing
operations to be dropped even when the object is not removed.

We fix this by only calling zil_remove_async when the file is fully
unlinked.
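
The rule above can be modeled with a link counter. This is a sketch under assumptions: `file_t` and `log_remove()` are hypothetical, and the `async_purges` counter merely stands in for a call to zil_remove_async().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical file state: link count plus a counter for purges. */
typedef struct {
	uint64_t links;
	unsigned int async_purges;
} file_t;

/*
 * Record one unlink.  Pending async records are dropped only when the
 * last link goes away.  Returns true if the file was fully unlinked.
 */
static bool
log_remove(file_t *fp)
{
	bool unlinked;

	assert(fp != NULL);
	if (fp->links == 0)
		return (false);
	fp->links--;
	unlinked = (fp->links == 0);
	if (unlinked)
		fp->async_purges++;	/* stands in for zil_remove_async() */
	return (unlinked);
}
```

With two hard links, the first unlink leaves the pending operations alone, and only the second one purges them.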

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #8769
Closes #9061
2019-09-25 11:27:50 -07:00
Michael Niewöhner 6d1599c1e1 Increase default zcmd allocation to 256K
When creating hundreds of clones (for example using containers with
LXD) cloning slows down as the number of clones increases over time.
The reason for this is that the fetching of the clone information
using a small zcmd buffer requires two ioctl calls, one to determine
the size and a second to return the data. However, this requires
gathering the data twice, once to determine the size and again to
populate the zcmd buffer to return it to userspace.
These are expensive ioctl() calls, so instead, make the default buffer
size much larger: 256K.
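
The cost difference can be illustrated with a userspace model in which `fetch()` is a hypothetical stand-in for the expensive ioctl. None of these names come from libzfs; the sketch only shows why a larger initial buffer avoids the second call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Counts calls to the hypothetical expensive operation. */
static int fetch_calls;

/*
 * Copy the payload into buf if it fits.  *needed always receives the
 * required size.  Returns 0 on success, -1 if buf was too small.
 */
static int
fetch(const char *payload, size_t payload_len, char *buf, size_t buflen,
    size_t *needed)
{
	fetch_calls++;
	*needed = payload_len;
	if (buflen < payload_len)
		return (-1);
	memcpy(buf, payload, payload_len);
	return (0);
}

/*
 * Try with an initial buffer and retry once with the reported size if
 * it was too small.  Returns the number of fetch() calls made, or -1
 * on allocation failure.
 */
static int
fetch_with_default(const char *payload, size_t payload_len, size_t initial)
{
	size_t needed;
	int before = fetch_calls;
	char *buf;

	assert(initial > 0);
	buf = malloc(initial);
	if (buf == NULL)
		return (-1);
	if (fetch(payload, payload_len, buf, initial, &needed) != 0) {
		char *bigger = realloc(buf, needed);

		if (bigger == NULL) {
			free(buf);
			return (-1);
		}
		buf = bigger;
		(void) fetch(payload, payload_len, buf, needed, &needed);
	}
	free(buf);
	return (fetch_calls - before);
}
```

A payload larger than the initial buffer costs two calls, while one that fits in a 256K buffer costs a single call.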

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #9084
2019-09-25 11:27:50 -07:00
Matthew Ahrens 6c9882d5db Improve performance by using dmu_tx_hold_*_by_dnode()
In zfs_write() and dmu_tx_hold_sa(), we can use dmu_tx_hold_*_by_dnode()
instead of dmu_tx_hold_*(), since we already have a dbuf from the target
dnode in hand.  This eliminates some calls to dnode_hold(), which can be
expensive.  This is especially impactful if several threads are
accessing objects that are in the same block of dnodes, because they
will contend for that dbuf's lock.

We are seeing 10-20% performance wins for the sequential_writes tests in
the performance test suite, when doing >=128K writes to files with
recordsize=8K.

This also removes some unnecessary casts that are in the area.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #9081
2019-09-25 11:27:50 -07:00
Brian Behlendorf 8c00159411 Fix channel programs on s390x
When adapting the original sources for s390x the JMP_BUF_CNT was
mistakenly halved due to an incorrect assumption about the size of
an unsigned long.  It is 8 bytes on the s390x architecture.
Increase JMP_BUF_CNT accordingly.

Authored-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reported-by: Colin Ian King <canonical.com>
Tested-by: Colin Ian King <canonical.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8992
Closes #9080
2019-09-25 11:27:50 -07:00
George Wilson a8c5bcb5de Race between zfs-share and zfs-mount services
When a system boots the zfs-mount.service and the
zfs-share.service can start simultaneously. What may be
unclear is that sharing a filesystem will first mount
the filesystem if it's not already mounted. This means
that both services can race to mount the same filesystem.
This race can result in SEGFAULT or EBUSY conditions.

This change explicitly defines the start ordering between the
two services such that the zfs-mount.service is solely
responsible for mounting filesystems eliminating the race
between "zfs mount -a" and "zfs share -a" commands.

Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes #9083
2019-09-25 11:27:50 -07:00
Tomohiro Kusumi 6c68594675 Implement secpolicy_vnode_setid_retain()
Don't unconditionally return 0 (i.e. retain SUID/SGID).
Test CAP_FSETID capability.

https://github.com/pjd/pjdfstest/blob/master/tests/chmod/12.t
which expects SUID/SGID to be dropped on write(2) by non-owner fails
without this. Most filesystems make this decision within VFS by using
a generic file write for fops.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #9035
Closes #9043
2019-09-25 11:27:50 -07:00
Matthew Ahrens 1f5979d23f zed crashes when devid not present
zed core dumps due to a NULL pointer in zfs_agent_iter_vdev(). The
gs_devid is NULL, but the nvl has a "devid" entry.

zfs_agent_post_event() checks that ZFS_EV_VDEV_GUID or DEV_IDENTIFIER is
present in nvl, but then later it and zfs_agent_iter_vdev() assume that
DEV_IDENTIFIER is present and thus gs_devid is set.

Typically this is not a problem because usually either all vdevs have
devid's, or none of them do. Since zfs_agent_iter_vdev() first checks if
the vdev has devid before dereferencing gs_devid, the problem isn't
typically encountered. However, if some vdevs have devid's and some do
not, then the problem is easily reproduced.  This can happen if the pool
has been moved from a system that has devid's to one that does not.

The fix is for zfs_agent_iter_vdev() to only try to match the devid's if
both nvl and gsp have devid's present.
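
A minimal sketch of the guarded comparison follows. The `search_t` type is a hypothetical reduction of the search state to the one field discussed here, and `devid_matches()` is not the actual zed function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical reduction of the search state to the relevant field. */
typedef struct {
	const char *gs_devid;	/* may be NULL when the event has no devid */
} search_t;

/* Compare devids only when both sides actually have one. */
static bool
devid_matches(const search_t *gsp, const char *vdev_devid)
{
	assert(gsp != NULL);
	if (gsp->gs_devid == NULL || vdev_devid == NULL)
		return (false);
	return (strcmp(gsp->gs_devid, vdev_devid) == 0);
}
```

Checking both pointers before calling strcmp() covers the mixed case where only some vdevs carry a devid.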

Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-65090
Closes #9054
Closes #9060
2019-09-25 11:27:50 -07:00
Tomohiro Kusumi 4f951b183c Don't directly cast unsigned long to void*
Cast to uintptr_t first for portability on integer to/from pointer
conversion.
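
The conversion can be sketched as a pair of helpers. The function names are hypothetical; the only point is the intermediate cast through uintptr_t.

```c
#include <assert.h>
#include <stdint.h>

/* uintptr_t is defined to round-trip object pointers through an integer. */
static_assert(sizeof (uintptr_t) >= sizeof (void *),
    "uintptr_t must be able to hold a pointer");

/* Integer to pointer: go through uintptr_t rather than casting directly. */
static void *
ulong_to_ptr(unsigned long value)
{
	return ((void *)(uintptr_t)value);
}

/* Pointer to integer: same intermediate type in the other direction. */
static unsigned long
ptr_to_ulong(const void *ptr)
{
	return ((unsigned long)(uintptr_t)ptr);
}
```

On platforms where unsigned long is narrower than a pointer, `ptr_to_ulong()` would truncate, which is one reason the intermediate type matters for portability.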

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #9065
2019-09-25 11:27:50 -07:00
Tomohiro Kusumi 65a0b28b42 Fix module_param() type for zfs_read_chunk_size
zfs_read_chunk_size is unsigned long.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #9051
2019-09-25 11:27:50 -07:00
Tony Hutter be068aeea8 Move some tests to cli_user/zpool_status
The tests in tests/functional/cli_root/zpool_status should all require
root. However, linux.run has "user =" specified for those tests, which
means they run as a normal user.  When I removed that line to run them
as root, the following tests did not pass:

zpool_status_003_pos
zpool_status_-c_disable
zpool_status_-c_homedir
zpool_status_-c_searchpath

These tests need to be run as a normal user.  To fix this, move these
tests to a new tests/functional/cli_user/zpool_status directory.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #9057
2019-09-25 11:27:50 -07:00
Serapheim Dimitropoulos 1c4b0fc745 Race condition between spa async threads and export
In the past we've seen multiple race conditions that have
to do with open-context threads, async threads, and concurrent
calls to spa_export()/spa_destroy() (including the one
referenced in issue #9015).

This patch ensures that only one thread can execute the
main body of spa_export_common() at a time, with subsequent
threads returning with a new error code created just for
this situation, thereby eliminating any race condition
bugs introduced by concurrent calls to this function.
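
The single-entry guard can be sketched with a flag protected by a mutex. This is a userspace model under assumptions: `pool_t` and the begin/end helpers are hypothetical, and EBUSY is a placeholder for the dedicated error code the patch introduces.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pool state with an "export in progress" flag. */
typedef struct {
	pthread_mutex_t lock;
	bool exporting;
} pool_t;

/*
 * Claim the right to run the export body.  Returns 0 for the first
 * caller and an error for anyone arriving while it is in progress.
 */
static int
pool_export_begin(pool_t *p)
{
	int err = 0;

	assert(p != NULL);
	pthread_mutex_lock(&p->lock);
	if (p->exporting)
		err = EBUSY;	/* placeholder for the new error code */
	else
		p->exporting = true;
	pthread_mutex_unlock(&p->lock);
	return (err);
}

/* Release the claim once the export body has finished. */
static void
pool_export_end(pool_t *p)
{
	assert(p != NULL);
	pthread_mutex_lock(&p->lock);
	p->exporting = false;
	pthread_mutex_unlock(&p->lock);
}
```

Late callers return immediately instead of blocking, so they never run the export body concurrently with the first thread.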

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #9015
Closes #9044
2019-09-25 11:27:50 -07:00
Serapheim Dimitropoulos bbbe4b0a98 hdr_recl calls zthr_wakeup() on destroyed zthr
There exists a race condition where hdr_recl() calls
zthr_wakeup() on a destroyed zthr. The timeline is the
following:

[1] hdr_recl() runs first and goes into zthr_wakeup()
    because arc_initialized is set.
[2] arc_fini() is called by another thread, zeroes
    that flag, destroying the zthr, and goes into
    buf_init().
[3] hdr_recl() tries to enter the destroyed mutex
    and we blow up.

This patch ensures that the ARC's zthrs are not offloaded
any new work once arc_initialized is set and then destroys
them after all of the ARC state has been deleted.

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #9047
2019-09-25 11:27:50 -07:00
Tomohiro Kusumi 3c144b9267 Fix wrong comment on zcr_blksz_{min,max}
These aren't tunable; illumos has this comment fixed in
"3742 zfs comments need cleaner, more consistent style",
so sync with that.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #9052
2019-09-25 11:27:50 -07:00
Brian Behlendorf 428a63cc62 Retire unused spl_{mutex,rwlock}_{init,fini}
These functions are unused and can be removed along
with the spl-mutex.c and spl-rwlock.c source files.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9029
2019-09-25 11:27:49 -07:00
Brian Behlendorf 3982d959c5 Linux 5.3 compat: retire rw_tryupgrade()
The Linux kernel's rwsem's have never provided an interface to
allow a reader to be upgraded to a writer.  Historically, this
functionality has been implemented by a SPL wrapper function.
However, this approach depends on internal knowledge of the
rw_semaphore and is therefore rather brittle.

Since the ZFS code must always be able to fall back to rw_exit()
and rw_enter() when an rw_tryupgrade() fails, this functionality
isn't critical.  Furthermore, the only potentially performance
sensitive consumer is dmu_zfetch() and no decrease in performance
was observed with this change applied.  See the PR comments for
additional testing details.

Therefore, it is being retired to make the build more robust and
to simplify the rwlock implementation.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9029
2019-09-25 11:27:49 -07:00
Brian Behlendorf 54561073e7 Linux 5.3 compat: rw_semaphore owner
Commit https://github.com/torvalds/linux/commit/94a9717b updated the
rwsem's owner field to contain additional flags describing the rwsem's
state.  Rather than update the wrappers to mask out these bits, the
code no longer relies on the owner stored by the kernel.  This does
increase the size of a krwlock_t but it makes the implementation
less sensitive to future kernel changes.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9029
2019-09-25 11:27:49 -07:00
jdike 4c98586daf Fix lockdep recursive locking false positive in dbuf_destroy
lockdep reports a possible recursive lock in dbuf_destroy.

It is true that dbuf_destroy is acquiring the dn_dbufs_mtx
on one dnode while holding it on another dnode.  However,
it is impossible for these to be the same dnode because,
among other things, dbuf_destroy checks MUTEX_HELD before
acquiring the mutex.

This fix defines a class NESTED_SINGLE == 1 and changes
that lock to call mutex_enter_nested with a subclass of
NESTED_SINGLE.

In order to make the userspace code compile,
include/sys/zfs_context.h now defines mutex_enter_nested and
NESTED_SINGLE.

This is the lockdep report:

[  122.950921] ============================================
[  122.950921] WARNING: possible recursive locking detected
[  122.950921] 4.19.29-4.19.0-debug-d69edad5368c1166 #1 Tainted: G           O
[  122.950921] --------------------------------------------
[  122.950921] dbu_evict/1457 is trying to acquire lock:
[  122.950921] 0000000083e9cbcf (&dn->dn_dbufs_mtx){+.+.}, at: dbuf_destroy+0x3c0/0xdb0 [zfs]
[  122.950921]
               but task is already holding lock:
[  122.950921] 0000000055523987 (&dn->dn_dbufs_mtx){+.+.}, at: dnode_evict_dbufs+0x90/0x740 [zfs]
[  122.950921]
               other info that might help us debug this:
[  122.950921]  Possible unsafe locking scenario:

[  122.950921]        CPU0
[  122.950921]        ----
[  122.950921]   lock(&dn->dn_dbufs_mtx);
[  122.950921]   lock(&dn->dn_dbufs_mtx);
[  122.950921]
                *** DEADLOCK ***

[  122.950921]  May be due to missing lock nesting notation

[  122.950921] 1 lock held by dbu_evict/1457:
[  122.950921]  #0: 0000000055523987 (&dn->dn_dbufs_mtx){+.+.}, at: dnode_evict_dbufs+0x90/0x740 [zfs]
[  122.950921]
               stack backtrace:
[  122.950921] CPU: 0 PID: 1457 Comm: dbu_evict Tainted: G           O      4.19.29-4.19.0-debug-d69edad5368c1166 #1
[  122.950921] Hardware name: Supermicro H8SSL-I2/H8SSL-I2, BIOS 080011  03/13/2009
[  122.950921] Call Trace:
[  122.950921]  dump_stack+0x91/0xeb
[  122.950921]  __lock_acquire+0x2ca7/0x4f10
[  122.950921]  lock_acquire+0x153/0x330
[  122.950921]  dbuf_destroy+0x3c0/0xdb0 [zfs]
[  122.950921]  dbuf_evict_one+0x1cc/0x3d0 [zfs]
[  122.950921]  dbuf_rele_and_unlock+0xb84/0xd60 [zfs]
[  122.950921]  dnode_evict_dbufs+0x3a6/0x740 [zfs]
[  122.950921]  dmu_objset_evict+0x7a/0x500 [zfs]
[  122.950921]  dsl_dataset_evict_async+0x70/0x480 [zfs]
[  122.950921]  taskq_thread+0x979/0x1480 [spl]
[  122.950921]  kthread+0x2e7/0x3e0
[  122.950921]  ret_from_fork+0x27/0x50

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jeff Dike <jdike@akamai.com>
Closes #8984
2019-09-25 11:27:49 -07:00
Michael Niewöhner ceb516ac2f Add missing __GFP_HIGHMEM flag to vmalloc
Make use of __GFP_HIGHMEM flag in vmem_alloc, which is required for
some 32-bit systems to make use of full available memory.
While kernel versions >=4.12-rc1 add this flag implicitly, older
kernels do not.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #9031
2019-09-25 11:27:49 -07:00
Tomohiro Kusumi 2b9f73e5e6 Use zfsctl_snapshot_hold() wrapper
zfs_refcount_*() are to be wrapped by zfsctl_snapshot_*() in this file.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #9039
2019-09-25 11:27:49 -07:00
Brian Behlendorf 984bfb373f Minor style cleanup
Resolve an assortment of style inconsistencies including
use of white space, typos, capitalization, and line wrapping.
There is no functional change.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9030
2019-09-25 11:27:49 -07:00
Brian Behlendorf 446d08fba4 Fix get_special_prop() build failure
The cast of the size_t returned by strlcpy() to a uint64_t by the
VERIFY3U can result in a build failure when CONFIG_FORTIFY_SOURCE
is set.  This is due to the additional hardening.  Since the token
is expected to always fit in strval the VERIFY3U has been removed.
If somehow it doesn't, it will still be safely truncated.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #8999
Closes #9020
2019-09-25 11:27:49 -07:00
Antonio Russo af7a5672c3 systemd encryption key support
Modify zfs-mount-generator to produce a dependency on new
zfs-import-key-*.service units, dynamically created at boot to call
zfs load-key for the encryption root, before attempting to mount any
encrypted datasets.

These units are created by zfs-mount-generator. They use RequiresMountsFor
on the keyfile, if present, or call systemd-ask-password if a passphrase
is requested.

This patch includes suggestions from @Fabian-Gruenbichler, @ryanjaeb and
@rlaager, as well as an adaptation of @rlaager's script to retry on
incorrect password entry.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com>
Closes #8750
Closes #8848
2019-09-25 11:27:49 -07:00
Tomohiro Kusumi 73e50a7d5d Drop redundant POSIX ACL check in zpl_init_acl()
ZFS_ACLTYPE_POSIXACL has already been tested in zpl_init_acl(),
so no need to test again on POSIX ACL access.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #9009
2019-09-25 11:27:49 -07:00
Brian Behlendorf d751b12a9d Export dnode symbols
External consumers such as Lustre require access to the dnode
interfaces in order to correctly manipulate dnodes.

Reviewed-by: James Simmons <uja.ornl@yahoo.com>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #8994
Closes #9027
2019-09-25 11:27:49 -07:00
Tom Caputi 78831d4290 Ensure dsl_destroy_head() decrypts objsets
This patch corrects a small issue where the dsl_destroy_head()
code that runs when the async_destroy feature is disabled would
not properly decrypt the dataset before beginning processing.
If the dataset is not able to be decrypted, the optimization
code now simply does not run and the dataset is completely
destroyed in the DSL sync task.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #9021
2019-09-25 11:27:49 -07:00
Tomohiro Kusumi 0a223246e1 Disable unused pathname::pn_path* (unneeded in Linux)
struct pathname is originally from Solaris VFS, and it has been used
in ZoL to merely call VOP from Linux VFS interface without API change,
therefore pathname::pn_path* are unused and unneeded. Technically,
struct pathname is a wrapper for C string in ZoL.

This saves a bit of stack on lookup and unlink.

(#if0'd members instead of removing since comments refer to them.)

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #9025
2019-09-25 11:27:49 -07:00
Nick Mattis cf966cb19a Fixes: #8934 Large kmem_alloc
An allocation larger than the spl_kmem_alloc_warn value was being performed.
Switched to vmem_alloc interface as specified for large allocations.
Changed the subsequent frees to match.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: nmattis <nickm970@gmail.com>
Closes #8934
Closes #9011
2019-09-25 11:27:49 -07:00
Attila Fülöp 6e19cc77cf Fix ZTS killed processes detection
log_neg_expect was using the wrong exit status to detect if a process
got killed by SIGSEGV or SIGBUS, resulting in false positives.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Attila Fülöp <attila@fueloep.org>
Closes #9003
2019-09-25 11:27:49 -07:00
Shaun Tancheff c3a3c5a30f pkg-utils python sitelib for SLES15
Use python -Esc to set __python_sitelib.

Reviewed-by: Neal Gompa <ngompa@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Shaun Tancheff <stancheff@cray.com>
Closes #8969
2019-09-25 11:27:49 -07:00
Tomohiro Kusumi ccd8125e45 Fix race in parallel mount's thread dispatching algorithm
The strategy of parallel mount is as follows.

1) Initial thread dispatching is to select sets of mount points that
 don't have dependencies on other sets, hence threads can/should run
 lock-less and shouldn't race with other threads for other sets. Each
 thread dispatched corresponds to top level directory which may or may
 not have datasets to be mounted on sub directories.

2) Subsequent recursive thread dispatching for each thread from 1)
 is to mount datasets for each set of mount points. The mount points
 within each set have dependencies (i.e. child directories), so child
 directories are processed only after parent directory completes.

The problem is that the initial thread dispatching in
zfs_foreach_mountpoint() can be multi-threaded when it needs to be
single-threaded, which creates a race between threads. This race
appeared as mount/unmount issues on ZoL because ZoL has different
timing regarding mount(2) execution due to fork(2)/exec(2) of mount(8).
`zfs unmount -a`, which expects proper mount order, can't unmount if the
mounts were reordered by the race condition.

There are currently two known patterns of input list `handles` in
`zfs_foreach_mountpoint(..,handles,..)` which cause the race condition.

1) #8833 case where input is `/a /a /a/b` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list with two same top level directories.
 There is a race between two POSIX threads A and B,
  * ThreadA for "/a" for test1 and "/a/b"
  * ThreadB for "/a" for test0/a
 and in case of #8833, ThreadA won the race. Two threads were created
 because "/a" wasn't considered as `"/a" contains "/a"`.

2) #8450 case where input is `/ /var/data /var/data/test` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list containing "/".
 There is a race between two POSIX threads A and B,
  * ThreadA for "/" and "/var/data/test"
  * ThreadB for "/var/data"
 and in case of #8450, ThreadA won the race. Two threads were created
 because "/var/data" wasn't considered as `"/" contains "/var/data"`.
 In other words, if there is (at least one) "/" in the input list,
 the initial thread dispatching must be single-threaded since every
 directory is a child of "/", meaning they all directly or indirectly
 depend on "/".

In both cases, the first non_descendant_idx() call fails to correctly
determine "path1-contains-path2", and as a result the initial thread
dispatching creates another thread when it needs to be single-threaded.
Fix a conditional in libzfs_path_contains() to consider above two.
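
The two cases above can be sketched as a containment test. This is a
minimal illustration under stated assumptions, not the libzfs source:
path_contains() is a hypothetical stand-in for libzfs_path_contains(),
and it assumes absolute, already-normalized paths.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical stand-in for libzfs_path_contains(): returns true when
 * path2 is path1 itself or lies underneath path1.
 */
static bool
path_contains(const char *path1, const char *path2)
{
	size_t len1;

	assert(path1 != NULL && path2 != NULL);
	len1 = strlen(path1);

	/* Case 1 (#8833): identical paths, "/a" contains "/a". */
	if (strcmp(path1, path2) == 0)
		return (true);

	/* Case 2 (#8450): "/" contains every absolute path. */
	if (strcmp(path1, "/") == 0)
		return (true);

	/* General case: path1 is a prefix that ends on a '/' boundary. */
	return (strncmp(path1, path2, len1) == 0 && path2[len1] == '/');
}
```

With a test of this shape, both `/a /a /a/b` and `/ /var/data
/var/data/test` collapse into a single top-level set, so the initial
dispatch stays single-threaded for them.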

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8450
Closes #8833
Closes #8878
2019-09-25 11:27:49 -07:00
loli10K 2ac233c633 Fix dracut Debian/Ubuntu packaging
This commit ensures make(1) targets that build .deb packages fail if
alien(1) can't convert all .rpm files; additionally it also updates
the zfs-dracut package name which was changed to "noarch" in ca4e5a7.

Reviewed-by: Neal Gompa <ngompa@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8990
Closes #8991
2019-09-25 11:27:49 -07:00
Tom Caputi 1f72a18f59 Remove VERIFY from dsl_dataset_crypt_stats()
This patch fixes an issue where dsl_dataset_crypt_stats() would
VERIFY that it was able to hold the encryption root. This function
should instead silently continue without populating the related
field in the nvlist, as is the convention for this code.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8976
2019-09-25 11:27:49 -07:00
Paul Zuchowski 14a11bf2f6 Improve "Unable to automount" error message.
Having the mountpoint and dataset name both in the message made it
confusing to read.  Additionally, convert this to a zfs_dbgmsg rather than
sending it to the console.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #8959
2019-09-25 11:27:49 -07:00
Brian Behlendorf 7a03d7c73c Check b_freeze_cksum under ZFS_DEBUG_MODIFY conditional
The b_freeze_cksum field can only have data when ZFS_DEBUG_MODIFY
is set.  Therefore, the EQUIV check must be wrapped accordingly.
For the same reason the ASSERT in arc_buf_fill() is unsafe.
However, since it's largely redundant it has simply been removed.

Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8979
2019-09-25 11:27:49 -07:00
Tom Caputi 9e09826b33 Fix error text for EINVAL in zfs_receive_one()
This small patch fixes the EINVAL case for zfs_receive_one(). A
missing 'else' has been added to the two possible cases, which
will ensure the intended error message is printed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8977
2019-09-25 11:27:48 -07:00
Tomohiro Kusumi 093bb64461 Don't use d_path() for automount mount point for chroot'd process
A chroot'd process fails to automount snapshots due to realpath(3)
failure in mount.zfs(8).

Construct the mount point path from the sb of the ctldir inode and the
dirent name, instead of from d_path(), so that a chroot'd process isn't
affected by its view of the fs.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8903
Closes #8966
2019-09-25 11:27:48 -07:00
George Wilson 7d2489cfad nopwrites on dmu_sync-ed blocks can result in a panic
After device removal, performing nopwrites on a dmu_sync-ed block
will result in a panic. This panic can show up in two ways:
1. an attempt to issue an IOCTL in vdev_indirect_io_start()
2. a failed comparison of zio->io_bp and zio->io_bp_orig in
   zio_done()
To resolve both of these panics, nopwrites of blocks on indirect
vdevs should be ignored and new allocations should be performed on
concrete vdevs.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: George Wilson <gwilson@delphix.com>
Closes #8957
2019-09-25 11:27:48 -07:00
Alexander Motin 04d4df89f4 Avoid extra taskq_dispatch() calls by DMU
The DMU sync code calls taskq_dispatch() for each sublist of
os_dirty_dnodes and os_synced_dnodes.  Since the number of sublists
equals the number of CPUs by default, it dispatches an equal,
potentially large, number of tasks, waking up many CPUs to handle them
even if only one or a few of the sublists actually have any work to do.

This change adds a check for empty sublists to avoid this.
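
The idea can be sketched in a few lines of plain C. Everything here is
hypothetical and stands in for the real DMU structures: sublist_len[]
models the per-sublist dnode counts, and the callback models the task
that a real caller would hand to taskq_dispatch().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type standing in for a taskq task. */
typedef void (*sync_task_fn)(size_t sublist_idx, void *arg);

/* Example task: bumps the counter that arg points to. */
static void
count_task(size_t sublist_idx, void *arg)
{
	(void) sublist_idx;
	(*(size_t *)arg)++;
}

/*
 * Run one task per non-empty sublist and return how many tasks ran.
 * sublist_len[i] is the number of dirty dnodes on sublist i.
 */
static size_t
dispatch_nonempty(const size_t *sublist_len, size_t nsublists,
    sync_task_fn fn, void *arg)
{
	size_t dispatched = 0;

	assert(sublist_len != NULL && fn != NULL);
	for (size_t i = 0; i < nsublists; i++) {
		if (sublist_len[i] == 0)
			continue;	/* nothing to sync, wake no CPU */
		fn(i, arg);		/* real code would dispatch here */
		dispatched++;
	}
	return (dispatched);
}
```

Passing count_task with a counter shows the effect: for four sublists
of which two are empty, only two tasks run.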

Reviewed by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
Closes #8909
2019-09-25 11:27:48 -07:00
Igor K 05006f125c -Y option for zdb is valid
The -Y option was added for ztest to test split block reconstruction.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes #8926
2019-09-25 11:27:48 -07:00
Tom Caputi bfe5f029cf Fix error message on promoting encrypted dataset
This patch corrects the error message reported when attempting
to promote a dataset outside of its encryption root.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8905
Closes #8935
2019-09-25 11:27:48 -07:00
Brian Behlendorf cc7fe8a599 Fix out-of-tree build failures
Resolve the incorrect use of srcdir and builddir references for
various files in the build system.  These have crept in over time
and went unnoticed because when building in the top level directory
srcdir and builddir are identical.

With this change it's again possible to build in a subdirectory.

    $ mkdir obj
    $ cd obj
    $ ../configure
    $ make

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8921
Closes #8943
2019-09-25 11:27:48 -07:00
Matthew Ahrens 7d64595c25 dn_struct_rwlock can not be held in dmu_tx_try_assign()
The thread calling dmu_tx_try_assign() can't hold the dn_struct_rwlock
while assigning the tx, because this can lead to deadlock. Specifically,
if this dnode is already assigned to an earlier txg, this thread may
need to wait for that txg to sync (the ERESTART case below).  The other
thread that has assigned this dnode to an earlier txg prevents this txg
from syncing until its tx can complete (calling dmu_tx_commit()), but it
may need to acquire the dn_struct_rwlock to do so (e.g. via
dmu_buf_hold*()).

This commit adds an assertion to dmu_tx_try_assign() to ensure that this
deadlock is not inadvertently introduced.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8929
2019-09-25 11:27:48 -07:00
gordan-bobic be4a282a8d Remove arch and relax version dependency
Remove arch and relax version dependency for zfs-dracut
package.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gordan Bobic <gordan@redsleeve.org>
Issue #8913
Closes #8914
2019-09-25 11:27:48 -07:00
Harry Mallon 2d88230d97 Add libnvpair to libzfs pkg-config
Functions such as `fnvlist_lookup_nvlist` need libnvpair to be linked.
Default pkg-config file did not contain it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Harry Mallon <hjmallon@gmail.com>
Closes #8919
2019-09-25 11:27:48 -07:00
Don Brady 95fcb04215 Let zfs mount all tolerate in-progress mounts
The zfs-mount service can unexpectedly fail to start when zfs
encounters a mount that is in progress. This service uses
zfs mount -a, which has a window between the time it checks if
the dataset was mounted and when the actual mount (via mount.zfs
binary) occurs.

The reason for the racing mounts is that both zfs-mount.target
and zfs-share.target are allowed to execute concurrently after
the import.  This is more of an issue with the relatively recent
addition of parallel mounting, and we should consider serializing
the mount and share targets.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #8881
2019-09-25 11:27:48 -07:00
Allan Jude d053481523 zstreamdump: add per-record-type counters and an overhead counter
Count the bytes of payload for each replication record type

Count the bytes of overhead (replication records themselves)

Include these counters in the output summary at the end of the run.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Sponsored-By: Klara Systems and Catalogic
Closes #8432
2019-09-25 11:27:48 -07:00
Paul Dagnelie 7a5f4656ce Fix comments on zfs_bookmark_phys
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #8945
2019-09-25 11:27:48 -07:00
Paul Dagnelie 1fd28bd8d4 Add SCSI_PASSTHROUGH to zvols to enable UNMAP support
When exporting ZVOLs as SCSI LUNs, by default Windows will not
issue them UNMAP commands. This reduces storage efficiency in
many cases.

We add the SCSI_PASSTHROUGH flag to the zvol's device queue,
which lets the SCSI target logic know that it can handle SCSI
commands.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #8933
2019-09-25 11:27:48 -07:00
Tomohiro Kusumi ab24c9cd4c Prevent pointer to an out-of-scope local variable
`show_str` could be a pointer to a local variable in stack
which is out-of-scope by the time
`return (snprintf(buf, buflen, "%s\n", show_str));`
is called.
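
The pattern behind this kind of fix can be sketched as follows. The
function and its parameters are hypothetical, not the patched ZFS code;
the point is only that the scratch buffer is declared at function scope
so that show_str stays valid until snprintf() has run.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical show routine.  tmp lives at function scope, so show_str
 * never outlives the storage it points to.  Declaring tmp inside the
 * if-block instead would leave show_str dangling at the final
 * snprintf() call.
 */
static int
show_value(char *buf, size_t buflen, const char *name, int value)
{
	char tmp[32];			/* function scope, not block scope */
	const char *show_str = name;

	assert(buf != NULL && buflen > 0);
	if (name == NULL) {
		(void) snprintf(tmp, sizeof (tmp), "%d", value);
		show_str = tmp;
	}
	return (snprintf(buf, buflen, "%s\n", show_str));
}
```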

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8924
Closes #8940
2019-09-25 11:27:48 -07:00
Matthew Ahrens 3c2a42fd25 dedup=verify doesn't clear the blkptr's dedup flag
The logic to handle strong checksum collisions where the data doesn't
match is incorrect. It is not clearing the dedup bit of the blkptr,
which can cause a panic later in zio_ddt_free() due to the dedup table
not matching what is in the blkptr.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-48097
Closes #8936
2019-09-25 11:27:48 -07:00
Igor K 9af524b0ee Update vdev_ops_t from illumos
Align vdev_ops_t from illumos for better compatibility.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes #8925
2019-09-25 11:27:48 -07:00
Tom Caputi b96ceeead2 Allow unencrypted children of encrypted datasets
When encryption was first added to ZFS, we made a decision to
prevent users from creating unencrypted children of encrypted
datasets. The idea was to prevent users from inadvertently
leaving some of their data unencrypted. However, since the
release of 0.8.0, some legitimate reasons have been brought up
for this behavior to be allowed. This patch simply removes this
limitation from all code paths that had checks for it and updates
the tests accordingly.

Reviewed-by: Jason King <jason.king@joyent.com>
Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8737
Closes #8870
2019-09-25 11:27:48 -07:00
dacianstremtan 01cc94f68d Replace whereis with type in zfs-lib.sh
The whereis command should not be used since it may not exist
in the initramfs.  The dracut plymouth module also uses the type
command instead of whereis.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Garrett Fields <ghfields@gmail.com>
Signed-off-by: Dacian Reece-Stremtan <dacianstremtan@gmail.com>
Closes #8920
Closes #8938
2019-09-25 11:27:48 -07:00
Tomohiro Kusumi fb6f6b47d6 Use ZFS_DEV macro instead of literals
The rest of the code/comments use ZFS_DEV, so sync with that.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8912
2019-09-25 11:27:48 -07:00
Michael Niewöhner 2087b6cf49 Fix memory leak in check_disk()
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #8897
Closes #8911
2019-09-25 11:27:47 -07:00
Olaf Faaland 5b0327bc57 kmod-zfs-devel rpm should provide kmod-spl-devel
When configure is run with --with-spec=redhat, and rpms are built, the
kmod-zfs-devel package is missing

Provides: kmod-spl-devel = %{version}

which is required by software such as Lustre which builds against zfs
kmods.  Adding it makes it easier for such software to build against
both zfs-0.7 (where SPL is separate and may be missing) and zfs-0.8.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #8930
2019-09-25 11:27:47 -07:00
Brian Behlendorf b5e8d14a4b ZTS: Fix mmp_interval failure
The mmp_interval test case was failing on Fedora 30 due to the built-in
'echo' command terminating the script when it was unable to write to
the sysfs module parameter.  This change in behavior was observed with
ksh-2020.0.0-alpha1.  Resolve the issue by using the external cat
command which fails gracefully as expected.

Additionally, remove some incorrect quotes around the $? return values.

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8906
2019-09-25 11:27:47 -07:00
Alexander Motin ed7b0d357a Minimize aggsum_compare(&arc_size, arc_c) calls.
When the ARC is busy, having arc_size close to arc_c is the desired
state.  But then it is quite likely that aggsum_compare(&arc_size,
arc_c) will need to flush per-CPU buckets to find the exact comparison
result.  Doing that often in a hot path defeats the whole idea of using
aggsum there, since it replaces a few simple atomic additions with
dozens of lock acquisitions.

According to PMC profiles, replacing aggsum_compare() with
aggsum_upper_bound() in the code that increases arc_p when the ARC is
growing (arc_size < arc_c) saves ~5% of CPU time in aggsum code during
sequential writes to 12 ZVOLs with 16KB block size on a large
dual-socket system.

I suppose there is some minor arc_p behavior change due to the lower
precision of the new code, but I don't think it is a big deal, since it
should affect only a very small window in time (aggsum buckets are
flushed every second) and in ARC size (buckets are limited to 10 average
ARC blocks per CPU).

Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
Closes #8901
2019-09-25 11:27:47 -07:00
Ryan Moeller 9e54b9d930 Python config cleanup
Don't require Python at configure/build unless building pyzfs.
Move ZFS_AC_PYTHON_MODULE to always-pyzfs.m4 where it is used.
Make test syntax more consistent.

Sponsored by: iXsystems, Inc.
Reviewed-by: Neal Gompa <ngompa@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes #8895
2019-09-25 11:27:47 -07:00
Matthew Ahrens b033353b25 lz4_decompress_abd declared but not defined
`lz4_decompress_abd` is declared in zio_compress.h but it is not defined
anywhere. The declaration should be removed.

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-47477
Closes #8894
2019-09-25 11:27:47 -07:00
Matthew Ahrens 6083f40387 panic in removal_remap test on 4K devices
If the zfs_remove_max_segment tunable is changed to be not a multiple of
the sector size, then the device removal code will malfunction and try
to create mappings that are smaller than one sector, leading to a panic.

On debug bits this assertion will fail in spa_vdev_copy_segment():
    ASSERT3U(DVA_GET_ASIZE(&dst), ==, size);

On nondebug, the system panics with a stack like:
    metaslab_free_concrete()
    metaslab_free_impl()
    metaslab_free_impl_cb()
    vdev_indirect_remap()
    free_from_removing_vdev()
    metaslab_free_impl()
    metaslab_free_dva()
    metaslab_free()

Fortunately, the default for zfs_remove_max_segment is 1MB, so this
can't occur by default.  We hit it during this test because
removal_remap.ksh changes zfs_remove_max_segment to 1KB. When testing on
4KB-sector disks, we hit the bug.

This change makes the zfs_remove_max_segment tunable more robust,
automatically rounding it up to a multiple of the sector size. We also
turn some key assertions into VERIFY's so that similar bugs would be
caught before they are encoded on disk (and thus avoid a
panic-reboot-loop).
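
The rounding step can be sketched as below. The helper name and the
use of an ashift parameter are assumptions for illustration; the sketch
also assumes the sector size is a power of two.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Round size up to the next multiple of the sector size (1 << ashift).
 * Hypothetical helper, not the ZFS implementation.
 */
static uint64_t
round_up_to_sector(uint64_t size, unsigned ashift)
{
	uint64_t sector;

	assert(ashift < 64);
	sector = UINT64_C(1) << ashift;
	return ((size + sector - 1) & ~(sector - 1));
}
```

For the case in the test, a 1KB tunable on 4KB-sector disks (ashift 12)
rounds up to 4096, while the 1MB default is already a multiple and is
left unchanged.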

Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-61342
Closes #8893
2019-09-25 11:27:47 -07:00
Matthew Ahrens 592ee2e6dd compress metadata in later sync passes
Starting in sync pass 5 (zfs_sync_pass_dont_compress), we disable
compression (including of metadata).  Ostensibly this helps the sync
passes to converge (i.e. for a sync pass to not need to allocate
anything because it is 100% overwrites).

However, in practice it increases the average number of sync passes,
because when we turn compression off, the sizes of many blocks change
and thus we have to re-allocate (not overwrite) them.  It also increases
the number of 128KB allocations (e.g. for indirect blocks and spacemaps)
because these will not be compressed.  The 128K allocations are
especially detrimental to performance on highly fragmented systems,
which may have very few free segments of this size, and may need to load
new metaslabs to satisfy 128K allocations.

We should increase zfs_sync_pass_dont_compress.  In practice on a highly
fragmented system we see a few 5-pass txg's, a tiny number of 6-pass
txg's, and no txg's with more than 6 passes.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-63431
Closes #8892
2019-09-25 11:27:47 -07:00
Alexander Motin cab7d856ea Move write aggregation memory copy out of vq_lock
A memory copy is too heavy an operation to do under the congested lock.
Moving it out reduces congestion many times over, to almost invisible.
Since the original zio is removed from the queue, and the child zio is
not executed yet, I don't see why the copy would need protection.
My guess is that it just remained like this from the time when the lock
was not dropped here; the drop was added later to fix a lock ordering
issue.

Multi-threaded sequential write tests with both HDD and SSD pools
with ZVOL block sizes of 4KB, 16KB, 64KB and 128KB all show a major
reduction of lock congestion, saving from 15% to 35% of CPU time
and increasing throughput from 10% to 40%.

Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
Closes #8890
2019-09-25 11:27:47 -07:00
Tulsi Jain 19cebf0518 Restrict filesystem creation if name referred either '.' or '..'
This change restricts filesystem creation if the given name
contains either '.' or '..'
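
A check of this kind might look like the sketch below. It reads the
message as referring to whole '/'-separated components, which is an
assumption (under this reading a name such as pool/fs.1 stays legal),
and has_dot_component() is a hypothetical helper, not the ZFS namecheck
code.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true if any '/'-separated component of a dataset name is
 * exactly "." or "..".
 */
static bool
has_dot_component(const char *name)
{
	assert(name != NULL);
	while (*name != '\0') {
		size_t len = strcspn(name, "/");

		if ((len == 1 && name[0] == '.') ||
		    (len == 2 && name[0] == '.' && name[1] == '.'))
			return (true);
		name += len;		/* skip the component */
		if (*name == '/')
			name++;		/* skip the separator */
	}
	return (false);
}
```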

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: TulsiJain <tulsi.jain@delphix.com>
Closes #8842
Closes #8564
2019-09-25 11:27:47 -07:00
Matthew Ahrens 77e64c6fff ztest: dmu_tx_assign() gets ENOSPC in spa_vdev_remove_thread()
When running zloop, we occasionally see the following crash:

    dmu_tx_assign(tx, TXG_WAIT) == 0 (0x1c == 0)
    ASSERT at ../../module/zfs/vdev_removal.c:1507:spa_vdev_remove_thread()
    /sbin/ztest(+0x89c3)[0x55faf567b9c3]

The error value 0x1c is ENOSPC.

The transaction used by spa_vdev_remove_thread() should not be able to
fail due to being out of space. i.e. we should not call
dmu_tx_hold_space().  This will allow the removal thread to schedule its
work even when the pool is low on space.  The "slop space" will provide
enough free space to sync out the txg.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-37853
Closes #8889
2019-09-25 11:27:47 -07:00
Tomohiro Kusumi 4f809bddc6 Fix lockdep warning on insmod
sysfs_attr_init() is required to make lockdep happy for dynamically
allocated sysfs attributes. This fixed #8868 on Fedora 29 running
kernel-debug.

This requirement was introduced in 2.6.34.
See include/linux/sysfs.h for what it actually does.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8868
Closes #8884
2019-09-25 11:27:47 -07:00
Matthew Ahrens 516a08ebb4 fat zap should prefetch when iterating
When iterating over a ZAP object, we're almost always certain to iterate
over the entire object. If there are multiple leaf blocks, we can
realize a performance win by issuing reads for all the leaf blocks in
parallel when the iteration begins.

For example, if we have 10,000 snapshots, "zfs destroy -nv
pool/fs@1%9999" can take 30 minutes when the cache is cold. This change
provides a >3x performance improvement, by issuing the reads for all ~64
blocks of each ZAP object in parallel.

Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-58347
Closes #8862
2019-09-25 11:27:47 -07:00
Matthew Ahrens 812c36fc71 Target ARC size can get reduced to arc_c_min
Sometimes the target ARC size is reduced to arc_c_min, which impacts
performance.  We've seen this happen as part of the random_reads
performance regression test, where the ARC size is reduced before the
reads test starts, which increases how long it takes for the system to
reach good IOPS performance.

We call arc_reduce_target_size when arc_reap_cb_check() returns TRUE,
and arc_available_memory() is less than arc_c>>arc_shrink_shift.

However, arc_available_memory() could easily be low, even when arc_c is
low, because we can have tons of unused bufs in the abd kmem cache. This
would be especially true just after the DMU requests a bunch of stuff be
evicted from the ARC (e.g. due to "zpool export").

To fix this, the ARC should reduce arc_c by the requested amount, not
all the way down to arc_size (or arc_c_min), which can be very small.
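
The intended arithmetic can be modeled in isolation. This is a
simplified sketch with hypothetical names: it only models how the
target is lowered by the requested amount and clamped at the minimum,
and it leaves out the arc_p adjustment and everything else the real
function does.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of the target-size update: lower the target c by
 * to_free, but never below c_min.
 */
static uint64_t
reduce_target(uint64_t c, uint64_t c_min, uint64_t to_free)
{
	assert(c >= c_min);
	if (c > to_free && c - to_free > c_min)
		return (c - to_free);
	return (c_min);
}
```

For example, with a target of 8192 units, a minimum of 1024 and a
request to free 256, the new target is 7936 regardless of how small the
current ARC size happens to be.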

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-59431
Closes #8864
2019-09-25 11:27:47 -07:00
bnjf fe11968bbf Fix typo in vdev_raidz_math.c
Fix typo in vdev_raidz_math.c

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brad Forschinger <github@bnjf.id.au>
Closes #8875
Closes #8880
2019-09-25 11:27:47 -07:00
Richard Elling 4be4dedb9f Improve ZTS block_device_wait debugging
The udevadm settle timeout can be 120 or 180 seconds by default
for some distributions. If a long delay is experienced, it could
be due to some strangeness in a malfunctioning device that isn't
related to the devices under test. To help debug this condition,
a notice is given if settle takes too long.

Arguments can now be passed to block_device_wait. The expected
arguments are block device pathnames.

Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com>
Closes #8839
2019-09-25 11:27:47 -07:00
Richard Elling fb52bf9b1d Block_device_wait does not return an error code
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com>
Closes #8839
2019-09-25 11:27:47 -07:00
Richard Elling a22b00f924 Remove redundant redundant remove
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com>
Closes #8839
2019-09-25 11:27:47 -07:00
Richard Elling c350e62309 Fix logic error in setpartition function
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com>
Closes #8839
2019-09-25 11:27:47 -07:00
Paul Dagnelie 6f7bc75825 Allow metaslab to be unloaded even when not freed from
On large systems, the memory used by loaded metaslabs can become
a concern. While range trees are a fairly efficient data structure,
on heavily fragmented pools they can still consume a significant
amount of memory. This problem is amplified when we fail to unload
metaslabs that we aren't using. Currently, we only unload a metaslab
during metaslab_sync_done; in order for that function to be called
on a given metaslab in a given txg, we have to have dirtied that
metaslab in that txg. If the dirtying was the result of an allocation,
we wouldn't be unloading it (since it wouldn't be 8 txgs since it
was selected), so in effect we only unload a metaslab during txgs
where it's being freed from.

We move the unload logic from sync_done to a new function, and
call that function on all metaslabs in a given vdev during
vdev_sync_done().

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #8837
2019-09-25 11:27:47 -07:00
Jorgen Lundman 06900c409b Avoid updating zfs_gitrev.h when rev is unchanged
Build process would always re-compile spa_history.c due to touching
zfs_gitrev.h - avoid if no change in gitrev.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #8860
2019-09-25 11:27:47 -07:00
Allan Jude 60cbc18136 l2arc_apply_transforms: Fix typo in comment
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Closes #8822
2019-09-25 11:27:46 -07:00
Serapheim Dimitropoulos b63ed49c29 Reduced IOPS when all vdevs are in the zfs_mg_fragmentation_threshold
Historically while doing performance testing we've noticed that IOPS
can be significantly reduced when all vdevs in the pool are hitting
the zfs_mg_fragmentation_threshold percentage. Specifically in a
hypothetical pool with two vdevs, what can happen is the following:
Vdev A would go above that threshold and only vdev B would be used.
Then vdev B would pass that threshold but vdev A would go below it
(we've been freeing from A to allocate to B). The allocations would
go back and forth utilizing one vdev at a time with IOPS taking a hit.

Empirically, we've seen that our vdev selection for allocations is
good enough that fragmentation increases uniformly across all vdevs
the majority of the time. Thus we set the threshold percentage high
enough to avoid hitting the speed bump on pools that are being pushed
to the edge. We effectively disable its effect in the majority of the
cases but we don't remove it (at least for now) just in case we hit any
weird behavior in the future.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8859
2019-09-25 11:27:46 -07:00
Tomohiro Kusumi fafe72712a Drop objid argument in zfs_znode_alloc() (sync with OpenZFS)
Since zfs_znode_alloc() already takes dmu_buf_t*, taking another
uint64_t argument for objid is redundant. inode's ->i_ino does and
needs to match znode's ->z_id.

zfs_znode_alloc() in FreeBSD and illumos doesn't have this argument
since vnode doesn't have vnode# in VFS (hence ->z_id exists).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #8841
2019-09-25 11:27:46 -07:00
Tomohiro Kusumi 328c95e391 Remove vn_set_fs_pwd()/vn_set_pwd() (no need to be at / during insmod)
Per suggestion from @behlendorf in #8777, remove vn_set_fs_pwd() and
vn_set_pwd() which are only used in zfs_ioctl.c:_init() while loading
zfs.ko.

The rest of the initialization functions called here after cwd is set
to / don't depend on the cwd of the process, except for spa_config_load().
spa_config_load() uses a relative path ".//etc/zfs/zpool.cache" when
`rootdir` is non-NULL, which is "/etc/zfs/zpool.cache" given cwd is /,
so just unconditionally use the absolute path without "./", so that
`vn_set_pwd("/")` as well as the entire functions can be removed.
This is also what FreeBSD does.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #8826
2019-09-25 11:27:46 -07:00
Josh Soref 6ce10fdabb grammar: it is / plural agreement
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>
Closes #8818
2019-09-25 11:27:46 -07:00
Tomohiro Kusumi e4a11acfac Refactor parent dataset handling in libzfs zfs_rename()
For recursive renaming, simplify the code by moving `zhrp` and
`parentname` to inner scope. `zhrp` is only used to test existence
of a parent dataset for recursive dataset dir scan since ba6a24026c.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #8815
2019-09-25 11:27:46 -07:00
Ryan Moeller 90d8067a77 Update comments to match code
s/get_vdev_spec/make_root_vdev

The former doesn't exist anymore.

Sponsored by: iXsystems, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Ryan Moeller <ryan@freqlabs.com>
Closes #8759
2019-09-25 11:27:46 -07:00
Tomohiro Kusumi e5a877c5d0 Update descriptions for vnops
These descriptions are not up to date with the code.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8767
2019-09-25 11:27:46 -07:00
Tomohiro Kusumi 4933b0a25b Drop local definition of MOUNT_BUSY
It's accessible via <sys/mntent.h>.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #8765
2019-09-25 11:27:46 -07:00
Rafael Kitover e0cd6c28a3 kernel timer API rework
In `config/kernel-timer.m4` refactor slightly to check more generally
for the new `timer_setup()` APIs, but also check the callback signature
because some kernels (notably 4.14) have the new `timer_setup()` API but
use the old callback signature. Also add a check for a `flags` member in
`struct timer_list`, which was added in 4.1-rc8.

Add compatibility shims to `include/spl/sys/timer.h` to allow using the
new timer APIs with the only two caveats being that the callback
argument type must be declared as `spl_timer_list_t` and an explicit
assignment is required to get the timer variable for the `timer_of()`
macro. So the callback would look like this:

```c
__cv_wakeup(spl_timer_list_t t)
{
	struct timer_list *tmr = (struct timer_list *)t;
	struct thing *parent = from_timer(parent, tmr,
	    parent_timer_field);
	... /* do stuff with parent */
}
```

Make some minor changes to `spl-condvar.c` and `spl-taskq.c` to use the
new timer APIs instead of conditional code.

Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rafael Kitover <rkitover@gmail.com>
Closes #8647
2019-09-25 11:27:46 -07:00
Tony Hutter 63b88f7e22 Tag zfs-0.8.1
META file and changelog updated.

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
2019-06-14 09:43:18 -07:00
Alexander Motin 72888812b0 Fix comparison signedness in arc_is_overflowing()
When ARC size is very small, aggsum_lower_bound(&arc_size) may return
negative values. Because the comparison was unsigned, these were treated
as very large values, causing delays while waiting for arc_adjust() to
"fix" it by calling aggsum_value(&arc_size).  Using a signed comparison
there fixes the problem.
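
The effect of comparison signedness can be sketched in isolation. The
two functions below are hypothetical stand-ins and are not the actual
arc_is_overflowing() code; they only show how a negative lower bound
behaves under each kind of comparison:

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned comparison: a negative lower bound converts to a huge value. */
static int
over_limit_unsigned(int64_t lower_bound, uint64_t limit)
{
	return ((uint64_t)lower_bound >= limit);
}

/* Signed comparison: a negative lower bound stays below the limit. */
static int
over_limit_signed(int64_t lower_bound, uint64_t limit)
{
	assert(limit <= INT64_MAX);
	return (lower_bound >= (int64_t)limit);
}
```

With a lower bound of -1 and a limit of 1024, the unsigned form reports
an overflow while the signed form does not.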

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #8873
2019-06-11 10:45:23 -07:00
Tom Caputi 581c77e725 Fix incorrect error message for raw receive
This patch fixes an incorrect error message that comes up when
doing a non-forcing, raw, incremental receive into a dataset
that has a newer snapshot than the "from" snapshot. In this
case, the current code prints a confusing message about an IVset
guid mismatch.

This functionality is supported by non-raw receives as an
undocumented feature, but was never supported by the raw receive
code. If this is desired in the future, we can probably figure
out a way to make it work.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue #8758
Closes #8863
2019-06-11 10:45:17 -07:00
Eli Schwartz ba505f90d8 arc_summary: prefer python3 version and install when there is no python
This matches the behavior of other python scripts, such as arcstat and
dbufstat, which are always installed but whose install-exec-hook actions
will simply touch up the shebang if a python interpreter was configured
*and* that interpreter is a python2 interpreter.

Fixes installation in a minimal build chroot without python available.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@freqlabs.com>
Signed-off-by: Eli Schwartz <eschwartz@archlinux.org>
Closes #8851
2019-06-11 10:45:02 -07:00
Samuel VERSCHELDE eaa21b2349 Fix %post and %postun generation in kmodtool
During zfs-kmod RPM build, $(uname -r) gets unintentionally evaluated on
the build host, once and for all. It should be evaluated during the
execution of the scriptlets on the installation host. Escaping the $
character avoids evaluating it during build.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Samuel Verschelde <stormi-xcp@ylix.fr>
Closes #8866
2019-06-11 10:44:54 -07:00
Tom Caputi 8dc8bbde6e Reinstate raw receive check when truncating
This patch re-adds a check that was removed in 369aa50. The check
confirms that a raw receive is not occurring before truncating an
object's dn_maxblkid. At the time, it was believed that all cases
that would hit this code path would be handled in other places,
but that was not the case.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8852 
Closes #8857
2019-06-07 12:45:40 -07:00
Garrett Fields 9fd95a2f1b If $ZFS_BOOTFS contains guid, replace the guid portion with $pool
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Garrett Fields <ghfields@gmail.com>
Closes #8356
2019-06-07 12:45:40 -07:00
Tom Caputi 35050ef39e Fix integer overflow of ZTOI(zp)->i_generation
The ZFS on-disk format stores each inode's generation ID as a 64
bit number on disk and in-core. However, the Linux kernel's inode
generation is only a 32 bit number. In most places, the code handles this
correctly, but the cast is missing in zfs_rezget(). For many pools,
this isn't an issue since the generation ID is computed as the
current txg when the inode is created and many pools don't have
more than 2^32 txgs.

For the pools that have more txgs, this issue causes any inode with
a high enough generation number to report IO errors after a call to
"zfs rollback" while holding the file or directory open. This patch
simply adds the missing cast.
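
Why the cast matters can be sketched with plain integers. The helper
names below are hypothetical and do not appear in zfs_rezget(); they
compare a 64 bit stored generation against a 32 bit in-core copy with
and without truncating first:

```c
#include <assert.h>
#include <stdint.h>

/* Without the cast the 32 bit value is widened, so high bits mismatch. */
static int
gen_matches_nocast(uint64_t disk_gen, uint32_t inode_gen)
{
	return (disk_gen == inode_gen);
}

/* With the cast both sides are compared as 32 bit values. */
static int
gen_matches_cast(uint64_t disk_gen, uint32_t inode_gen)
{
	return ((uint32_t)disk_gen == inode_gen);
}

/* Usage: a generation just past 2^32 only matches with the cast. */
void
generation_example(void)
{
	uint64_t disk_gen = (1ULL << 32) + 5;
	uint32_t inode_gen = (uint32_t)disk_gen;	/* truncates to 5 */

	assert(!gen_matches_nocast(disk_gen, inode_gen));
	assert(gen_matches_cast(disk_gen, inode_gen));
}
```

For generations below 2^32 both forms agree, which is consistent with
the issue only showing up on pools with many txgs.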

Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8858
2019-06-07 12:45:40 -07:00
Don Brady 5108d27aec hkdf_test binary should only have one icp instance
The build for test binary hkdf_test was linking both against libicp 
and libzpool. This results in two instances of libicp inside the 
binary but the call to icp_init() only initializes one of them!

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #8850
2019-06-07 12:45:40 -07:00
Peter Wirdemo 02010e9c2c Fixed a small typo in man/man1/raidz_test.1
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Peter Wirdemo <peter.wirdemo@gmail.com>
Closes #8855
2019-06-07 12:45:40 -07:00
Torsten Wörtwein a0bf24952d Allow TRIM_UNUSED_KSYM when built as a builtin-module
If ZFS is built with enable_linux_builtin, it seems to be possible
to compile the kernel with TRIM_UNUSED_KSYM.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Torsten Wörtwein <twoertwein@gmail.com>
Closes #8820
2019-06-07 12:45:40 -07:00
Ryan Moeller d6920fb996 Make Python detection optional and more portable
Previously, --without-python would cause ./configure to fail. Now it is
able to proceed, and the Python scripts will not be built.

Use portable parameter expansion matching instead of nonstandard
substring matching to detect the Python version.  This test is
duplicated in several places, so define a function for it.

Don't assume the full path to binaries, since different platforms do
install things in different places.  Use AC_CHECK_PROGS instead.

When building without Python, also build without pyzfs.

Sponsored by: iXsystems, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Eli Schwartz <eschwartz93@gmail.com>
Signed-off-by: Ryan Moeller <ryan@freqlabs.com>
Closes #8809 
Closes #8731
2019-06-07 12:45:40 -07:00
DeHackEd 58b2de6420 Wait in 'S' state when send/recv pipe is blocking
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes #8733 
Closes #8752
2019-06-07 12:45:40 -07:00
TulsiJain 11ad06d1d8 Make zfs_async_block_max_blocks handle zero correctly
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: TulsiJain <tulsi.jain@delphix.com>
Closes #8829
Closes #8289
2019-06-07 12:45:40 -07:00
Brian Behlendorf 4f8eef29e0 Revert "Report holes when there are only metadata changes"
This reverts commit ec4f9b8f30 which introduced a narrow race which
can lead to lseek(, SEEK_DATA) incorrectly returning ENXIO.  Resolve
the issue by reverting this change to restore the previous behavior
which depends solely on checking the dirty list.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8816 
Closes #8834
2019-06-07 12:45:40 -07:00
Tomohiro Kusumi 94866d8309 Add link count test for root inode
Add tests for
97aa3ba44("Fix link count of root inode when snapdir is visible")
as suggested in #8727.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #8732
2019-06-07 12:45:40 -07:00
Brian Behlendorf a1eaf0dde0 Exclude log device ashift from normal class
When opening a log device during import its allocation bias will
not yet have been set by vdev_load().  This results in the log
device's ashift being incorrectly applied to the maximum ashift
of the vdevs in the normal class.  Which in turn prevents the
removal of any top-level devices due to the ashift check in the
spa_vdev_remove_top_check() function.

This issue is resolved by including vdev_islog in the check since
it will be set correctly during vdev_open().

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8735
2019-06-07 12:45:40 -07:00
madz 580256045b Fix integer overflow in get_next_chunk()
dn->dn_datablksz type is uint32_t and needs to be cast to uint64_t
to avoid an overflow when the record size is greater than 4 MiB.
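
A minimal sketch of this class of overflow, assuming the block size is
multiplied by a count of 1024 block pointers. That multiplier and the
function names are illustrative assumptions, not the exact expression
in get_next_chunk():

```c
#include <assert.h>
#include <stdint.h>

/* 32 bit multiply: the product wraps before it is widened to 64 bits. */
static uint64_t
range_nocast(uint32_t blksz, uint32_t nblkptrs)
{
	return (blksz * nblkptrs);
}

/* Widening one operand first makes the multiply happen in 64 bits. */
static uint64_t
range_cast(uint32_t blksz, uint32_t nblkptrs)
{
	return ((uint64_t)blksz * nblkptrs);
}

/* Usage: an 8 MiB block size times 1024 does not fit in 32 bits. */
void
range_example(void)
{
	uint32_t blksz = 8u << 20;

	assert(range_nocast(blksz, 1024) == 0);
	assert(range_cast(blksz, 1024) == (1ULL << 33));
}
```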

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olivier Mazouffre <olivier.mazouffre@ims-bordeaux.fr>
Closes #8778 
Closes #8797
2019-06-07 12:45:40 -07:00
loli10K aaf3b30dcf Double-free of encryption wrapping key due to invalid pool properties
This commit fixes a double-free in zfs_ioc_pool_create() triggered by
specifying an unsupported combination of properties when creating a pool
with encryption enabled.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8791
2019-06-07 12:45:40 -07:00
Stoiko Ivanov 27b446f799 tests: fix cosmetic permission issues during make install
files in dist_*_SCRIPTS get installed with 0755, those in dist_*_DATA
with 0644. This commit moves all .kshlib, .shlib and .cfg files in the
testsuite to dist_pkgdata_DATA, and removes the shebang from
zpool_import.kshlib.

This ensures that the files are installed with appropriate permissions
and silences some warnings from lintian.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Stoiko Ivanov <s.ivanov@proxmox.com>
Closes #8803
2019-06-07 12:45:40 -07:00
Stoiko Ivanov 0c6206e7f1 test-runner.py: change shebang to python3
In commit 6e72a5b9b6 python scripts which
work with python2 and python3 changed the shebang from /usr/bin/python
to /usr/bin/python3. This gets adapted by the build-system on systems
which don't provide python3.
This commit changes test-runner.py to also use /usr/bin/python3,
enabling the change during buildtime and fixing a minor lintian issue
for those Debian packages, which depend on a specific python version
(python3/python2).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Stoiko Ivanov <s.ivanov@proxmox.com>
Closes #8803
2019-06-07 12:45:40 -07:00
loli10K 51de7ccb42 Endless loop in zpool_do_remove() on platforms with unsigned char
On systems where "char" is an unsigned type the value returned by
getopt() will never be negative (-1), leading to an endless loop:
this issue prevents both 'zpool remove' and 'zstreamdump' from
working on some systems.
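
The failure mode can be sketched without calling getopt() at all. The
helpers below are hypothetical; they use an explicit unsigned char so
the behavior is the same on every platform, and they model storing
getopt()'s int result before testing it against -1:

```c
#include <assert.h>

/* Result stored in an unsigned char: -1 becomes 255 and never matches. */
static int
saw_end_uchar(int getopt_result)
{
	int end = -1;
	unsigned char c = (unsigned char)getopt_result;

	return (c == end);
}

/* Result stored in an int, the type getopt() returns: -1 is preserved. */
static int
saw_end_int(int getopt_result)
{
	int end = -1;
	int c = getopt_result;

	return (c == end);
}

/* Usage: only the int variable can observe the end-of-options value. */
void
getopt_example(void)
{
	assert(!saw_end_uchar(-1));
	assert(saw_end_int(-1));
}
```

A loop whose exit condition is the first form never exits, which
matches the endless loop described above.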

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8789
2019-06-07 12:45:40 -07:00
Tom Caputi 69ae34076f Fix embedded bp accounting in count_block()
Currently, count_block() does not correctly account for the
possibility that the bp that is passed to it could be embedded.
These blocks shouldn't be counted since the work of scanning
these blocks is already handled when the containing block is
scanned. This patch simply resolves this issue by returning
early in this case.

Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Authored-by: Bill Sommerfeld <sommerfeld@alum.mit.edu>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8800 
Closes #8766
2019-06-07 12:39:13 -07:00
Tom Caputi b746c397e3 Disable parallel processing for 'zfs mount -l'
Currently, 'zfs mount -a' will always attempt to parallelize
work related to mounting as best it can. Unfortunately, when
the user passes the '-l' option to load keys, this causes
all threads to prompt the user for their keys at once,
resulting in a confusing and racy user experience. This patch
simply disables parallel mounting when using the '-l' flag.

Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8762 
Closes #8811
2019-06-07 12:39:13 -07:00
Tomohiro Kusumi 2fb37bcadd Linux 5.2 compat: Directly call wait_on_page_bit()
wait_on_page_writeback() was made GPL only in torvalds/linux@19343b5bdd.

Directly call wait_on_page_bit() without using wait_on_page_writeback()
interface, given zfs_putpage() is the only caller for now.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #8794
2019-06-07 12:39:13 -07:00
Tomohiro Kusumi a727f69e52 Linux 5.2 compat: Fix config/kernel-shrink.m4 test failure
"whether ->count_objects callback exists" test failed with
"error: error" message for using an incomplete function shrinker_cb().

This is caused by torvalds/linux@83da1bed86. It's configurable,
but we would want to be able to compile with default kbuild setting.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #8776
2019-06-07 12:39:13 -07:00
Tomohiro Kusumi 8ec352be1f Linux 5.2 compat: Remove config/kernel-set-fs-pwd.m4
This failed on 5.2-rc1 with "error: unknown" message, for set_fs_pwd()
not being visible in both const and non-const tests.

This is caused by torvalds/linux@83da1bed86. It's configurable,
but we would want to be able to compile with default kbuild setting.

set_fs_pwd() has never been exported with exception of some distro
kernels, and set_fs_pwd() wasn't used in ZoL to begin with. The test
result was used for a spl function vn_set_fs_pwd().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #8777
2019-06-07 12:39:13 -07:00
loli10K df717bb835 zpool: status -t is not documented in help message
This commit adds the undocumented "-t" option to zpool(8) help message.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8782
2019-06-07 12:39:13 -07:00
loli10K e0b3689ed5 zfs-tests: fix warnings when packaging some .shlib files
This change prevents the following warning when packaging some zfs-tests
files:

   *** WARNING: ./usr/src/zfs-0.8.0/tests/zfs-tests/include/zpool_script.shlib
   is executable but has empty or no shebang, removing executable bit

Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8787
2019-06-07 12:39:13 -07:00
loli10K 438275c9a0 VERIFY3P() message is missing a space character
This commit just reintroduces a [space] character inadvertently removed
in a887d653.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8786
2019-06-07 12:39:13 -07:00
loli10K 8cfa6d4a1c zfs-tests: verify zfs(8) and zpool(8) help message is under 80 columns
This commit updates the ZFS Test Suite to detect incorrect wrapping of
both zfs(8) and zpool(8) help messages.

Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8785
2019-06-07 12:39:13 -07:00
loli10K ad0157ec91 zfs: don't pretty-print objsetid property
The objsetid property, while being stored as a number, is a dataset
identifier and should not be pretty-printed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8784
2019-06-07 12:39:13 -07:00
loli10K cd75d5f710 zfs: missing newline character in zfs_do_channel_program() error message
This commit simply adds a missing newline ("\n") character to the error
message printed by the zfs command when the provided pool parameter
can't be found.

Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8783
2019-06-07 12:39:13 -07:00
siv0 c6bbacebc8 Fix ksh-path for random_readwrite_fixed.ksh
The test in zfs-tests/tests/perf/regression/random_readwrite_fixed.ksh
is the only file to use /usr/bin/ksh in the shebang.
Change it to /bin/ksh for consistency.

Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Stoiko Ivanov <s.ivanov@proxmox.com>
Closes #8779
2019-06-07 12:39:13 -07:00
Tomohiro Kusumi 4d7cb872e8 Linux 2.6.39 compat: Test if kstrtoul() exists
kstrtoul() exists only after torvalds/linux@33ee3b2e2e in 2.6.39.
Use strict_strtoul() if kstrtoul() doesn't exist.
Note that strict_strtoul() has existed as an alias for kstrtoul()
for a while, but was removed in torvalds/linux@3db2e9cdc0.

It looks like RHEL6 (2.6.32 based) has backported kstrtoul(),
and this caused build CI to pass compilation test.
It should fail on vanilla < 2.6.39 kernels or distro kernels without
backport as reported in #8760.

--
 # grep "kstrtoul(" /lib/modules/2.6.32-754.12.1.el6.x86_64/build/ \
 include/linux/kernel.h >/dev/null
 # echo $?
 0

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8760 
Closes #8761
2019-06-07 12:39:13 -07:00
loli10K f91e7e6284 Device removal panics on 32-bit systems
The issue is caused by an incorrect usage of the sizeof() operator
in vdev_obsolete_sm_object(): on 64-bit systems this is not an issue
since both "uint64_t" and "uint64_t*" are 8 bytes in size. However on
32-bit systems pointers are 4 bytes long which is not supported by
zap_lookup_impl(). Trying to remove a top-level vdev on a 32-bit system
will cause the following failure:

VERIFY3(0 == vdev_obsolete_sm_object(vd, &obsolete_sm_object)) failed (0 == 22)
PANIC at vdev_indirect.c:833:vdev_indirect_sync_obsolete()
Showing stack for process 1315
CPU: 6 PID: 1315 Comm: txg_sync Tainted: P           OE   4.4.69+ #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
 c1abc6e7 0ae10898 00000286 d4ac3bc0 c14397bc da4cd7d8 d4ac3bf0 d4ac3bd0
 d790e7ce d7911cc1 00000523 d4ac3d00 d790e7d7 d7911ce4 da4cd7d8 00000341
 da4ce664 da4cd8c0 da33fa6e 49524556 28335946 3d3d2030 65647620 626f5f76
Call Trace:
 [<>] dump_stack+0x58/0x7c
 [<>] spl_dumpstack+0x23/0x27 [spl]
 [<>] spl_panic.cold.0+0x5/0x41 [spl]
 [<>] ? dbuf_rele+0x3e/0x90 [zfs]
 [<>] ? zap_lookup_norm+0xbe/0xe0 [zfs]
 [<>] ? zap_lookup+0x57/0x70 [zfs]
 [<>] ? vdev_obsolete_sm_object+0x102/0x12b [zfs]
 [<>] vdev_indirect_sync_obsolete+0x3e1/0x64d [zfs]
 [<>] ? txg_verify+0x1d/0x160 [zfs]
 [<>] ? dmu_tx_create_dd+0x80/0xc0 [zfs]
 [<>] vdev_sync+0xbf/0x550 [zfs]
 [<>] ? mutex_lock+0x10/0x30
 [<>] ? txg_list_remove+0x9f/0x1a0 [zfs]
 [<>] ? zap_contains+0x4d/0x70 [zfs]
 [<>] spa_sync+0x9f1/0x1b10 [zfs]
 ...
 [<>] ? kthread_stop+0x110/0x110

This commit simply corrects the "integer_size" parameter used to lookup
the vdev's ZAP object.
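
The sizeof() pitfall can be sketched on its own. The helper names are
hypothetical and the functions are not the code in
vdev_obsolete_sm_object(); they only contrast the size of a pointer
with the size of the integer it points to:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Size of the pointer itself: 8 on 64-bit systems, 4 on 32-bit ones. */
static size_t
integer_size_wrong(const uint64_t *objp)
{
	return (sizeof (objp));
}

/* Size of the pointed-to integer: 8 regardless of pointer width. */
static size_t
integer_size_right(const uint64_t *objp)
{
	return (sizeof (*objp));
}

/* Usage: only the second form is independent of the platform. */
void
sizeof_example(void)
{
	uint64_t obj = 0;

	assert(integer_size_right(&obj) == 8);
	assert(integer_size_wrong(&obj) == sizeof (uint64_t *));
}
```

On a 64-bit build both helpers return 8, which is why the mistake went
unnoticed there.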

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8790
2019-06-07 12:39:13 -07:00
loli10K abe267f677 zpool: trim -p is not a valid option
This commit removes the documented but not handled "-p" option from
zpool(8) help message.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8781
2019-06-07 12:39:13 -07:00
loli10K cc434dcf45 Fix coverity defects: CID 186143
CID 186143: Memory - illegal accesses (USE_AFTER_FREE)

This patch fixes a use-after-free in spa_import_progress_destroy()
by moving the kmem_free() call to the end of the function.
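
The ordering rule can be sketched in userspace with malloc() and
free(). The struct and function names below are hypothetical and do not
mirror spa_import_progress_destroy(); the point is only that the
containing structure is released last:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a structure that owns another allocation. */
struct progress {
	char *name;
};

static struct progress *
progress_create(const char *name)
{
	size_t len = strlen(name) + 1;
	struct progress *p = malloc(sizeof (*p));

	if (p == NULL)
		return (NULL);
	p->name = malloc(len);
	if (p->name == NULL) {
		free(p);
		return (NULL);
	}
	memcpy(p->name, name, len);
	return (p);
}

/* Release the members first; free the containing struct last. */
static void
progress_destroy(struct progress *p)
{
	assert(p != NULL);
	free(p->name);		/* p is still valid here */
	p->name = NULL;
	free(p);		/* last use of p */
}
```

Freeing p before reading p->name would be the use-after-free pattern
that the static analyzer reports.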

Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8788
2019-06-07 12:39:13 -07:00
Igor K e2e7b0a2cd Rename reservation tests from *.sh to *.ksh
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes #8729
2019-06-07 12:39:13 -07:00
Richard Elling 78fac8d925 Fix kstat state update during pool transition
When reading kstats, the health (aka state) of the pool is stored into
/proc/spl/kstat/zfs/POOLNAME/state via spa_state_to_name().
However, during import/export there is a case where the spa exists,
but the root vdev does not exist. This fix checks that case and sets
the state to "TRANSITIONING".

Unfortunately, it is not easy to reproduce a test for this. It was
detected randomly during ZTS runs while kstats were also being sampled
regularly. After this change, further testing did not trip on the case
and the TRANSITIONING state was collected at least once by the kstats.

For posterity, the backtrace prior to this fix is:
[Mon May 13 17:21:00 2019] RIP: 0010:spa_state_to_name+0x10/0xb0 [zfs]
...
[Mon May 13 17:21:00 2019] Call Trace:
[Mon May 13 17:21:00 2019]  spa_state_data+0x1a/0x40 [zfs]
[Mon May 13 17:21:00 2019]  kstat_seq_show+0x117/0x440 [spl]
[Mon May 13 17:21:00 2019]  seq_read+0xe5/0x430
[Mon May 13 17:21:00 2019]  proc_reg_read+0x45/0x70
[Mon May 13 17:21:00 2019]  __vfs_read+0x1b/0x40
[Mon May 13 17:21:00 2019]  vfs_read+0x8e/0x130
[Mon May 13 17:21:00 2019]  SyS_read+0x55/0xc0
[Mon May 13 17:21:00 2019]  ? SyS_fcntl+0x5d/0xb0
[Mon May 13 17:21:00 2019]  do_syscall_64+0x73/0x130
[Mon May 13 17:21:00 2019]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com>
Closes #8746
2019-05-23 14:28:53 -07:00
Brian Behlendorf bff2361aeb Linux 5.2 compat: rw_tryupgrade()
Commit torvalds/linux@46ad0840b has removed the architecture specific
rwsem source and headers leaving only the generic version.  As part
of this change the RWSEM_ACTIVE_READ_BIAS and RWSEM_ACTIVE_WRITE_BIAS
macros were moved to the private kernel/locking/rwsem.h header.
This results in a build failure because these macros were required
to implement the rw_tryupgrade() compatibility function.

In practice, this isn't a major problem because there are only a
few consumers of rw_tryupgrade() and because consumers of rw_tryupgrade
should be written to retry using rw_enter(RW_WRITER).

After auditing all of the callers only dmu_zfetch() was determined
not to perform a retry.  It has been updated in this commit to
resolve this issue.
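
To illustrate why callers must be prepared to retry, here is a minimal
user-space sketch of a try-upgrade on a toy lock word using C11 atomics.
It is not the SPL implementation; the lock encoding, SKETCH_WRITER, and
sketch_tryupgrade() are assumptions made for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Hypothetical toy lock word: 0 = unlocked, N > 0 = N readers,
 * SKETCH_WRITER = held exclusively.
 */
#define	SKETCH_WRITER	(-1)

/* Succeeds only when the caller is the sole reader. */
bool
sketch_tryupgrade(atomic_int *lock)
{
	int expected = 1;

	return (atomic_compare_exchange_strong(lock, &expected,
	    SKETCH_WRITER));
}

/*
 * Caller pattern expected by the commit message, in kernel terms:
 *
 *	if (!rw_tryupgrade(&lock)) {
 *		rw_exit(&lock);
 *		rw_enter(&lock, RW_WRITER);
 *		(re-validate any state read under the reader lock)
 *	}
 */
```

Because the upgrade can fail whenever another reader is present, a caller
that does not fall back to rw_enter(RW_WRITER) may proceed without the
exclusive lock it needs.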

That said, the rw_tryupgrade() functionality should be considered
for possible removal in a future release due to the difficulty
in supporting the interface.

Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8730
2019-05-23 13:46:33 -07:00
Brian Behlendorf e34c3ee2fc Tag 0.8.0
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2019-05-21 11:11:41 -07:00
Ryan Moeller 9dc41a769d Fix wrong assertion in libzfs diff error handling
In compare(), all error cases set the error code to EPIPE, so when an
error is set, the correct assertion to make is that the error is EPIPE,
not EINVAL.

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@freqlabs.com>
Closes #8743
2019-05-19 17:31:54 -07:00
Paul Dagnelie e61b53475e Fix incorrect assertion in dnode_dirty_l1range
The db_dirtycnt of an EVICTING dbuf is always 0. However, it still 
appears in the dn_dbufs tree. If we call dnode_dirty_l1range on a 
range that contains an EVICTING dbuf, we will attempt to mark it dirty 
(which will fail because it's EVICTING, resulting in a new dbuf being 
created and dirtied). Later, in ZFS_DEBUG mode, we assert that all the 
dbufs in the range are dirty. If the EVICTING dbuf is still present, 
this will trip the assertion erroneously.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Sara Hartse <sara.hartse@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #8745
2019-05-19 17:30:33 -07:00
Brian Behlendorf f378f42b53 Tag 0.8.0-rc5
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2019-05-09 10:34:05 -07:00
Olaf Faaland ca95f70dff zpool import progress kstat
When an import requires a long MMP activity check, or when the user
requests pool recovery, the import may take a long time.  The user may
not know why, or be able to tell whether the import is progressing or is
hung.

Add a kstat which lists all imports currently being processed by the
kernel (currently only one at a time is possible, but the kstat allows
for more than one).  The kstat is /proc/spl/kstat/zfs/import_progress.

The kstat contents are as follows:
pool_guid         load_state multihost_secs  max_txg pool_name
16667015954387398 3          15              0       tank3

load_state: the value of spa_load_state
multihost_secs:  seconds until the end of the multihost activity
                 check; if over, or none required, this is 0
max_txg: current spa_load_max_txg, if rewind is occurring

This could be used by outside tools, such as a pacemaker resource agent,
to report import progress, or as a part of manual troubleshooting.  The
zpool import subcommand could also be modified to report this
information.
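
As a rough illustration of how an outside tool might consume this kstat,
the sketch below parses a single data row in the column order shown above.
The struct and function names are hypothetical, and the sketch assumes the
caller has already skipped any header lines in the file.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

/* Hypothetical container for one row of import_progress. */
typedef struct {
	uint64_t pool_guid;
	uint64_t load_state;
	uint64_t multihost_secs;
	uint64_t max_txg;
	char pool_name[256];
} import_row_t;

/* Parse one data row; returns 0 on success, -1 if the row is malformed. */
int
parse_import_row(const char *line, import_row_t *row)
{
	assert(line != NULL && row != NULL);
	/* The pool name is the last column, so take the rest of the line. */
	int n = sscanf(line,
	    "%" SCNu64 " %" SCNu64 " %" SCNu64 " %" SCNu64 " %255[^\n]",
	    &row->pool_guid, &row->load_state, &row->multihost_secs,
	    &row->max_txg, row->pool_name);
	return (n == 5 ? 0 : -1);
}
```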

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #8696
2019-05-09 10:08:05 -07:00
Tomohiro Kusumi b689de85e8 Add missing trailing '\n' in printk() messages
These messages will want '\n' like any other regular printk() messages.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #8726
2019-05-08 16:43:55 -07:00
Alexander Motin 4bb40c1c82 Fix dataset name comparison in zfs_compare()
The code never reported a match when comparing two datasets (not
snapshots).  As a result, uu_avl_find(), called from zfs_callback(), never
succeeded, allowing the same dataset to be added to the list multiple
times, for example:

	# zfs get name pers pers pers@z pers@z
	NAME    PROPERTY  VALUE   SOURCE
	pers    name      pers    -
	pers    name      pers    -
	pers@z  name      pers@z  -

With the patch:

	# zfs get name pers pers pers@z pers@z
	NAME    PROPERTY  VALUE   SOURCE
	pers    name      pers    -
	pers@z  name      pers@z  -
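
The property the fix restores (two identical plain dataset names must
compare equal) can be shown with a small stand-alone comparator. This is
only a sketch under assumptions: sketch_name_compare() is hypothetical,
and the real zfs_compare() works on zfs handles and differs in detail.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical comparator: orders by the name before '@', then places a
 * dataset before its own snapshots, then orders snapshots by name.
 */
int
sketch_name_compare(const char *l, const char *r)
{
	assert(l != NULL && r != NULL);
	size_t llen = strcspn(l, "@");	/* length of the part before '@' */
	size_t rlen = strcspn(r, "@");
	size_t min = llen < rlen ? llen : rlen;
	int ret = strncmp(l, r, min);

	if (ret != 0)
		return (ret < 0 ? -1 : 1);
	if (llen != rlen)
		return (llen < rlen ? -1 : 1);
	/* Same dataset: two plain names must compare equal. */
	if (l[llen] == '\0' && r[rlen] == '\0')
		return (0);
	if (l[llen] == '\0')
		return (-1);
	if (r[rlen] == '\0')
		return (1);
	ret = strcmp(l + llen, r + rlen);
	return (ret < 0 ? -1 : (ret > 0 ? 1 : 0));
}
```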

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #8723
2019-05-08 16:42:39 -07:00
Tomohiro Kusumi 97aa3ba44f Fix link count of root inode when snapdir is visible
Given how zfs_getattr() is implemented, zfs_getattr_fast() (used by
->getattr() of zpl inodes) also needs to consider an additional link
count if "snapdir" property is set to "visible".

Without this, the number of directories in the root inode of each dataset
doesn't match the link count when snapdir is visible.

Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #8727
2019-05-08 16:40:51 -07:00
JMoVS e2dddb7e58 Fix typesetting of Errata #4
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Justin Scholz <git@justinscholz.de>
Closes #8712 
Closes #8721
2019-05-08 16:04:45 -07:00
Richard Laager 3b770842d5 Add zol2zfs-patch.sed to Makefile.am and sort
In adding man-dates.sh, I noticed that zol2zfs-patch.sed was missing,
even though zfs2zol-patch.sed was present.  Also, the list was not
sorted, so I sorted it.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8710
2019-05-08 11:00:11 -07:00
Richard Laager e7ce9759ac Correct man page dates
Various changes (many by me) have been made to the man pages without
bumping their dates.  I have now corrected them based on the last commit
to each file.  I also added the script I used to make these changes.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8710
2019-05-08 10:59:32 -07:00
Brian Behlendorf a20f43b51b Linux 5.0 compat: ASM_BUG macro
The 5.0 kernel defines the macro ASM_BUG.  In order to prevent a
conflict and build failure rename ASM_BUG to ZFS_ASM_BUG.  This
is currently only an issue on aarch64 but all instances of
ASM_BUG were renamed to avoid any future conflict on x86_64.

Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8725 
Issue #8545
2019-05-08 10:18:40 -07:00
Brian Behlendorf 515ddf6504 Fix errant EFAULT during writes (#8719)
Commit 98bb45e resolved a deadlock which could occur when
handling a page fault in zfs_write().  This change added
the uio_fault_disable field to the uio structure but failed
to initialize it to B_FALSE.  This uninitialized field would
cause uiomove_iov() to call __copy_from_user_inatomic()
instead of copy_from_user() resulting in unexpected EFAULTs.

Resolve the issue by fully initializing the uio, and clearing
the uio_fault_disable flag after it's used in zfs_write().

Additionally, reorder the uio_t field assignments to match
the order the fields are declared in the structure.

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8640 
Closes #8719
2019-05-08 10:04:04 -07:00
DeHackEd 1f02ecc5a5 Make zfs_special_class_metadata_reserve_pct into a parameter
Exported and documented a new module parameter.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes #8706
2019-05-07 15:34:42 -07:00
Brian Behlendorf caf9dd209f Fix send/recv lost spill block
When receiving a DRR_OBJECT record the receive_object() function
needs to determine how to handle a spill block associated with the
object.  It may need to be removed or kept depending on how the
object was modified at the source.

This determination is currently accomplished using a heuristic which
takes into account the DRR_OBJECT record and the existing object
properties.  This is a problem because there isn't quite enough
information available to do the right thing under all circumstances.
For example, when only the block size changes the spill block is
removed when it should be kept.

What's needed to resolve this is an additional flag in the DRR_OBJECT
which indicates if the object being received references a spill block.
The DRR_OBJECT_SPILL flag was added for this purpose.  When set then
the object references a spill block and it must be kept.  Either
it is up to date, or it will be replaced by a subsequent DRR_SPILL
record.  Conversely, if the object being received doesn't reference
a spill block then any existing spill block should always be removed.
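
The rule in the paragraph above reduces to a single flag test. The sketch
below is illustrative only: the bit value of SKETCH_DRR_OBJECT_SPILL and
the function name are assumptions, not taken from the actual send stream
format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit value, chosen for illustration only. */
#define	SKETCH_DRR_OBJECT_SPILL	(1U << 0)

/*
 * Decide whether an existing spill block must be removed on receive.
 * Flag set: the object references a spill block, so keep it.
 * Flag clear: the object has no spill block, so remove any existing one.
 */
bool
sketch_should_remove_spill(uint32_t drr_flags, bool have_existing_spill)
{
	if (!have_existing_spill)
		return (false);
	return ((drr_flags & SKETCH_DRR_OBJECT_SPILL) == 0);
}
```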

Since previous versions of ZFS do not understand this new flag
additional DRR_SPILL records will be inserted into the stream.
This has the advantage of being fully backward compatible.  Existing
ZFS systems receiving this stream will recreate the spill block if
it was incorrectly removed.  Updated ZFS versions will correctly
ignore the additional spill blocks which can be identified by
checking for the DRR_SPILL_UNMODIFIED flag.

The small downside to this approach is that it may increase the size
of the stream and of the received snapshot on previous versions of
ZFS.  Additionally, when receiving streams generated by previous
unpatched versions of ZFS spill blocks may still be lost.

OpenZFS-issue: https://www.illumos.org/issues/9952
FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=233277

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8668
2019-05-07 15:18:44 -07:00
Tomohiro Kusumi 9c53e51616 Fix zfs set atime|relatime=off|on behavior on inherited datasets
`zfs set atime|relatime=off|on` doesn't disable or enable the property
on read for datasets whose property was inherited from parent, until
a dataset is once unmounted and mounted again.

(The properties start to work properly if a dataset is once unmounted
and mounted again. The difference comes from regular mount process,
e.g. via zpool import, uses mount options based on properties read
from ondisk layout for each dataset, whereas
`zfs set atime|relatime=off|on` just remounts a specified dataset.)

--
 # zpool create p1 <device>
 # zfs create p1/f1
 # zfs set atime=off p1
 # echo test > /p1/f1/test
 # sync
 # zfs list
 NAME    USED  AVAIL     REFER  MOUNTPOINT
 p1      176K  18.9G     25.5K  /p1
 p1/f1    26K  18.9G       26K  /p1/f1
 # zfs get atime
 NAME   PROPERTY  VALUE  SOURCE
 p1     atime     off    local
 p1/f1  atime     off    inherited from p1
 # stat /p1/f1/test | grep Access | tail -1
 Access: 2019-04-26 23:32:33.741205192 +0900
 # cat /p1/f1/test
 test
 # stat /p1/f1/test | grep Access | tail -1
 Access: 2019-04-26 23:32:50.173231861 +0900
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ changed by read(2)
--

The problem is that zfsvfs::z_atime which was probably intended to keep
incore atime state just gets updated by a callback function of "atime"
property change, atime_changed_cb(), and never used for anything else.

Since all file reads and atime updates now use a common function
zpl_iter_read_common() -> file_accessed(), and whether to update atime
via ->dirty_inode() is determined by atime_needs_update(),
atime_needs_update() needs to return false once atime is turned off.
It currently continues to return true on `zfs set atime=off`.

Fix atime_changed_cb() by setting or dropping SB_NOATIME in VFS super
block depending on a new atime value, so that atime_needs_update() works
as expected after property change.

The same problem applies to "relatime" except that a self contained
relatime test is needed. This is because relatime_need_update() is based
on a mount option flag MNT_RELATIME, which doesn't exist in datasets
with inherited "relatime" property via `zfs set relatime=...`, hence it
needs its own relatime test zfs_relatime_need_update().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8674 
Closes #8675
2019-05-07 10:06:30 -07:00
Tomohiro Kusumi 75346937de Linux 5.1 compat: Drop ULLONG_MAX and LLONG_MAX definitions
Linux kernel commit 54d50897d544c874562253e2a8f70dfcad22afe8
"linux/kernel.h: split *_MAX and *_MIN macros into <linux/limits.h>"

which first appeared in 5.1 has moved several macros from
<linux/kernel.h> to <linux/limits.h>. This broke compilation due to
header inclusion order against the local header include/spl/sys/types.h
which also defines ULLONG_MAX and LLONG_MAX if undefined.

It looks like local ULLONG_MAX and LLONG_MAX were never needed
(or after spl integration ?) as <linux/kernel.h> has had the same
definitions since an upstream commit
111ebb6e6f7bd7de6d722c5848e95621f43700d9 in 2.6.18, so drop them.
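
As a user-space sanity check of why the local fallback is redundant, the
sketch below compares the expression quoted in the error output with the
C library's LLONG_MAX. SKETCH_LLONG_MAX and sketch_limits_match() are
hypothetical names; the kernel headers themselves are not involved here.

```c
#include <assert.h>
#include <limits.h>

/* The fallback expression from the error output, renamed to avoid a clash. */
#define	SKETCH_LLONG_MAX	((long long)(~0ULL >> 1))

/* Returns 1 when the fallback matches the value provided by <limits.h>. */
int
sketch_limits_match(void)
{
	return (SKETCH_LLONG_MAX == LLONG_MAX);
}
```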

--
linux/include/linux/limits.h:17: error: "LLONG_MAX" redefined [-Werror]
 #define LLONG_MAX ((long long)(~0ULL >> 1))
zfs/include/spl/sys/types.h:35: note: this is the location of the previous definition
 #define LLONG_MAX  ((long long)(~0ULL>>1))

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8714
2019-05-07 09:55:40 -07:00
Richard Laager c6eaa14620 Cleanup special/dedup language
This standardizes the language on "deduplication tables" rather than
"dedup data" (which might be read as the data blocks rather than the
DDT).  Likewise, it standardizes on "small file blocks".  It also
standardizes on "normal" rather than using both "normal" and "general"
in the same paragraph. I also replaced "non-specified" with the more
explicit "non-dedup/special".

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8713
2019-05-07 09:51:09 -07:00
Antonio Russo 6aff30ad80 Fix zfs-mount-generator for datasets with spaces
Alternative implementation of @rlaager's original modification
of zfs-mount-generator fix, with @chrisrd's comments. Set
IFS to be only the tab character, matching our `-H` call in
`zfs list`, allowing spaces to appear in dataset names (and
mountpoints).

Also adds comments explaining our rationale.

Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com>
Closes #8708 
Closes #8718
2019-05-07 09:32:23 -07:00
Richard Laager bca06413f7 OpenZFS 10473 - zfs(1M) missing cross-reference to zfs-program(1M)
Authored by: Jason King <jason.king@joyent.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Andy Fiddaman <andy@omniosce.org>
Reviewed by: Peter Tribble <peter.tribble@gmail.com>
Reviewed by: Gergő Mihály Doma <domag02@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Richard Laager <rlaager@wiktel.com>

OpenZFS-issue: https://www.illumos.org/issues/10473
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/736e67003
Closes #8711
2019-05-06 10:37:18 -07:00
Tomohiro Kusumi a762893269 Fix typo/etc in module/zfs/zfs_ctldir.c
Drop duplicated phrases in comments.

Also drop an obsolete comment "Perform a mount of the associated...",
as all it does now is get objid from DMU and lookup incore inode.

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8707
2019-05-05 15:57:23 -07:00
Tomohiro Kusumi de3e0b914b Linux 5.0 compat: Use totalhigh_pages()
Linux kernel commit ca79b0c211af63fa3276f0e3fd7dd9ada2439839
"mm: convert totalram_pages and totalhigh_pages variables to atomic"

replaced `totalhigh_pages` with an inline function `totalhigh_pages()`.
This broke compilation on IA32, etc, as ZoL uses `totalhigh_pages`
on archs with highmem. Confirmed on Fedora 30 (5.0.9-301.fc30.i686).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8677
Closes #8701
2019-05-04 16:40:48 -07:00
John Gallagher 1eacf2b3b0 Improve rate at which new zvols are processed
The kernel function which adds new zvols as disks to the system,
add_disk(), briefly opens and closes the zvol as part of its work.
Closing a zvol involves waiting for two txgs to sync. This, combined
with the fact that the taskq processing new zvols is single threaded,
makes processing new zvols slow.

Waiting for these txgs to sync is only necessary if the zvol has been
written to, which is not the case during add_disk(). This change adds
tracking of whether a zvol has been written to so that we can skip the
txg_wait_synced() calls when they are unnecessary.
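
The tracking idea can be sketched as a dirty flag that gates the wait on
close. The structure and function names below are hypothetical, and the
counter merely stands in for the txg_wait_synced() calls; the real zvol
code is more involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified zvol state. */
typedef struct {
	bool written;	/* set by any write since the zvol was opened */
	int txg_waits;	/* counts simulated txg_wait_synced() calls */
} sketch_zvol_t;

void
sketch_zvol_write(sketch_zvol_t *zv)
{
	zv->written = true;
}

/* On last close, only wait for txgs to sync if something was written. */
void
sketch_zvol_last_close(sketch_zvol_t *zv)
{
	assert(zv != NULL);
	if (zv->written) {
		zv->txg_waits += 2;	/* stands in for waiting on two txgs */
		zv->written = false;
	}
}
```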

This change also fixes the flags passed to blkdev_get_by_path() by
vdev_disk_open() to be FMODE_READ | FMODE_WRITE | FMODE_EXCL instead of
just FMODE_EXCL. The flags were being incorrectly calculated because
we were using the wrong version of vdev_bdev_mode().

Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #8526 
Closes #8615
2019-05-04 16:39:10 -07:00
JMoVS b3b60984ee Clearer wording on Errata #4
Users of existing pools, especially pools with top-level encrypted 
datasets, could run into trouble trying to work around Errata #4. 
Clarify that removing encrypted snapshots and bookmarks is enough
to clear the errata.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Justin Scholz <git@justinscholz.de>
Closes #8682 
Closes #8683
2019-05-02 16:52:57 -07:00
Matthew Ahrens 8d9f616511 Reword comment in lz4_compress_zfs
The comment in lz4_compress_zfs could be more clear and specific.  It
also contains needlessly strong language.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes: #8702
Closes: #8703
2019-05-02 16:46:04 -07:00
Tom Caputi fa24166074 Add feature check for 'zpool resilver' command
The 'zpool resilver' command requires that the resilver_defer
feature is active on the pool. Unfortunately, the check for
this was left out of the original patch. This commit simply
corrects this so that the command properly returns an error
in this case.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8700
2019-05-02 16:42:31 -07:00
Tom Caputi 85bdc68401 Fix estimated scrub completion time
Currently, it is possible for the 'zpool scrub' command to
progress slightly beyond 100% due to concurrent changes
happening on the live pool. This behavior is expected, but
the userspace code for 'zpool status' would subtract the
expected amount of data from the amount of data already
scrubbed, resulting in a negative integer being cast to a
large positive one. This number was then used to calculate
the estimated completion time, resulting in wildly wrong
results. This code changes the behavior so that 'zpool status'
does not attempt to report an estimate during this period.
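
The underflow and the guard can be illustrated with unsigned 64-bit
arithmetic. This is a minimal sketch, assuming byte counts are held as
uint64_t; sketch_scrub_remaining() is a hypothetical helper, not the
'zpool status' code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compute the bytes left to scrub.  Returns false (no estimate) when the
 * scrub has already examined at least as much as was expected, since
 * total - examined would otherwise wrap to a huge unsigned value.
 */
bool
sketch_scrub_remaining(uint64_t total, uint64_t examined, uint64_t *remaining)
{
	assert(remaining != NULL);
	if (examined >= total)
		return (false);
	*remaining = total - examined;
	return (true);
}
```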

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8611 
Closes #8687
2019-05-01 17:34:24 -07:00
Matthew Ahrens 6bdefad311 Remove incorrect (and inappropriate) comment in dprintf_dnode
This comment seems to misunderstand the ## preprocessor token, which
does token concatenation.  It is not needed here, since we are
concatenating string literals, which is performed by putting the
literals next to each other.

Additionally, the comment uses offensive language.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8698
Closes #8699
2019-05-01 17:32:54 -07:00
Tomohiro Kusumi 2a15c00f89 Use sigaction(2) instead of sigignore(3) for portability
sigignore(3) isn't portable.
This code fails to compile on platforms without sigignore(3).
Use sigaction(2).
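
For reference, ignoring a signal with sigaction(2) looks roughly like the
sketch below. The wrapper name sketch_ignore_sigpipe() is hypothetical;
the actual change in zfs_main.c may be structured differently.

```c
#define	_POSIX_C_SOURCE	200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Ignore SIGPIPE portably; returns 0 on success, -1 on failure. */
int
sketch_ignore_sigpipe(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof (sa));
	sa.sa_handler = SIG_IGN;
	(void) sigemptyset(&sa.sa_mask);
	return (sigaction(SIGPIPE, &sa, NULL));
}
```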

--
zfs_main.c: In function 'zfs_do_diff':
zfs_main.c:7178:9: error: implicit declaration of function 'sigignore' [-Werror=implicit-function-declaration]
  (void) sigignore(SIGPIPE);
         ^~~~~~~~~

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8593
2019-04-30 20:46:15 -07:00
Tomohiro Kusumi 5b1443c47e Use sigaction(2) instead of sigset(3) for portability
sigset(3) isn't portable.
This code fails to compile on platforms without sigset(3).
Use sigaction(2).

--
largest_file.c: In function 'main':
largest_file.c:75:9: error: implicit declaration of function 'sigset'; did you mean 'sigvec'? [-Werror=implicit-function-declaration]
  (void) sigset(SIGXFSZ, sigxfsz);
         ^~~~~~
         sigvec

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8593
2019-04-30 20:45:17 -07:00
Tomohiro Kusumi f0ce0436aa Correct snprintf() size argument
The size argument of snprintf(3) in glibc and snprintf() in Linux
kernel includes trailing \0, as snprintf(3) man page explains it as
"write at most size bytes (including the trailing null byte ('\0'))",
i.e. snprintf() can just take buffer size.

e.g. For snprintf() in module/zfs/zfs_ctldir.c, a buffer size is
MAXPATHLEN, and a caller is passing MAXPATHLEN to snprintf(), so size
should just be `path_len` to do what the caller is trying to do.
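
A small user-space sketch of the point: snprintf() is given the full
buffer size and still NUL-terminates, and its return value reveals
truncation. sketch_build_path() is a hypothetical helper, not the code in
zfs_ctldir.c.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format "dir/name" into buf, passing the whole buffer size to snprintf().
 * Returns 0 on success, -1 if the result did not fit (buf is still
 * NUL-terminated in that case).
 */
int
sketch_build_path(char *buf, size_t buflen, const char *dir, const char *name)
{
	assert(buf != NULL && buflen > 0);
	int n = snprintf(buf, buflen, "%s/%s", dir, name);

	return ((n < 0 || (size_t)n >= buflen) ? -1 : 0);
}
```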

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8692
2019-04-30 19:41:12 -07:00
Richard Laager 77449a1ab0 Clarify that deduped data is encrypted
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8691
2019-04-30 13:53:54 -07:00
Brian Behlendorf 294fcb543e Add CODE_OF_CONDUCT.md
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8660
2019-04-30 10:58:45 -07:00
Brian Behlendorf c12ea77865 Linux 5.0 compat: Remove incorrect ASSERT
Not all block devices, notably scsi_debug, set a root_blkg on the
request queue.  Remove this assertion and allow the existing
call to blkg_tryget() to gracefully handle the NULL (which it does).

Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8678
2019-04-29 18:20:21 -07:00
Tomohiro Kusumi b43a27f76f Use NV_ENCODE_NATIVE for nvlist encoding variable
Use NV_ENCODE_NATIVE for nvlist encoding variable instead of 0.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8653
2019-04-26 11:24:31 -07:00
Tomohiro Kusumi 9dfe4b80f0 Prevent make distclean removing config/config.rpath
`make distclean` removes an empty file config/config.rpath.
Avoid that by adding some text.

Also see e1245d83e9("Prevent `make distclean` removing 0 sized file").

--
 # find . -size 0
 ./config/config.rpath
 # ./autogen.sh && ./configure
 # git diff
 # make distclean
 # git diff
 diff --git a/config/config.rpath b/config/config.rpath
 deleted file mode 100644
 index e69de29bb..000000000

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8665
2019-04-26 11:22:14 -07:00
Tomohiro Kusumi 126d0fa733 Use SEEK_{SET,CUR,END} for file seek "whence"
Use either SEEK_* or 0,1,2..., but not both.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8656
2019-04-25 10:17:27 -07:00
Tom Caputi f4c594da94 Fixes for the DMU free throttle
This patch fixes 2 issues with the DMU free throttle implemented
in dmu_free_long_range(). The first issue is that get_next_chunk()
was calculating the number of L1 blocks the free would dirty
incorrectly. In some cases involving extremely large files, this
code would greatly overestimate the number of affected L1 blocks,
causing excessive calls to txg_wait_open(). This patch corrects
the calculation.

The second issue is that the free throttle uses the total number
of free'd blocks in all (open, quiescing, and syncing) txgs to
determine whether to throttle. This causes large frees (such as
those created by the first issue) to cause 4 txg syncs before
any further frees were allowed to proceed. This patch ensures
that the accounting is done entirely in a per-txg fashion, so
that frees from a given txg don't affect those that immediately
follow it.
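
Per-txg accounting of this kind can be sketched with a small array indexed
by the low bits of the txg number. Everything here is an assumption made
for illustration: the sketch_* names, the four-slot array, and the limit
field are not taken from dmu_free_long_range().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define	SKETCH_TXG_SIZE	4			/* txgs tracked concurrently */
#define	SKETCH_TXG_MASK	(SKETCH_TXG_SIZE - 1)

typedef struct {
	uint64_t dirty_frees[SKETCH_TXG_SIZE];	/* per-txg free accounting */
	uint64_t limit;				/* hypothetical per-txg threshold */
} sketch_throttle_t;

/* Charge freed blocks to the txg that dirtied them. */
void
sketch_note_free(sketch_throttle_t *t, uint64_t txg, uint64_t blocks)
{
	t->dirty_frees[txg & SKETCH_TXG_MASK] += blocks;
}

/* Throttle based on this txg's frees only, not the sum over all txgs. */
bool
sketch_should_throttle(const sketch_throttle_t *t, uint64_t txg)
{
	return (t->dirty_frees[txg & SKETCH_TXG_MASK] >= t->limit);
}
```

In a fuller version each slot would be reset once its txg has synced, so
that the slot can be reused by a later txg.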

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8655
2019-04-25 10:16:24 -07:00
Richard Laager 2b127afb44 Clarify and improve encryption documentation
- Remove the language that "all user data" is encrypted.  This is to
  avoid misunderstandings or arguments about what is "user data",
  especially in light of "user properties".
- Document that properties are unencrypted.
- Document that snapshot names are unencrypted.
- For consistency with the rest of the zfs.8 man page, use "ZFS" as the
  generic noun, not (bolded) "zfs".  The latter refers to the command.
  Likewise, use "ZFS" instead of "the kernel module".
- Give "a passphrase" as an example of a "user's key".

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8652
2019-04-24 17:14:24 -07:00
Richard Laager 6e81f9b21b Duplicate the encryption copies=3 limitation
This adds the encryption copies=3 limitation language into the copies
property section.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8651
2019-04-24 17:12:14 -07:00
Richard Laager 7b337fda40 Deprecate dedupditto
This documents, in zpool.8, that dedupditto is deprecated and will be
made to have no effect in a future release.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8650
2019-04-24 17:10:47 -07:00
Richard Laager 0531c83c18 Eliminate useless double-bolding in man pages
As far as I know and can tell from testing, \fB\fB...\fR\fR is exactly
equivalent to \fB...\fR.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8641
2019-04-24 17:04:35 -07:00
Richard Laager a5a6d82dda Alphabetize zpool-features.5 by short name
The features are sorted in the en_US locale, not the C locale.
Specifically, that means that bookmark_v2 comes _after_ bookmarks.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8641
2019-04-24 17:04:28 -07:00
Richard Laager 6f07780147 Standardize .RE placement in zpool-features.5
This command is being used to unindent, so it should be at the end of
each block.  This is consistent with the other man pages.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8641
2019-04-24 17:04:20 -07:00
Richard Laager 1644708fdc Add missing formatting to sha512 in zpool-features.5
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8641
2019-04-24 17:04:13 -07:00
Richard Laager e1822912a9 Correct GUID of large_blocks in zpool-features.5
It is org.open-zfs:large_blocks (plural).

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8641
2019-04-24 17:04:06 -07:00
Tom Caputi d0c3aa9cdd Change wording in zfs-module-parameters.5
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8641
2019-04-24 17:03:59 -07:00
Richard Laager 063edd7fd8 Document that hole_birth is effectively useless
The first sentence of this commit comes from the wiki, and was
originally written by:
Rich Ercolani <rincebrain@gmail.com>
with changes by:
Tom Caputi <tcaputi@datto.com>

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8641
Closes #8642
2019-04-24 17:03:27 -07:00
Richard Laager 6ce7b2d9ad Document send_holes_without_birth_time
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8641
2019-04-24 17:03:21 -07:00
Richard Laager e1fbe77110 Correct bookmark_v2 dependencies
encryption depends on bookmark_v2.

bookmark_v2 depends on bookmarks.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8641
2019-04-24 17:03:12 -07:00
Richard Laager 4d256e897a Fix formatting for multi_vdev_crash_dump in zpool-features.5
This needs to use tabs instead of spaces to display correctly (i.e. with
things lined up).

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8641
2019-04-24 17:02:45 -07:00
Tomohiro Kusumi 2de17298de Fix incorrect use of .Nm directive for ZPOOL_VDEV_NAME_GUID in zpool(8)
It should only affect "zpool".

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8644
2019-04-23 10:23:00 -07:00
Rafael Kitover e8864b1b28 config: libintl/libiconv for gettext() detection
Detect in autoconf whether `-lintl` and possibly `-liconv` are necessary
for translation functions like `gettext()`.

The actual autoconf code is just:

```
AM_ICONV
AM_GNU_GETTEXT([external])
LIBS="$LIBS $LTLIBINTL $LTLIBICONV"
```

References:

https://www.gnu.org/software/gettext/manual/html_node/AM_005fGNU_005fGETTEXT.html
https://www.gnu.org/software/gettext/manual/html_node/AM_005fICONV.html

The reason to check for `libiconv` and add it separately is that this is
sometimes necessary if users are linking statically.

The `config/*.m4` files were added by running `gettextize` and removing
everything else.

The empty file `config/config.rpath` is necessary to avoid an error with
some versions of autotools, see:

http://ramblingfoo.blogspot.com/2007/07/required-file-configrpath-not-found.html

The `config.rpath` copied by `gettextize` does not currently work; there
is some kind of missing interaction with `libtool` and it tries to apply
`libtool` flags to the compiler.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Rafael Kitover <rkitover@gmail.com>
Closes #8554
2019-04-19 12:09:29 -07:00
Tomohiro Kusumi 34d343c2a8 Drop unused ZNODE_STATS and ZNODE_STAT_ADD()
Unused since 5649246dd3("Remove znode move functionality"),
and ZNODE_STAT_ADD() will never be needed.

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8636
2019-04-19 12:05:15 -07:00
Tomohiro Kusumi f8b2ca6b1c Fix typo "/zbin/zpool" -> "/sbin/zpool"
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8643
2019-04-19 12:04:21 -07:00
Tomohiro Kusumi a35c12073c Fix incorrect "[UNUSED]" comments
These aren't unused.
`flag` in zfs_create() also isn't to indicate large file.

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8635
2019-04-19 12:03:32 -07:00
Brian Behlendorf 17cbc2e62b Tag 0.8.0-rc4
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2019-04-16 13:24:49 -07:00
cfzhu 5090f72743 Code improvement and bug fixes for QAT support
1. Support QAT when ZFS is root file-system:
   When ZFS module is loaded before QAT started, the QAT can
   be started again in post-process, e.g.:
   echo 0 > /sys/module/zfs/parameters/zfs_qat_compress_disable
   echo 0 > /sys/module/zfs/parameters/zfs_qat_encrypt_disable
   echo 0 > /sys/module/zfs/parameters/zfs_qat_checksum_disable
2. Verify Adler checksum of the decompressed result
3. Allocate Digest, IV and AAD buffer in physical contiguous
   memory by QAT_PHYS_CONTIG_ALLOC.
4. Update the documentation for zfs_qat_compress_disable,
   zfs_qat_checksum_disable, zfs_qat_encrypt_disable.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Weigang Li <weigang.li@intel.com>
Signed-off-by: Chengfeix Zhu <chengfeix.zhu@intel.com>
Closes #8323 
Closes #8610
2019-04-16 12:38:36 -07:00
Richard Laager 59f6594cf6 Restructure vec_idx loops
This replaces empty for loops with while loops to make the code easier
to read.
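
As an illustration only, the kind of rewrite described above can be
sketched as follows. The helper name `skip_below()` and its arguments
are hypothetical and are not taken from the patch:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: return the index of the first element that is
 * not below 'limit', or 'n' if every element is below it.
 */
size_t
skip_below(const int *vec, size_t n, int limit)
{
	size_t vec_idx = 0;

	assert(vec != NULL || n == 0);

	/*
	 * Before (empty loop body, easy to misread):
	 *   for (; vec_idx < n && vec[vec_idx] < limit; vec_idx++);
	 */
	while (vec_idx < n && vec[vec_idx] < limit)
		vec_idx++;

	return (vec_idx);
}
```

Both forms compute the same index; the while loop just makes the
advancing step visible as the loop body.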

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reported-by: github.com/dcb314
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #6681
Closes #6682
Closes #6683
Closes #8623
2019-04-16 12:34:06 -07:00
TerraTech 50478c6dad Add option [-V|--version] to emit version string
Add the 'zfs version' and 'zpool version' subcommands to display
the version of the user space utilities and loaded zfs kernel
module.  For example:

$ zfs version
zfs-0.8.0-rc3_169_g67e0366b88
zfs-kmod-0.8.0-rc3_169_g67e0366b88

The '-V' and '--version' aliases were added to support the
common convention of using 'zfs --version' to obtain the version
information.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: TerraTech <1118433+TerraTech@users.noreply.github.com>
Closes #2501
Closes #8567
2019-04-16 12:24:06 -07:00
Richard Laager 8750edf1f7 zfs allow refreservation needed for zfs create -V
When creating a non-sparse volume, zfs create sets a refreservation.
Accordingly, one needs the "refreservation" ability in addition to the
"create" ability in order to create a non-sparse volume.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reported-by: github.com/homerlinux
Reported-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8531
Closes #8624
2019-04-16 10:12:08 -07:00
Richard Laager 6c0f78f8a3 Clarify GRUB's lack of support for sha512, skein, edonr
zfs.8 correctly said that GRUB did not support them, but
zpool-features.5 said that "Booting off pools...is supported."  Now,
zpool-features.5 discusses GRUB specifically and indicates its lack of
support for these features.  Also, I have clarified the wording in both
places to indicate that the pool feature cannot be used.  It's not a
filesystem dataset thing, but pool-wide.

I described this as "cannot be used".  I think technically the feature
can be enabled, just not active.  However, the effect is essentially the
same: you cannot enable those checksum algorithms on any dataset in the
pool, so you might as well not enable the feature (which is just
pointing a loaded gun at your foot).  In the past, an argument could be
made that having all the features enabled was useful for simplicity, as
long as you didn't activate the GRUB-incompatible features, but that's
getting less and less realistic over time.  A user can still do that,
but we should not encourage that.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8626
Closes #8446
2019-04-16 10:02:46 -07:00
Richard Laager fcf21f8fcb Update a comment to match the code
GRUB supports large_blocks.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8626
2019-04-16 10:02:33 -07:00
Richard Laager 8dda07b33c Reference zfeature.c in a SPA_VERSION comment
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8626
2019-04-16 10:02:19 -07:00
Richard Laager 7698c4eca9 Remove zfs.h comments about GRUB
Nobody is going to be bumping SPA_VERSION again, as OpenZFS has moved on
to feature flags.  Also, there is no requirement to keep GRUB
up-to-date, nor has that been happening.

The ZPL_VERSION could be bumped, but that would likely be handled in a
similar way, by adding filesystem feature flags.  In any event, we do
not need this comment, and we certainly don't need a reference to the
GRUB 0.97 source code in a Solaris tree.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8626
2019-04-16 10:02:14 -07:00
Richard Laager 7886aa8a79 Reword the dedup limitation for Edon-R
The old wording was effectively "You can not use this (except you can)",
which just seems confusing.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8626
2019-04-16 10:02:07 -07:00
Richard Laager 393363c5ec Consistently capitalize GUID for features
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8626
2019-04-16 10:01:51 -07:00
Richard Laager 612c4930dd Fix the spelling of deferred
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8626
2019-04-16 10:01:45 -07:00
Richard Laager c349137a7e Update zdb.8's fsck reference
On Linux, this is in man section 8, not 1M.  Also, there is no fsdb on
Linux, so I removed that.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8626
2019-04-16 10:01:38 -07:00
Richard Laager 9042ca0e94 Refer to commands consistently in zpool-features.5
This had a mix of command vs subcommand, quoted vs not quoted, and
bolded vs. not bolded command names.

Also, fix man page sections from 1M (Solaris) to 8 (Linux).

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8626
2019-04-16 10:01:31 -07:00
Richard Laager 9810410a53 Eliminate most mentions of "special"
Previously, the "spare" vdev type was described as "A special
pseudo-vdev which...".  I wanted to eliminate the word "special" from
that, now that the allocation_classes feature exists and there is such a
thing as a "special vdev".  I ended up eliminating almost all instances
of the word "special" that are not referencing the allocation_classes
feature.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8626
2019-04-16 09:59:57 -07:00
Tom Caputi c2c6eadf29 Fix issues with truncated files in raw sends
When receiving a raw send stream only reallocated objects
whose contents were not freed by the standard indicators
should call dmu_free_long_range().

Furthermore, if calling dmu_free_long_range() is required
then the object's current block size must be used and not
the new block size.

Two additional test cases were added to provide realistic
test coverage for processing reallocated objects which are
part of a raw receive.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8528
Closes #8607
2019-04-15 15:28:48 -07:00
Richard Laager 83472fabe5 Fix hierarchy misspellings
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reported-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8563
Closes #8622
2019-04-14 19:06:34 -07:00
Tomohiro Kusumi 703f791d35 Don't hard-code number of ioctls for portability
Use (ZFS_IOC_LAST - ZFS_IOC_FIRST) instead of 256.
It seems 256 is just a number large enough to hold ioctls
at the moment.

Using 256 also causes compile-time warning or error
on platforms whose enum zfs_ioc definition differs.
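
A minimal sketch of the idea, assuming a made-up enum: the `demo_ioc`
names below are hypothetical stand-ins for a platform's `zfs_ioc`
definition, not the real ioctl list.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical enum standing in for a platform's zfs_ioc definition. */
typedef enum demo_ioc {
	DEMO_IOC_FIRST = 0x5a00,
	DEMO_IOC_POOL_CREATE = DEMO_IOC_FIRST,
	DEMO_IOC_POOL_DESTROY,
	DEMO_IOC_POOL_IMPORT,
	DEMO_IOC_LAST
} demo_ioc_t;

/* Derive the table size from the enum instead of hard-coding 256. */
#define	DEMO_IOC_COUNT	(DEMO_IOC_LAST - DEMO_IOC_FIRST)

static const char *const demo_ioc_names[DEMO_IOC_COUNT] = {
	"pool_create",
	"pool_destroy",
	"pool_import",
};

/* Return the name for an ioctl number, or NULL if it is out of range. */
const char *
demo_ioc_name(int ioc)
{
	if (ioc < DEMO_IOC_FIRST || ioc >= DEMO_IOC_LAST)
		return (NULL);
	return (demo_ioc_names[ioc - DEMO_IOC_FIRST]);
}
```

With the size tied to the enum, adding or removing an entry changes the
table size automatically, and a platform with a different enum layout
does not end up with a mismatched array.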

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8598
2019-04-14 11:13:34 -07:00
Tomohiro Kusumi 96e51d2773 Sync reserved Illumos ioctl comment with actual number
It's 81 now.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8598
2019-04-14 11:12:07 -07:00
Richard Elling d5d2ef2b26 compile with -fno-omit-frame-pointer
By default, depending on the version, gcc can reuse the frame pointer register.
This is a micro-optimization that might help on some very old x86 processors.
However, it also makes dynamic tracing less useful because the stacks cannot
be easily observed.

This rule change instructs gcc to use the -fno-omit-frame-pointer option
when compiling.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com>
Closes #8617
2019-04-14 11:04:54 -07:00
Tom Caputi 7dcd318832 Cleanup nits from ab7615d92
This patch simply cleans up a nit and corrects an error message
issue that were introduced in the Multiple DVA scrub patch.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8619
2019-04-14 11:03:06 -07:00
Brian Behlendorf b92f5d9f82 Fix issue in receive_object() during reallocation
When receiving an object to a previously allocated interior slot
the new object should be "allocated" by setting DMU_NEW_OBJECT,
not "reallocated" with dnode_reallocate().  For resilience verify
the slot is free as required in case the stream is malformed.

Add a test case to generate more realistic incremental send streams
that force reallocation to occur during the receive.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8067 
Closes #8614
2019-04-12 14:28:04 -07:00
Brian Behlendorf 3fa93bb8d3 Fix TXG_MASK cstyle
Fix style issue for 'tx->tx_txg&TXG_MASK'.  There should be white
space around the '&' character.  Split the dnode_reallocate() ASSERT
to make it more readable to clearly separate the checks.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8606
2019-04-12 11:30:59 -07:00
John Wren Kennedy 9e3485abfc ZTS: Make fault cleanup function more robust
The cleanup function of auto_online_001_pos does not account for the
possibility that the test may fail while a disk is still removed. If
the test run is using real disks, cleanup should involve restoring any
that are missing.

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Kennedy <john.kennedy@delphix.com>
Closes #8579
2019-04-12 10:07:20 -07:00
Alek P b31cf30a15 Allow zfs-tests to recover from hibernation
When a system sleeps during a zfs-test, the time spent
hibernating is counted against the test's runtime even
though the test can't and isn't running.
This patch tries to detect timeouts due to hibernation and
reruns tests that timed out due to system sleeping.
In this version of the patch, the existing behavior of returning
non-zero when a test was killed is preserved. With this patch applied
we still return nonzero and we also automatically rerun the test we
suspect of being killed due to system hibernation.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #8575
2019-04-11 10:20:37 -07:00
Jorgen Lundman 48ed0f9da0 Always call rw_init in zio_crypt_key_unwrap
The error path in zio_crypt_key_unwrap would call zio_crypt_key_destroy which
calls rw_destroy(&key->zk_salt_lock); which has not yet been initialized.

We move the rw_init() call to the start of zio_crypt_key_unwrap instead.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #8604
Closes #8605
2019-04-10 15:39:40 -07:00
Tim Chase 8cb34421e0 Avoid stack overwrite in zfs_setattr_dir()
The bulk[] array index, count, must be reset per-iteration in order to
not overwrite the stack.
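
A simplified sketch of this class of bug, using hypothetical names
(`BULK_MAX`, `fill_bulk_per_item()`) rather than the real
zfs_setattr_dir() code:

```c
#include <assert.h>
#include <stddef.h>

#define	BULK_MAX	4

/*
 * Hypothetical loop: for each item, fill a fixed-size bulk[] array and
 * consume it.  Returns the largest index count reached and stores the
 * sum of all consumed values in *sum.
 */
size_t
fill_bulk_per_item(size_t nitems, size_t attrs_per_item, long *sum)
{
	int bulk[BULK_MAX];
	size_t count, max_count = 0;

	assert(sum != NULL);
	assert(attrs_per_item <= BULK_MAX);

	*sum = 0;
	for (size_t i = 0; i < nitems; i++) {
		/*
		 * Reset the index on every iteration.  If it were only
		 * initialized once outside the loop, the second item
		 * would write to bulk[BULK_MAX] and beyond.
		 */
		count = 0;

		for (size_t a = 0; a < attrs_per_item; a++)
			bulk[count++] = (int)(i + a);

		for (size_t a = 0; a < count; a++)
			*sum += bulk[a];

		if (count > max_count)
			max_count = count;
	}

	return (max_count);
}
```

With the reset in place the index never exceeds `attrs_per_item`,
regardless of how many items are processed.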

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #8072
Closes #8597
Closes #8601
2019-04-10 15:38:21 -07:00
Tomohiro Kusumi 5ae4e4481e Don't assume pthread_t is uint_t for portability
POSIX doesn't define pthread_t as uint_t. It could be a pointer.
This code causes the compile error below on a platform using a pointer
for pthread_t.

--
kernel.c:815:25: error: cast from pointer to integer of different size [-Werror=pointer-to-int-cast]
    (void) printf("%u ", (uint_t)pthread_self());
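
One portable way to print a thread ID without assuming anything about
the type is to treat pthread_t as opaque bytes.  The sketch below is an
illustration only and is not necessarily the change this commit made;
`thread_id_to_hex()` is a hypothetical helper.

```c
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format an opaque pthread_t as hex bytes.
 * Returns the number of characters written (excluding the NUL), or -1
 * if the buffer is too small.  If pthread_t is a struct with padding,
 * the padding bytes are printed as well.
 */
int
thread_id_to_hex(pthread_t tid, char *buf, size_t buflen)
{
	unsigned char raw[sizeof (pthread_t)];
	size_t i;

	assert(buf != NULL);
	if (buflen < 2 * sizeof (raw) + 1)
		return (-1);

	/* No cast of pthread_t to an integer type is needed. */
	memcpy(raw, &tid, sizeof (raw));
	for (i = 0; i < sizeof (raw); i++)
		(void) snprintf(buf + 2 * i, 3, "%02x", raw[i]);

	return ((int)(2 * sizeof (raw)));
}
```

A caller would pass `pthread_self()` and a buffer of at least
`2 * sizeof (pthread_t) + 1` bytes.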

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8558
2019-04-09 20:49:03 -07:00
Tomohiro Kusumi 9a65234c8b Unbreak build on Linux kernel < 3.10
d12614521a("Fixes for procfs files backed by linked lists")
uses PDE_DATA(), but since PDE_DATA() (public interface which
replaced old public interface PDE()) first appeared in upstream
kernel 3.10, it lacks visible local definition for kernel < 3.10.

Move the local PDE_DATA() definition to a ZoL header, to unbreak
build on kernel < 3.10.

--
module/spl/spl-procfs-list.c: In function 'procfs_list_open':
module/spl/spl-procfs-list.c:166: error: implicit declaration of function 'PDE_DATA'
module/spl/spl-procfs-list.c:166: warning: assignment makes pointer from integer without a cast

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8599
2019-04-08 14:59:24 -07:00
Brian Behlendorf c375c69eca Fix 'zfs list -t snapshot' depth
Commit df583073 introduced the ability to list the snapshots for a
specified dataset.  This change inadvertently resulted in only the top-
level snapshots being listed when no dataset was specified.  Fix this
issue by adding an additional check to determine if a dataset was
provided to avoid incorrectly restricting the depth.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8591 
Closes #8594
2019-04-08 09:14:45 -07:00
Brian Behlendorf ac4985e48d Fix buffer length in strlcpy()
The length used for the strlcpy() used the size of zv_value
when it should have used the size of zc_name.  Correct this
typo.
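
The general rule is that the length passed to a bounded copy must be
the size of the destination.  A self-contained sketch, assuming a
hypothetical `demo_cmd` structure and a local `demo_strlcpy()` so that
it does not depend on the C library providing strlcpy():

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical structure with a small and a large string field. */
struct demo_cmd {
	char zc_name[16];
	char zc_value[64];
};

/* Local strlcpy-style copy: always NUL-terminates when dstsize > 0. */
size_t
demo_strlcpy(char *dst, const char *src, size_t dstsize)
{
	size_t srclen = strlen(src);

	if (dstsize != 0) {
		size_t n = (srclen < dstsize - 1) ? srclen : dstsize - 1;

		memcpy(dst, src, n);
		dst[n] = '\0';
	}
	return (srclen);
}

void
demo_set_name(struct demo_cmd *zc, const char *name)
{
	assert(zc != NULL && name != NULL);

	/*
	 * Passing sizeof (zc->zc_value) here would let a long name
	 * run past the end of zc_name into the next field.
	 */
	(void) demo_strlcpy(zc->zc_name, name, sizeof (zc->zc_name));
}
```

With the destination size in place, an overlong name is truncated to
fit and the neighbouring field is left untouched.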

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8595
Closes #8596
2019-04-08 09:10:59 -07:00
Brian Behlendorf d93d4b1acd Revert "Fix issues with truncated files in raw sends"
This partially reverts commit 5dbf8b4ed.  This change resolved
the issues observed with truncated files in raw sends.  However,
the required changes to dnode_allocate() introduced a regression
for non-raw streams which needs to be understood.

The additional debugging improvements from the original patch
were not reverted.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #7378
Issue #8528
Issue #8540
Issue #8565
Close #8584
2019-04-05 17:32:56 -07:00
Matthew Ahrens 944a37248a predictive prefetch disabled on new pools until export/reboot
When a pool is initially created (by `zpool create`), predictive
prefetch is inadvertently disabled, until the pool is export/import-ed,
or the machine is rebooted.

When device removal was introduced, we added some code to disable
predictive prefetching until indirect vdevs have been loaded.  This
resulted in the "default state" of prefetch being disabled, until we
proactively enable it after indirect vdevs are loaded.  Unfortunately
this resulted in a few bugs where in some code paths we neglect to
enable predictive prefetch.  The first of these was fixed by
https://github.com/zfsonlinux/zfs/commit/20507534d4ede14d4dd82c99fc8d461704ce7419

This commit fixes another case where we also need to explicitly enable
predictive prefetch, when the pool is initially created.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8577
2019-04-05 10:12:02 -07:00
Don Brady b4ddec7af6 features.kernel layout should match features.pool
The features.kernel layout should match features.pool.

Reviewed-by: Sara Hartse <sara.hartse@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #8566
2019-04-04 19:00:55 -07:00
Sara Hartse a887d653b3 Restrict kstats and print real pointers
There are several places where we use zfs_dbgmsg and %p to
print pointers. In the Linux kernel, these values are obfuscated
to prevent information leaks which means the pointers aren't
very useful for debugging crash dumps. We decided to restrict
the permissions of dbgmsg (and some other kstats while we were
at it) and print pointers with %px in zfs_dbgmsg as well as
spl_dumpstack.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Signed-off-by: sara hartse <sara.hartse@delphix.com>
Closes #8467 
Closes #8476
2019-04-04 18:57:06 -07:00
Josh Soref af65079300 Hint about zpool free vs zfs available
Also describe free/allocated/fragmentation

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>
Closes #7565 
Closes #8483
2019-04-04 10:00:16 -07:00
Brian Behlendorf f4e35b165c Fix txg_wait_open() load average inflation
Callers of txg_wait_open() which set should_quiesce=B_TRUE should be
accounted for as iowait time.  Otherwise, the caller is understood
to be idle and cv_wait_sig() is used to prevent incorrectly inflating
the system load average.

Similarly txg_wait_wait() has been updated to use cv_wait_io() to
be accounted against iowait.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8550
Closes #8558
2019-04-04 09:44:46 -07:00
Michael Niewöhner ce4432c542 Move dracut specifics to dracut module
Dracut depends on the environment variable BOOTFS to be set after pool
import. This dracut specific systemd ExecStartPost command should not be
called for any non-dracut systems, so let's move it to a static systemd
unit that.

Reviewed-by: Manuel Amador (Rudd-O) <rudd-o@rudd-o.com>
Reviewed-by: Matthew Thode <prometheanfire@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #8510
2019-04-02 17:14:39 -07:00
Josh Soref f72ecb8d27 Fix man(1) warnings
The macOS man app strenuously objects to blank lines in man files.

mdoc warning: Empty input line #xyz

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: bunder2015 <omfgbunder@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>
Closes #8559
2019-04-02 11:05:09 -07:00
TerraTech bd15ac764f Append snapshot name to "TIME SENT SNAPSHOT" output
Simply appends zhp->zfs_name to the "TIME SENT SNAPSHOT" output.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: TerraTech <TerraTech@users.noreply.github.com>
Closes #8543
2019-04-01 12:25:17 -07:00
Tom Caputi df583073eb Do not iterate through filesystems unnecessarily
Currently, when attempting to list snapshots ZFS may do a lot of
extra work checking child datasets. This is because the code does
not realize that it will not be able to reach any snapshots
contained within datasets that are at the depth limit since the
snapshots of those datasets are counted as an additional layer
deeper. This patch corrects this issue.

In addition, this patch adds the ability to perform the commands:

$ zfs list -t snapshot <dataset>
$ zfs get -t snapshot <prop> <dataset>

as a convenient way to list out properties of all snapshots of a
given dataset without having to use the depth limit.

Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8539
2019-04-01 11:58:59 -07:00
Michael Niewöhner e03b25a564 Fix systemd-import services
On Debian, systemd complains about missing /bin/awk because it
actually is located at /usr/bin/awk. It is not a good idea to
hardcode binary paths because different Linux distros use different
paths. According to systemd's man page it is absolutely safe to
omit paths for binaries located at standard locations (/bin,
/sbin, /usr/bin, ...).

Further, replace this more or less complicated awk command by
grep.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Issue #8510
2019-03-29 15:17:23 -07:00
Michael Niewöhner 3b2618927c Remove hard dependency on bash
zfs-import-* services have a hard dependency on bash while not
everyone has bash installed. At this point /bin/sh is sufficient,
so use that.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Issue #8510
2019-03-29 15:16:58 -07:00
Tom Caputi dd29864b01 Update raw send documentation
This patch simply clarifies some of the limitations related to
raw sends in the man page. No functional changes.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jason Cohen <jwittlincohen@gmail.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8503
Closes #8544
2019-03-29 15:05:55 -07:00
Brian Behlendorf 1b939560be Add TRIM support
UNMAP/TRIM support is a frequently-requested feature to help
prevent performance from degrading on SSDs and on various other
SAN-like storage back-ends.  By issuing UNMAP/TRIM commands for
sectors which are no longer allocated the underlying device can
often more efficiently manage itself.

This TRIM implementation is modeled on the `zpool initialize`
feature which writes a pattern to all unallocated space in the
pool.  The new `zpool trim` command uses the same vdev_xlate()
code to calculate what sectors are unallocated, the same per-
vdev TRIM thread model and locking, and the same basic CLI for
a consistent user experience.  The core difference is that
instead of writing a pattern it will issue UNMAP/TRIM commands
for those extents.

The zio pipeline was updated to accommodate this by adding a new
ZIO_TYPE_TRIM type and associated spa taskq.  This new type makes
it straightforward to add the platform specific TRIM/UNMAP calls
to vdev_disk.c and vdev_file.c.  These new ZIO_TYPE_TRIM zios are
handled largely the same way as ZIO_TYPE_READs or ZIO_TYPE_WRITEs.
This makes it possible to largely avoid changing the pipeline;
one exception is that TRIM zios may exceed the 16M block size
limit since they contain no data.

In addition to the manual `zpool trim` command, a background
automatic TRIM was added and is controlled by the 'autotrim'
property.  It relies on the exact same infrastructure as the
manual TRIM.  However, instead of relying on the extents in a
metaslab's ms_allocatable range tree, a ms_trim tree is kept
per metaslab.  When 'autotrim=on', ranges added back to the
ms_allocatable tree are also added to the ms_trim tree.  The
ms_trim tree is then periodically consumed by an autotrim
thread which systematically walks a top level vdev's metaslabs.

Since the automatic TRIM will skip ranges it considers too small
there is value in occasionally running a full `zpool trim`.  This
may occur when the freed blocks are small and not enough time
was allowed to aggregate them.  An automatic TRIM and a manual
`zpool trim` may be run concurrently, in which case the automatic
TRIM will yield to the manual TRIM.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Contributions-by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Contributions-by: Tim Chase <tim@chase2k.com>
Contributions-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8419 
Closes #598
2019-03-29 09:13:20 -07:00
Tom Caputi f94b3cbf43 Send stream should only list included snaps
Currently, zfs send streams will include a list of all snapshots
on the source side if the '-p' option is provided. This can cause
performance problems on the receive side, especially if those
snapshots aren't present on the destination. These problems arise
because guid_to_name(), which is used for several receive side
functions, will search the entire receive-side pool if it can't
find a snapshot with a matching guid. This patch corrects the
issue by ensuring only streams that require this list of snapshots
include them.

Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8533
2019-03-28 15:48:58 -07:00
Tom Caputi 5dbf8b4edd Fix issues with truncated files in raw sends
This patch fixes a few issues with raw receives involving
truncated files:

* dnode_reallocate() now calls dnode_set_blksz() instead of
  dnode_setdblksz(). This ensures that any remaining dbufs with
  blkid 0 are resized along with their containing dnode upon
  reallocation.

* One of the calls to dmu_free_long_range() in receive_object()
  needs to check that the object whose contents it is about to
  free hasn't been completely removed already by a previous call
  to dmu_free_long_object() in the same function.

* The same call to dmu_free_long_range() in the previous point
  needs to ensure it uses the object's current block size and
  not the new block size. This ensures the blocks of the object
  that are supposed to be freed are completely removed and not
  simply partially zeroed out.

This patch also adds handling for DRR_OBJECT_RANGE records to
dprintf_drr() for debugging purposes.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7378 
Closes #8528
2019-03-27 11:30:48 -07:00
Richard Elling 85a150ce1e Update valid vdev types for get_disklist
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com>
Closes #8532
2019-03-26 14:18:58 -07:00
Brian Behlendorf db6be852da ZTS: Detect e2fsprogs verity issue
The projectid_001_pos and projecttree_001_pos test cases use the lsattr
command to detect that the project quota bit is set correctly.  Due to
a bug in e2fsprogs-1.44.4 setting the Project 'P' bit also results in
the Verity 'V' bit being reported as set.  This will result in the test
case failing.

The issue has been resolved in e2fsprogs but in order to avoid testing
failures these two test cases are skipped when e2fsprogs-1.44.4 is
installed.

https://github.com/tytso/e2fsprogs/commit/7e5a95e3d

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8534
2019-03-26 13:57:40 -07:00
Evan Allrich 74580a9411 Correct a very minor grammar issue
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Evan Allrich <evan@unguku.com>
Closes #8535
2019-03-26 12:27:29 -07:00
Richard Elling fc16b4f4c8 git ignore python 3.7 virtual environment directories
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com>
Closes #8525
2019-03-25 15:05:26 -07:00
Igor K c048ddaf33 Fix vd_path and error in spa_vdev_remove()
Make a local copy of the vd_path and preserve the removal error
for use in spa_history_log_internal().  This is required because
after spa_vdev_exit() there is nothing preventing the vdev state
from changing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes #8522
2019-03-22 13:25:07 -07:00
Roman Strashkin 234234ca4d Panic when running 'zpool split'
Add the missing removal of the detachable VDEV from the txg's DTL list
to avoid a use-after-free for the split VDEV.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Roman Strashkin <roman.strashkin@nexenta.com>
Closes #5565 
Closes #7856
2019-03-22 13:11:36 -07:00
George Wilson 2efea7c82c ZFS Reads may result in unnecessary calls to zil_commit
ZFS supports O_RSYNC for read operations and when specified will ensure
the same level of data integrity that O_DSYNC and O_SYNC provide for
writes. O_RSYNC by itself has no effect so it must be combined with
either O_DSYNC or O_SYNC. However, many platforms don't support O_RSYNC
and have mapped O_SYNC to mean O_RSYNC within ZFS. This is incorrect
and causes unnecessary calls to zil_commit. Only platforms which
support O_RSYNC should implement the zil_commit functionality in the
read code path.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes #8523
2019-03-22 13:09:11 -07:00
Olaf Faaland 060f0226e6 MMP interval and fail_intervals in uberblock
When Multihost is enabled, and a pool is imported, uberblock writes
include ub_mmp_delay to allow an importing node to calculate the
duration of an activity test.  This value alone is not enough information.

If zfs_multihost_fail_intervals > 0 on the node with the pool imported,
the safe minimum duration of the activity test is well defined, but does
not depend on ub_mmp_delay:

zfs_multihost_fail_intervals * zfs_multihost_interval

and if zfs_multihost_fail_intervals == 0 on that node, there is no such
well defined safe duration, but the importing host cannot tell whether
mmp_delay is high due to I/O delays, or due to a very large
zfs_multihost_interval setting on the host which last imported the pool.
As a result, it may use a far longer period for the activity test than
is necessary.

This patch renames ub_mmp_sequence to ub_mmp_config and uses it to
record the zfs_multihost_interval and zfs_multihost_fail_intervals
values, as well as the mmp sequence.  This allows a shorter activity
test duration to be calculated by the importing host in most situations.
These values are also added to the multihost_history kstat records.

It calculates the activity test duration differently depending on
whether the new fields are present or not; for importing pools with
only ub_mmp_delay, it uses

(zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals

which results in an activity test duration less sensitive to the leaf
count.
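
The two formulas quoted above can be sketched as follows.  This is only
an illustration of the arithmetic, not the code in this patch: the
function name, the use of milliseconds for every argument, and the
fallback to the legacy formula when the recorded fail_intervals is 0
are all assumptions.

```python
def activity_test_ms(import_intervals, local_interval_ms, mmp_delay_ms,
                     remote_interval_ms=None, remote_fail_intervals=None):
    """Hypothetical helper: estimate the activity test duration in ms."""
    have_config = (remote_interval_ms is not None and
                   remote_fail_intervals is not None)
    if have_config and remote_fail_intervals > 0:
        # Well-defined safe minimum recorded by the remote host:
        # zfs_multihost_fail_intervals * zfs_multihost_interval
        return remote_fail_intervals * remote_interval_ms
    # Only ub_mmp_delay is available (assumed fallback otherwise):
    # (zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals
    return (local_interval_ms + mmp_delay_ms) * import_intervals


legacy = activity_test_ms(20, 1000, 100)
with_config = activity_test_ms(20, 1000, 100,
                               remote_interval_ms=1000,
                               remote_fail_intervals=10)
```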

In addition, it makes a few other improvements:
* It updates the "sequence" part of ub_mmp_config when MMP writes
  in between syncs occur.  This allows an importing host to detect MMP
  on the remote host sooner, when the pool is idle, as it is not limited
  to the granularity of ub_timestamp (1 second).
* It issues writes immediately when zfs_multihost_interval is changed
  so remote hosts see the updated value as soon as possible.
* It fixes a bug where setting zfs_multihost_fail_intervals = 1 results
  in immediate pool suspension.
* It updates the tests to verify that the activity check duration is
  based on the recorded tunable values, not the tunable values on the
  importing host.
* It updates the tests to verify that the expected number of uberblocks
  have valid MMP fields - fail_intervals, mmp_interval, mmp_seq
  (sequence number), that the sequence number is incrementing, and that
  the uberblock values match the tunable settings.

Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7842
2019-03-21 12:47:57 -07:00
Jorgen Lundman d10b2f1d35 Mutex leak in dsl_dataset_hold_obj()
In addition to dsl_dataset_evict_async() releasing a hold, there is
an error case in dsl_dataset_hold_obj() which had missed 4 additional
release calls.  This was introduced in a1d477c24.

openzfsonosx-commit: https://github.com/openzfsonosx/zfs/commit/63ff7f1c

Authored by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8517
2019-03-21 10:36:58 -07:00
cfzhu 45001b949c QAT: Allocate digest_buffer using QAT_PHYS_CONTIG_ALLOC()
If the buffer 'digest_buffer' is allocated on the qat_checksum()
stack, there is no guarantee that its address is physically contiguous,
and the DMA result of the buffer may be handled incorrectly.
Using QAT_PHYS_CONTIG_ALLOC() ensures a physically
contiguous allocation.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Chengfei, Zhu <chengfeix.zhu@intel.com>
Closes #8323 
Closes #8521
2019-03-21 10:35:18 -07:00
Brian Behlendorf ec4f9b8f30 Report holes when there are only metadata changes
Update the dirty check in dmu_offset_next() such that dnodes
are only considered dirty for the purpose of reporting holes
when there are pending data blocks or frees to be synced.  This
ensures that when there are only metadata updates to be synced
(atime) that holes are reported.

Reviewed-by: Debabrata Banerjee <dbanerje@akamai.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6958 
Closes #8505
2019-03-21 10:30:15 -07:00
Brian Behlendorf 066da71e7f Improve zpool labelclear
1) As implemented the `zpool labelclear` command overwrites
the calculated offsets of all four vdev labels even when only a
single valid label is found.  If the device has been re-purposed
but still contains a valid label this can result in space no
longer owned by ZFS being zeroed.  Prevent this by verifying
every label removed is intact before it's overwritten.

2) Address a small bug in zpool_do_labelclear() which prevented
labelclear from working on file vdevs.  Only block devices support
BLKFLSBUF; try the ioctl(), but do not treat it as fatal when it is
reported as unsupported.

3) Fix `zpool labelclear` so it can be run on vdevs which were
removed from the pool with `zpool remove`.  Additionally, allow
intact but partial labels to be cleared as in the case of a failed
`zpool attach` or `zpool replace`.

4) Remove LABELCLEAR and LABELREAD variables for test cases.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8500 
Closes #8373 
Closes #6261
2019-03-21 10:13:01 -07:00
Julian Heuking 304d469dcd Add missing dmu_zfetch_fini() in dnode_move_impl()
As it turns out, on the Windows platform when rw_init() is called
(rather its bedrock call ExInitializeResourceLite) it is placed on
an active-list of locks, and is removed at rw_destroy() time.

dnode_move() has logic to copy over the old-dnode to new-dnode,
including calling dmu_zfetch_init(new-dnode). But due to the missing
dmu_zfetch_fini(old-dnode), kmem will call dnode_dest() to release the
memory (and in debug builds fill pattern 0xdeadbeef) over the Windows
active-lock's prev/next list pointers, making Windows sad.

But on other platforms, the contents of dmu_zfetch_fini() is one
call to list_destroy() and one to rw_destroy(), which is effectively
a no-op call and is not required. This commit is mostly for
"correctness" and can be skipped there.

Porting Notes:
* This leak exists on Linux but currently can never happen because
  the dnode_move() functionality is not supported.

openzfsonosx-commit: openzfsonosx/zfs@d95fe517

Authored by: Julian Heuking <JulianH@beckhoff.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Closes #8519
2019-03-20 15:06:55 -07:00
Tom Caputi 73c25a78e6 Add space in error message
This patch simply adds a missing space in the
ZFS_ERR_FROM_IVSET_GUID_MISSING error message.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8514
2019-03-19 10:22:39 -07:00
Brian Behlendorf ca6c7a94c9 Fix l2arc_evict() destroy race
When destroying an arc_buf_hdr_t its identity cannot be discarded
until it is entirely undiscoverable.  This not only includes being
unhashed, but also being removed from the l2arc header list.
Discarding the header's identity prematurely renders the hash
lock useless because it will always hash to bucket zero.

This change resolves a race with l2arc_evict() by discarding the
identity after it has been removed from the l2arc header list.
This ensures either the header is not on the list or contains
the correct identity.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7688 
Closes #8144
2019-03-15 14:17:38 -07:00
Tom Caputi ab7615d92c Multiple DVA Scrubbing Fix
Currently, there is an issue in the sequential scrub code which
prevents self healing from working in some cases. The scrub code
will split up all DVA copies of a bp and issue each of them
separately. The problem is that, since each of the DVAs is no
longer associated with the others, the self healing code doesn't
have the opportunity to repair problems that show up in one of the
DVAs with the data from the others.

This patch fixes this issue by ensuring that all IOs issued by the
sequential scrub code include all DVAs. Initially, only the first
DVA of each is attempted. If an issue arises, the IO is retried
with all available copies, giving the self healing code a chance
to correct the issue.
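
A rough model of the retry behaviour described above: read the first
copy, and only on failure fall back to all copies so that a good one
can repair the bad ones.  The CRC32 check, the list-of-bytes
representation of DVA copies, and the function name are assumptions
made for illustration; they are not how the scrub code is written.

```python
import zlib


def scrub_block(copies, expected_crc):
    """Hypothetical sketch: return True if the block is readable.

    `copies` is a mutable list with one bytes object per DVA copy.
    """
    def is_good(data):
        return zlib.crc32(data) == expected_crc

    # First attempt: only the first copy is read.
    if is_good(copies[0]):
        return True

    # Retry with every copy so a good one can heal the bad ones.
    good = next((c for c in copies if is_good(c)), None)
    if good is None:
        return False
    for i, data in enumerate(copies):
        if not is_good(data):
            copies[i] = good
    return True


payload = b"block contents"
copies = [b"block c0ntents", payload, payload]
healed = scrub_block(copies, zlib.crc32(payload))
```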

To test this change, this patch also adds the ability for zinject
to specify individual DVAs to inject read errors into. We then
add a new test case that utilizes this functionality to ensure
scrubs and self-healing reads can handle and transparently fix
issues with individual copies of blocks.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8453
2019-03-15 14:14:31 -07:00
Tony Hutter 2bbec1c910 Make zpool status counters match error events count
The number of IO and checksum events should match the number of errors
seen in zpool status.  Previously there was a mismatch between the
two counts because zpool status would only count unrecovered errors,
while zpool events would get an event for *all* errors (recovered or
not).  This led to situations where disks could be faulted for
"too many errors", while at the same time showing zero errors in zpool
status.

This fixes the zpool status error counters to increment at the same
times we post the error events.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #4851 
Closes #7817
2019-03-14 18:21:53 -07:00
Tom Caputi 04a3b0796c Fix memory leaks in zfsvfs_create_impl()
This patch simply fixes some small memory leaks that can happen
during error handling in zfsvfs_create_impl(). If the function
fails, it frees all the memory / references it created.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8490
2019-03-14 18:14:36 -07:00
Henrik Riomar c742bf1e68 zfs-import: should be before swap
zfs-import must be done before swap in order for swap on zvol to work

Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Henrik Riomar <henrik.riomar@gmail.com>
Closes #8502
2019-03-14 18:12:17 -07:00
Tom Caputi eaed840542 Better user experience for errata 4
This patch attempts to address some user concerns that have arisen
since errata 4 was introduced.

* The errata warning has been made less scary for users without
  any encrypted datasets.

* The errata warning now clears itself without a pool reimport if
  the bookmark_v2 feature is enabled and no encrypted datasets
  exist.

* It is no longer possible to create new encrypted datasets without
  enabling the bookmark_v2 feature, thus helping to ensure that the
  errata is resolved.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue #8308
Closes #8504
2019-03-14 16:48:30 -07:00
kpande 98310e5d1a Update commented zed.rc values to defaults
Update zed.rc values to reflect their default values.  This helps
avoid confusion if a user expects functionality to be enabled.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Kash Pande <kash@tripleback.net>
Closes #8498
2019-03-14 09:53:34 -07:00
Igor K 508c5527d0 Use 'printf %s' instead of 'echo -n' for compatibility
The ksh 'echo -n' behavior on Illumos and Linux differs.  For
compatibility with other platforms switch to "printf '%s' ".

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes #8501
2019-03-13 18:39:12 -07:00
Alexander Motin 1af240f3b5 Add separate aggregation limit for non-rotating media
Before the sequential scrub patches, ZFS never aggregated I/Os above
128KB.  Sequential scrub bumped that to 1MB, supposedly to reduce the
number of head seeks on spinning disks.  But for SSDs it makes little
to no sense, especially on FreeBSD, where due to the MAXPHYS limitation
the device will likely still see a bunch of 128KB I/Os instead of one
large one.  Having a stricter aggregation limit for SSDs avoids
allocating a large memory buffer and copying to/from it, which is a
serious problem when throughput reaches gigabytes per second.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Closes #8494
2019-03-13 12:00:10 -07:00
kpande 12a935ee9c Update CONTRIBUTING to point users to IRC as well as mailing list
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Kash Pande <kash@tripleback.net>
Closes #8466
2019-03-13 11:57:57 -07:00
Tom Caputi 5cc9ba5cf0 Make zstreamdump -v more greppable
Currently, the verbose output of zstreamdump includes new line
characters within some individual records. Presumably, this was
originally done to keep the output from getting too wide to fit
on a terminal. However, since new flags and struct members have
been added, these rules have not been maintained consistently. In
addition, these newlines can make it hard to grep the output in
some scenarios. This patch simply removes these newlines, making
the output easier to grep and removing the inconsistency.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8493
2019-03-13 11:19:23 -07:00
Andrew Stormont 1814242379 OpenZFS 9914 - NV_UNIQUE_NAME_TYPE broken after 9580
Authored by: Andrew Stormont <astormont@racktopsystems.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Andy Fiddaman <omnios@citrus-it.co.uk>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9914
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/b8a5bee18
Closes #8496
2019-03-13 11:16:30 -07:00
Tom Caputi f00ab3f22c Detect and prevent mixed raw and non-raw sends
Currently, there is an issue in the raw receive code where
raw receives are allowed to happen on top of previously
non-raw received datasets. This is a problem because the
source-side dataset doesn't know about how the blocks on
the destination were encrypted. As a result, any MAC in
the objset's checksum-of-MACs tree that is a parent of both
blocks encrypted on the source and blocks encrypted by the
destination will be incorrect. This will result in
authentication errors when we decrypt the dataset.

This patch fixes this issue by adding a new check to the
raw receive code. The code now maintains an "IVset guid",
which acts as an identifier for the set of IVs used to
encrypt a given snapshot. When a snapshot is raw received,
the destination snapshot will take this value from the
DRR_BEGIN payload. Non-raw receives and normal "zfs snap"
operations will cause ZFS to generate a new IVset guid.
When a raw incremental stream is received, ZFS will check
that the "from" IVset guid in the stream matches that of
the "from" destination snapshot. If they do not match, the
code will error out the receive, preventing the problem.
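
The check described above can be sketched roughly as follows.  The
class, the function, and the exception used here are hypothetical
names for illustration; only the rule itself (the "from" IVset guid in
the stream must match the "from" snapshot on the destination) comes
from this message.

```python
import secrets


class Snapshot:
    """Hypothetical stand-in for a snapshot carrying an IVset guid."""

    def __init__(self, name, ivset_guid=None):
        self.name = name
        # A raw receive copies the guid from the stream; any other way
        # of creating a snapshot generates a fresh one.
        if ivset_guid is None:
            ivset_guid = secrets.randbits(64)
        self.ivset_guid = ivset_guid


def check_raw_incremental(stream_from_guid, dest_from_snapshot):
    """Reject a raw incremental whose 'from' IVset guid does not match."""
    if stream_from_guid != dest_from_snapshot.ivset_guid:
        raise ValueError("IVset guid mismatch on " + dest_from_snapshot.name)


source = Snapshot("pool/fs@a", ivset_guid=0x1234)
raw_received = Snapshot("backup/fs@a", ivset_guid=source.ivset_guid)
check_raw_incremental(source.ivset_guid, raw_received)  # accepted
```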

This patch requires an on-disk format change to add the
IVset guids to snapshots and bookmarks. As a result, this
patch has errata handling and a tunable to help affected
users resolve the issue with as little interruption as
possible.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8308
2019-03-13 11:00:43 -07:00
Tom Caputi 579ce7c5ae Add bookmark v2 on-disk feature
This patch adds the bookmark v2 feature to the on-disk format. This
feature will be needed for the upcoming redacted sends and for an
upcoming fix for raw receives. The feature is not currently
used by any code and thus this change is a no-op, aside from the
fact that the user can now enable the feature.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue #8308
2019-03-13 10:58:39 -07:00
Tom Caputi 369aa501d1 Fix handling of maxblkid for raw sends
Currently, the receive code can create an unreadable dataset from
a correct raw send stream. This is because it is currently
impossible to set maxblkid to a lower value without freeing the
associated object. This means truncating files on the send side
to a non-0 size could result in corruption. This patch solves this
issue by adding a new 'force' flag to dnode_new_blkid() which will
allow the raw receive code to force the DMU to accept the provided
maxblkid even if it is a lower value than the existing one.

For testing purposes the send_encrypted_files.ksh test has been
extended to include a variety of truncated files and multiple
snapshots. It also now leverages the xattrtest command to help
ensure raw receives correctly handle xattrs.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8168 
Closes #8487
2019-03-13 10:52:01 -07:00
jwittlincohen 146bdc414c Fix typo in arc_summary3
This is a simple fix for a typo ("perfetch" rather than "prefetch")
in arc_summary3.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Jason Cohen <jwittlincohen@gmail.com>
Closes #8499
2019-03-13 10:43:55 -07:00
Olaf Faaland db2af93d72 Increase default zfs_multihost_fail_intervals and import_intervals
By default, when multihost is enabled for a pool, the pool is
suspended if (zfs_multihost_fail_intervals*zfs_multihost_interval) ms
pass without a successful MMP write.  This is the recommended
configuration.

The default value for zfs_multihost_fail_intervals has been 5, and the
default value for zfs_multihost_interval has been 1000, so pool
suspension occurred at 5 seconds.

There have been multiple cases where a single misbehaving device in a
pool triggered a SCSI reset, and all I/O paused for 5-6 seconds.  This
in turn caused MMP to suspend the pool.

In the cases observed, the rest of the devices were healthy and the
pool was otherwise correctly performing I/O.  The reset was handled
correctly by ZFS, and by suspending the pool MMP made replacing the
device more difficult as well as forcing the host to be rebooted.

Increase the default value of zfs_multihost_fail_intervals to 10, so
that MMP tolerates up to 10 seconds of failed MMP writes before
suspending the pool.

Increase the default value of zfs_multihost_import_intervals to 20, to
maintain the 2:1 safety factor.  This results in a force import taking
approximately 20 seconds when MMP is enabled, with default values.
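
The timings quoted above follow directly from multiplying the tunables.
A small sketch of that arithmetic, with hypothetical helper names and
the 1000 ms default interval taken from this message:

```python
INTERVAL_MS = 1000  # default zfs_multihost_interval, per the text above


def suspend_timeout_s(fail_intervals, interval_ms=INTERVAL_MS):
    """Seconds without a successful MMP write before pool suspension."""
    return fail_intervals * interval_ms / 1000


def import_wait_s(import_intervals, interval_ms=INTERVAL_MS):
    """Approximate seconds a force import waits with default values."""
    return import_intervals * interval_ms / 1000


old_timeout = suspend_timeout_s(5)    # previous default
new_timeout = suspend_timeout_s(10)   # new default
new_import = import_wait_s(20)        # new default
safety_factor = new_import / new_timeout
```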

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7709 
Closes #8495
2019-03-13 09:50:48 -07:00
Justin Gottula cffa8372f4 Fix most zfs_arc_* mod params not actually being modifiable at runtime
Most of the zfs_arc_* module parameters do not have their values used by
the ARC code directly. Instead, there is a function, arc_tuning_update,
which is called during module initialization and periodically
thereafter, whose job is to fetch the module parameter values, clamp/
limit them appropriately, and then assign those values to a separate set
of internal variables that are actually referenced by the ARC code.

Commit 3ec34e55 featured an overhaul of arc_reclaim_thread, which is the
former location where the post-init-time calls to arc_tuning_update
would occur. The rework split the work previously done by the
arc_reclaim_thread into a pair of replacement threads; and
unfortunately, the call to arc_tuning_update fell through the cracks and
was lost in the reorganization.

This meant that changing almost any ARC-related zfs module parameter via
/sys/module/zfs/parameters/ would result in the module parameter value
itself appearing to change; however the modification would not actually
propagate to the ARC code and have any real effect.

This commit reinstates the post-init-time call to arc_tuning_update. It
is now called during arc_adjust_cb_check; this should be equivalent to
its former call location in arc_reclaim_thread.
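
To illustrate why the missing call mattered, here is a toy model of the
pattern described above: writable parameters are only copied (and
clamped) into the values the cache actually uses when an update
function runs.  All names and limits below are made up for the sketch
and do not correspond to the real ARC code.

```python
class ToyArc:
    """Hypothetical model of parameters versus internal tuned values."""

    def __init__(self, params, floor, ceiling):
        self.params = params      # stands in for the module parameters
        self.floor = floor
        self.ceiling = ceiling
        self.arc_max = ceiling
        self.tuning_update()      # init-time call

    def tuning_update(self):
        requested = self.params.get("arc_max", 0)
        if requested:             # 0 means "keep the current value"
            self.arc_max = min(max(requested, self.floor), self.ceiling)

    def periodic_check(self):
        # Without this call, later parameter writes never take effect.
        self.tuning_update()


params = {"arc_max": 0}
arc = ToyArc(params, floor=64, ceiling=4096)
params["arc_max"] = 1024          # parameter appears to change...
before = arc.arc_max              # ...but the internal value has not
arc.periodic_check()
after = arc.arc_max
```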

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Justin Gottula <justin@jgottula.com>
Closes #8405 
Closes #8463
2019-03-12 15:03:59 -07:00
Alek P 4c0883fb4a Avoid retrieving unused snapshot props
This patch modifies the zfs_ioc_snapshot_list_next() ioctl to enable it
to take input parameters that alter the way looping through the list of
snapshots is performed. The idea here is to restrict functions that
throw away some of the snapshots returned by the ioctl to a range of
snapshots that these functions actually use. This improves efficiency
and execution speed for some rollback and send operations.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #8077
2019-03-12 13:13:22 -07:00
Brian Behlendorf dd785b5b86 Fix vdev_initialize_restart / removal race
Resolve a vdev_initialize crash uncovered by ztest.  Similar
to when starting a new initialization verify that a removal
is not in progress.  Additionally, do not restart when the
thread already exists.  This check is now congruent with the
POOL_INITIALIZE_DO handling in spa_vdev_initialize_impl().

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8477
2019-03-12 10:39:47 -07:00
Olaf Faaland 3d31aad83e MMP writes rotate over leaves
Instead of choosing a leaf vdev quasi-randomly, by starting at the root
vdev and randomly choosing children, rotate over leaves to issue MMP
writes.  This fixes an issue in a pool whose top-level vdevs have
different numbers of leaves.

The issue is that the frequency at which individual leaves are chosen
for MMP writes is based not on the total number of leaves but based on
how many siblings the leaves have.

For example, in a pool like this:

       root-vdev
   +------+---------------+
vdev1                   vdev2
  |                       |
  |                +------+-----+-----+----+
disk1             disk2 disk3 disk4 disk5 disk6

vdev1 and vdev2 will each be chosen 50% of the time.  Every time vdev1
is chosen, disk1 will be chosen.  However, every time vdev2 is chosen,
disk2 is chosen 20% of the time.  As a result, disk1 will be sent 5x as
many MMP writes as disk2.

This may create wear issues in the case of SSDs.  It also reduces the
effectiveness of MMP as it depends on the writes being evenly
distributed for the case where some devices fail or are partitioned.

The new code maintains a list of leaf vdevs in the pool.  MMP records
the last leaf used for an MMP write in mmp->mmp_last_leaf.  To choose
the next leaf, MMP starts at mmp->mmp_last_leaf and traverses the list,
continuing from the head if the tail is reached.  It stops when a
suitable leaf is found or all leaves have been examined.
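
The skew in the example pool, and the rotation that replaces it, can be
sketched as below.  The dictionary layout, function names, and the
choice to begin the scan at the leaf following the last one used are
assumptions made for illustration only.

```python
from fractions import Fraction

pool = {
    "vdev1": ["disk1"],
    "vdev2": ["disk2", "disk3", "disk4", "disk5", "disk6"],
}


def random_descent_probability(pool):
    """Chance each leaf is picked when descending randomly from the root."""
    probs = {}
    for leaves in pool.values():
        for leaf in leaves:
            probs[leaf] = Fraction(1, len(pool)) * Fraction(1, len(leaves))
    return probs


def next_leaf(leaves, last_index, is_suitable):
    """Rotate over the leaf list, wrapping from the tail to the head.

    Returns the index of the next suitable leaf, or None once every
    leaf has been examined without success.
    """
    count = len(leaves)
    for step in range(1, count + 1):
        index = (last_index + step) % count
        if is_suitable(leaves[index]):
            return index
    return None


probs = random_descent_probability(pool)
leaves = [leaf for group in pool.values() for leaf in group]
```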

Added a test to verify MMP write distribution is even.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7953
2019-03-12 10:37:06 -07:00
George Wilson b1b94e9644 zfs does not honor NFS sync write semantics
The Linux kernel's nfsd implementation uses RWF_SYNC to determine if the
write is synchronous or not. This flag is used to set the kernel's I/O
control block flags. Unfortunately, ZFS was not updated to inspect these
flags so NFS sync writes were not being honored.

This change maps the IOCB_* flags to the ZFS equivalent.

Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes #8474 
Closes #8452 
Closes #8486
2019-03-11 09:13:37 -07:00
mzhivich 1118f99449 Fix lockdep between ds_lock and dd_lock in dsl_dataset_namelen()
Booting a debug kernel revealed an inconsistent lock dependency between
dataset's ds_lock and its directory's dd_lock.

[ 32.215336] ======================================================
[ 32.221859] WARNING: possible circular locking dependency detected
[ 32.221861] 4.14.90+ #8 Tainted: G           O
[ 32.221862] ------------------------------------------------------
[ 32.221863] dynamic_kernel_/4667 is trying to acquire lock:
[ 32.221864]  (&ds->ds_lock){+.+.}, at: [<ffffffffc10a4bde>] dsl_dataset_check_quota+0x9e/0x8a0 [zfs]
[ 32.221941] but task is already holding lock:
[ 32.221941]  (&dd->dd_lock){+.+.}, at: [<ffffffffc10cd8e9>] dsl_dir_tempreserve_space+0x3b9/0x1290 [zfs]
[ 32.221983] which lock already depends on the new lock.
[ 32.221983] the existing dependency chain (in reverse order) is:
[ 32.221984] -> #1 (&dd->dd_lock){+.+.}:
[ 32.221992] 	__mutex_lock+0xef/0x14c0
[ 32.222049] 	dsl_dir_namelen+0xd4/0x2d0 [zfs]
[ 32.222093] 	dsl_dataset_namelen+0x2f1/0x430 [zfs]
[ 32.222142] 	verify_dataset_name_len+0xd/0x40 [zfs]
[ 32.222184] 	dmu_objset_find_dp_impl+0x5f5/0xef0 [zfs]
[ 32.222226] 	dmu_objset_find_dp_cb+0x40/0x60 [zfs]
[ 32.222235] 	taskq_thread+0x969/0x1460 [spl]
[ 32.222238] 	kthread+0x2fb/0x400
[ 32.222241] 	ret_from_fork+0x3a/0x50

[ 32.222241] -> #0 (&ds->ds_lock){+.+.}:
[ 32.222246] 	lock_acquire+0x14f/0x390
[ 32.222248] 	__mutex_lock+0xef/0x14c0
[ 32.222291] 	dsl_dataset_check_quota+0x9e/0x8a0 [zfs]
[ 32.222355] 	dsl_dir_tempreserve_space+0x5d2/0x1290 [zfs]
[ 32.222392] 	dmu_tx_assign+0xa61/0xdb0 [zfs]
[ 32.222436] 	zfs_create+0x4e6/0x11d0 [zfs]
[ 32.222481] 	zpl_create+0x194/0x340 [zfs]
[ 32.222484] 	lookup_open+0xa86/0x16f0
[ 32.222486] 	path_openat+0xe56/0x2490
[ 32.222488] 	do_filp_open+0x17f/0x260
[ 32.222490] 	do_sys_open+0x195/0x310
[ 32.222491] 	SyS_open+0xbf/0xf0
[ 32.222494] 	do_syscall_64+0x191/0x4f0
[ 32.222496] 	entry_SYSCALL_64_after_hwframe+0x42/0xb7

[ 32.222497] other info that might help us debug this:

[ 32.222497] Possible unsafe locking scenario:
[ 32.222498] CPU0 			CPU1
[ 32.222498] ---- 			----
[ 32.222499] lock(&dd->dd_lock);
[ 32.222500] 				lock(&ds->ds_lock);
[ 32.222502] 				lock(&dd->dd_lock);
[ 32.222503] lock(&ds->ds_lock);
[ 32.222504] *** DEADLOCK ***
[ 32.222505] 3 locks held by dynamic_kernel_/4667:
[ 32.222506] #0: (sb_writers#9){.+.+}, at: [<ffffffffaf68933c>] mnt_want_write+0x3c/0xa0
[ 32.222511] #1: (&type->i_mutex_dir_key#8){++++}, at: [<ffffffffaf652cde>] path_openat+0xe2e/0x2490
[ 32.222515] #2: (&dd->dd_lock){+.+.}, at: [<ffffffffc10cd8e9>] dsl_dir_tempreserve_space+0x3b9/0x1290 [zfs]

The issue is caused by dsl_dataset_namelen() holding ds_lock, followed by
acquiring dd_lock on ds->ds_dir in dsl_dir_namelen().

However, ds->ds_dir should not be protected by ds_lock, so releasing it before
the call to dsl_dir_namelen() prevents the lockdep issue.

Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Michael Zhivich <mzhivich@akamai.com>
Closes #8413
2019-03-11 09:11:04 -07:00
Lorenz Brun bf90948daf Reorder ZFS ioctls to fix cross-version compatibility
Reorder ZFS ioctls to fix cross-version compatibility.

Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Lorenz Brun <lorenz@dolansoft.org>
Closes #8484
2019-03-09 13:39:31 -08:00
Brian Behlendorf b46fd243d5 Linux 5.1 compat: get_ds() removed
Commit torvalds/linux@736706bee has removed the get_ds() function
as a bit of cleanup.  It has been defined as KERNEL_DS on all
architectures for all supported kernels.  Replace get_ds() with
KERNEL_DS as was done in the kernel.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8479
2019-03-07 14:44:23 -08:00
Tony Hutter becdcec7b9 kernel_fpu fixes
This patch fixes a few issues when detecting which kernel_fpu functions
are available.

- Use kernel_fpu_begin() if it's exported on newer kernels.

- Use ZFS_LINUX_TRY_COMPILE_SYMBOL() to choose the right kernel_fpu
  function when using --enable-linux-builtin.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #8259
Closes #8363
2019-03-06 16:03:03 -08:00
Paul Zuchowski a73e8fdb93 Stack overflow in recursive bpobj_iterate_impl
The function bpobj_iterate_impl overflows the stack when bpobjs
are deeply nested. Rewrite the function to eliminate the recursion.
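
The general technique of replacing recursion with an explicit stack can
be sketched as follows.  The nested-dictionary model of a bpobj and the
visiting order are assumptions for illustration and do not mirror the
rewritten function.

```python
def iterate_bpobj(root, visit):
    """Visit every block pointer without recursing into sub-objects.

    Each hypothetical bpobj is a dict with a "blocks" list and a
    "subobjs" list of nested bpobjs.
    """
    stack = [root]                # heap-allocated work list
    while stack:
        obj = stack.pop()
        for bp in obj["blocks"]:
            visit(bp)
        # Push children in reverse so they are popped in listed order.
        stack.extend(reversed(obj["subobjs"]))


# Nesting this deep would exceed Python's default recursion limit.
deep = {"blocks": [0], "subobjs": []}
for depth in range(1, 20000):
    deep = {"blocks": [depth], "subobjs": [deep]}

seen = []
iterate_bpobj(deep, seen.append)
```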

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #7674
Closes #7675 
Closes #7908
2019-03-06 09:50:55 -08:00
Brian Behlendorf 96ebc5a1a4 Fix race in vdev_initialize_thread
Before allowing new allocations to the metaslab we need to ensure
that any issued initializing writes have been synced.  Otherwise,
it's possible for metaslab_block_alloc() to allocate a range which
is about to be overwritten by an initializing IO.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8461
2019-03-06 09:17:53 -08:00
Rafael Kitover 762f9ef3d9 config: better libtirpc detection
Improve the autoconf code for finding libtirpc and do not assume the
headers are in /usr/include/tirpc.

Also remove this assumption from the `rpc/xdr.h` header in libspl and
use the same `#include_next` mechanism that is used for other libspl
headers.

Include pkg.m4 from pkg-config in config/ for PKG_CHECK_MODULES(), the
file license allows this.

Include ax_save_flags.m4 and ax_restore_flags.m4 from autoconf-archive,
the file licenses are compatible. Use the 2012 versions so as not to rely
on a more recent autoconf feature AS_VAR_COPY(), which breaks some build
slaves.

Add new macro library `config/find_system_library.m4` which defines the
FIND_SYSTEM_LIBRARY() macro which is a convenience wrapper over using
PKG_CHECK_MODULES() with a fallback to standard library locations and
some sanity checks.

The parameters are:

```
FIND_SYSTEM_LIBRARY(VARIABLE-PREFIX, MODULE, HEADER, HEADER-PREFIXES,
    LIBRARY, FUNCTIONS, [ACTION-IF-FOUND], [ACTION-IF-NOT-FOUND])
```

`HEADER-PREFIXES` and `FUNCTIONS` are comma-separated m4 lists.

For libtirpc we are using:

```
FIND_SYSTEM_LIBRARY(LIBTIRPC, [libtirpc], [rpc/xdr.h], [tirpc], [tirpc],
    [xdrmem_create], [], [...])
```

The headers are first checked for without the prefixes and then with.

This system works with pkg-config and falls back on checking standard
header/library locations. It can be easily overridden by the user by
setting the `PREFIX_CFLAGS` and `PREFIX_LIBS` variables which are
automatically added to the `./configure --help` output.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rafael Kitover <rkitover@gmail.com>
Closes #7422 
Closes #8313
2019-03-02 16:19:05 -08:00
Matthew Ahrens 0409679d88 Fix style of spl_kmem_cache_create()
Fix indentation of code in ifdef's.
Remove obsolete comment.
Make if/else statements more readable by adding braces.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8459
2019-02-28 17:57:47 -08:00
Olaf Faaland 8133679ff0 Do not resume a pool if multihost is enabled
When multihost is enabled, and a pool is suspended, return
EINVAL in response to "zpool clear <pool>".  The pool
may have been imported on another host while I/O was suspended.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #6933 
Closes #8460
2019-02-28 17:56:19 -08:00
Olaf Faaland 4f3218aed8 Warn user about accidentally sharing devices
Improve the man page text to warn the user about the risk of adding
the same device to multiple pools via simultaneous "zpool create",
"zpool add", "zpool replace", etc.

State that MMP/multihost does not protect against these scenarios.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #6473 
Closes #8457
2019-02-28 17:54:36 -08:00
Matthew Ahrens 87c25d567f abd_alloc should use scatter for >1K allocations
abd_alloc() normally does scatter allocations, thus solving the problem
that ABD originally set out to: the bulk of ZFS's allocations are single
pages, which are faster to allocate and free, and don't suffer from
internal fragmentation (and the inability to reclaim memory because some
buffers in the slab are still allocated).

However, the current code does linear allocations for 4KB and smaller
allocations, defeating the purpose of ABD.

Scatter ABD's use at least one page each, so sub-page allocations waste
some space when allocated as scatter (e.g. 2KB scatter allocation wastes
half of each page).  Using linear ABD's for small allocations means that
they will be put on slabs which contain many allocations.  This can
improve memory efficiency, but it also makes it much harder for ARC
evictions to actually free pages, because all the buffers on one slab
need to be freed in order for the slab (and underlying pages) to be
freed.  Typically, 512B and 1KB kmem caches have 16 buffers per slab, so
it's possible for them to actually waste more memory than scatter (one
page per buf = wasting 3/4 or 7/8th; one buf per slab = wasting
15/16th).
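
The fractions quoted above can be checked with a few lines of arithmetic.
This is only a sketch: the 4096-byte page size is an assumption, and the
helper names are hypothetical.

```python
from fractions import Fraction

PAGE_SIZE = 4096      # assumed page size in bytes
BUFS_PER_SLAB = 16    # "typically" 16 buffers per slab, per the text above


def scatter_waste(buf_size, page_size=PAGE_SIZE):
    """Fraction of a page wasted when a sub-page buffer gets a whole page."""
    return Fraction(page_size - buf_size, page_size)


def slab_waste(live_bufs, bufs_per_slab=BUFS_PER_SLAB):
    """Fraction of a slab held but unused when only live_bufs remain allocated."""
    return Fraction(bufs_per_slab - live_bufs, bufs_per_slab)


print(scatter_waste(2048))  # 2KB scatter allocation
print(scatter_waste(1024))  # 1KB buffer, one page per buf
print(scatter_waste(512))   # 512B buffer, one page per buf
print(slab_waste(1))        # one live buffer pinning a whole slab
```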

Spill blocks are typically 512B and are heavily used on systems running
selinux with the default dnode size and the `xattr=sa` property set.

By default we will use linear allocations for 512B and 1KB, and scatter
allocations for larger (1.5KB and up).

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: DHE <git@dehacked.net>
Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8455
2019-02-28 17:52:55 -08:00
beren12 3a1f2d533d Remove zfs-zed hard dep from zfs-share init script
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: bunder2015 <omfgbunder@gmail.com>
Signed-off-by: Chris Zubrzycki <github@mid-earth.net>
Closes #8447
2019-02-28 12:07:03 -08:00
Michael Niewöhner 46164122c0 initramfs/debian: use panic() instead of directly calling /bin/sh
Debian has a panic() function which makes it possible to disable shell
access in initramfs by setting the panic kernel parameter. Use it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kash Pande <kash@tripleback.net>
Signed-off-by: Michael Niewöhner <foss@mniewoehner.de>
Closes #8448
2019-02-28 12:05:55 -08:00
Allan Jude d6838ae649 zstreamdump: include embedded writes when dumping raw data (-d)
When feeding a replication stream to `zstreamdump -d` (raw dump mode),
it does not print the raw data for DRR_WRITE_EMBEDDED records.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Closes #8430
2019-02-27 17:55:25 -08:00
Brian Behlendorf 6af7ba417e Fix overly broad spa config lock
The spa_txg_history_init_io() and spa_txg_history_fini_io() functions were
mistakenly taking SCL_ALL when only SCL_CONFIG is required to
access the vdev stats.  This could result in a deadlock which
was observed when running ztest.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8445
2019-02-27 10:49:22 -08:00
Matthew Ahrens c568ab8d99 zfs.8 has wrong description of "zfs program -t"
The "-t" argument to "zfs program" specifies a limit on the number of
LUA instructions that can be executed.  The zfs.8 manpage has the wrong
description.  It should be updated to match what's in zfs-program.8.

Also fix the formatting of the zfs help message.

Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8410
2019-02-26 11:15:28 -08:00
kpande 47d7ef5490 Sort by full path name instead of by GUID when importing
Preferentially sort by the full path name instead of GUID when determining
which device links to use.  This helps ensure that the pool vdevs are named
consistently when multiple links for a device appear in the same directory.
For example, the /dev/disk/by-id/scsi* and /dev/disk/by-id/wwn* links.

Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kash Pande <kash@tripleback.net>
Closes #8108 
Closes #8440
2019-02-26 11:13:15 -08:00
Damian Wojsław e065034563 Improve error message for zfs create with @ or # in name
Reorder the `zfs create` error messages in order to return the most
specific one first.  If none of them apply then an expanded version of
the invalid name message is used.

Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Damian Wojsław <damian@wojslaw.pl>
Closes #8155 
Closes #8352
2019-02-25 11:20:07 -08:00
DeHackEd ba7b05cb25 zfs(8): improve document of compression behaviours
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: bunder2015 <omfgbunder@gmail.com>
Signed-off-by: DHE <git@dehacked.net>
Closes #4660 
Closes #8423
2019-02-25 11:10:16 -08:00
Serapheim Dimitropoulos 8eef997679 Error path in metaslab_load_impl() forgets to drop ms_sync_lock
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8444
2019-02-25 11:08:52 -08:00
loli10K c44a3ec059 zvol: allow rename of in use ZVOL dataset
While ZFS allows renaming of in-use ZVOLs at the DSL level without issues,
the ZVOL layer does not correctly update the renamed dataset if the
device node is open (zv->zv_open_count > 0): trying to access the stale
dataset name, for instance during a zfs receive, will cause the
following failure:

VERIFY3(zv->zv_objset->os_dsl_dataset->ds_owner == zv) failed ((null) == ffff8800dbb6fc00)
PANIC at zvol.c:1255:zvol_resume()
Showing stack for process 1390
CPU: 0 PID: 1390 Comm: zfs Tainted: P           O  3.16.0-4-amd64 #1 Debian 3.16.51-3
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 0000000000000000 ffffffff8151ea00 ffffffffa0758a80 ffff88028aefba30
 ffffffffa0417219 ffff880037179220 ffffffff00000030 ffff88028aefba40
 ffff88028aefb9e0 2833594649524556 6f5f767a3e2d767a 6f3e2d7465736a62
Call Trace:
 [<0>] ? dump_stack+0x5d/0x78
 [<0>] ? spl_panic+0xc9/0x110 [spl]
 [<0>] ? mutex_lock+0xe/0x2a
 [<0>] ? zfs_refcount_remove_many+0x1ad/0x250 [zfs]
 [<0>] ? rrw_exit+0xc8/0x2e0 [zfs]
 [<0>] ? mutex_lock+0xe/0x2a
 [<0>] ? dmu_objset_from_ds+0x9a/0x250 [zfs]
 [<0>] ? dmu_objset_hold_flags+0x71/0xc0 [zfs]
 [<0>] ? zvol_resume+0x178/0x280 [zfs]
 [<0>] ? zfs_ioc_recv_impl+0x88b/0xf80 [zfs]
 [<0>] ? zfs_refcount_remove_many+0x1ad/0x250 [zfs]
 [<0>] ? zfs_ioc_recv+0x1c2/0x2a0 [zfs]
 [<0>] ? dmu_buf_get_user+0x13/0x20 [zfs]
 [<0>] ? __alloc_pages_nodemask+0x166/0xb50
 [<0>] ? zfsdev_ioctl+0x896/0x9c0 [zfs]
 [<0>] ? handle_mm_fault+0x464/0x1140
 [<0>] ? do_vfs_ioctl+0x2cf/0x4b0
 [<0>] ? __do_page_fault+0x177/0x410
 [<0>] ? SyS_ioctl+0x81/0xa0
 [<0>] ? async_page_fault+0x28/0x30
 [<0>] ? system_call_fast_compare_end+0x10/0x15

Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6263 
Closes #8371
2019-02-22 15:38:42 -08:00
loli10K 0c637f3100 zpool reports 16E expandsize on disks with oddball number of sectors
The issue is caused by a small discrepancy in how userland creates the
partition layout and the kernel estimates available space:

 * zpool command: subtract 9M from the usable device size, then align
   to 1M boundary. 9M is the sum of 1M "start" partition alignment + 8M
   EFI "reserved" partition.

 * kernel module: subtract 10M from the device size. 10M is the sum of
   1M "start" partition alignment + 1M "end" partition alignment + 8M
   EFI "reserved" partition.

For devices where the number of sectors is not a multiple of the
alignment size the zpool command will create a partition layout which
reserves less than 1M after the 8M EFI "reserved" partition:

  Disk /dev/sda: 1024 MiB, 1073739776 bytes, 2097148 sectors
  Units: sectors of 1 * 512 = 512 bytes
  Sector size (logical/physical): 512 bytes / 512 bytes
  I/O size (minimum/optimal): 512 bytes / 512 bytes
  Disklabel type: gpt
  Disk identifier: 49811D40-16F4-4E41-84A9-387703950D7F

  Device       Start     End Sectors  Size Type
  /dev/sda1     2048 2078719 2076672 1014M Solaris /usr & Apple ZFS
  /dev/sda9  2078720 2095103   16384    8M Solaris reserved 1

When the kernel module calls vdev_open() on the device, its max_asize ends up
being slightly smaller than asize: this results in a huge number (16E)
reported by metaslab_class_expandable_space().

This change prevents bdev_max_capacity() from returning a size smaller
than bdev_capacity().
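
A simplified model of the arithmetic, using the sector counts from the
fdisk output above, shows where the 16E figure comes from. The real asize
and max_asize also account for labels and alignment, so treat the names and
the exact numbers below as an illustrative sketch only.

```python
MIB = 1 << 20
U64 = 1 << 64                        # sizes are unsigned 64-bit in the kernel

device_bytes = 2097148 * 512         # whole disk: 1073739776 bytes
partition_bytes = 2076672 * 512      # /dev/sda1 as created by zpool (1014M)
kernel_estimate = device_bytes - 10 * MIB   # kernel's "device size minus 10M"


def expandable_space(max_size, cur_size):
    """Unsigned 64-bit subtraction, as C would compute max_size - cur_size."""
    return (max_size - cur_size) % U64


# Without a clamp the estimate is 2048 bytes below the current size,
# so the subtraction wraps around to almost 2**64 bytes (about 16 EiB).
unclamped = expandable_space(kernel_estimate, partition_bytes)

# With the clamp the maximum can never be smaller than the current size.
clamped = expandable_space(max(kernel_estimate, partition_bytes),
                           partition_bytes)

print(unclamped, clamped)
```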

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #1468 
Closes #8391
2019-02-22 15:36:34 -08:00
lidongyang 8d9e51c084 Fix dnode_hold_impl() soft lockup
Soft lockups could happen when multiple threads try
to get zrl on the same dnode handle in order to allocate
and initialize the dnode marked as DN_SLOT_ALLOCATED.

Don't loop from the beginning when we can't get zrl, otherwise
we would increase the zrl refcount and nobody can actually
lock it.

Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Li Dongyang <dongyangli@ddn.com>
Closes #8433
2019-02-22 09:48:37 -08:00
kpande f8bb2a7e0c Clarify zpool iostat statistics reporting
Document expected behavior for zpool iostat statistics reporting.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Allan Jude <allanjude@freebsd.org>
Signed-off-by: Kash Pande <kash@tripleback.net>
Closes #2888 
Closes #8417
2019-02-21 14:00:48 -08:00
Anatoly Borodin f23b0242b6 Fix '-T u|d' descriptions in zpool(8)
In

	-T u|d  Display a time stamp.  Specify -u for a printed
		representation of the internal representation of time.
		See time(2).  Specify -d for standard date format.
		See date(1).

'Specify u' and 'Specify d' should be used instead. `zpool list -T -u`
does not work.

Bring the descriptions in `zpool list` and `zpool status` in sync with
`zpool iostat`.

Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Anatoly Borodin <anatoly.borodin@gmail.com>
Closes #8438
2019-02-21 11:22:06 -08:00
Tomohiro Kusumi 9abbee4912 Don't enter zvol's rangelock for read bio with size 0
The SCST driver (SCSI target driver implementation) and possibly
others may issue read bio's with a length of zero bytes. Although
this is unusual, such bio's issued under certain conditions can cause
a kernel oops, due to how rangelock is implemented.

rangelock_add_reader() is not made to handle overlap of two (or more)
ranges from read bio's with the same offset when one of them has size
of 0, even though they conceptually overlap. Allowing them to enter
rangelock results in a kernel oops by dereferencing an invalid pointer,
or an assertion failure on AVL tree manipulation with a debug-enabled
kernel module.

For example, this happens when read bio whose (offset, size) is
(0, 0) enters rangelock followed by another read bio with (0, 4096)
when (0, 0) rangelock is still locked, when there are no pending
write bio's. It can also happen with reverse order, which is (0, N)
followed by (0, 0) when (0, N) is still locked. More details
mentioned in #8379.

Kernel Oops on ->make_request_fn() of ZFS volume
https://github.com/zfsonlinux/zfs/issues/8379

Prevent this by returning bio with size 0 as success without entering
rangelock. This has been done for write bio after checking flusher
bio case (though not for the same reason), but not for read bio.
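
The shape of the guard can be sketched as follows. This is a toy model, not
the zvol request path: `submit_read_bio` and the `held_ranges` list are
hypothetical stand-ins for the real bio handling and rangelock.

```python
def submit_read_bio(offset, size, held_ranges):
    """Toy read path: empty reads complete without taking a range lock."""
    if size == 0:
        # Nothing to read, so report success and never touch the lock.
        return "completed"
    # Non-empty read: record the range as locked (stand-in for rangelock).
    held_ranges.append((offset, size))
    return "locked"


held = []
print(submit_read_bio(0, 0, held))     # zero-length read bypasses the lock
print(submit_read_bio(0, 4096, held))  # normal read takes the lock
```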

Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #8379 
Closes #8401
2019-02-20 10:14:36 -08:00
kpande 1e427f2e2b Add diffutils dependency for dkms build
The cmp and diff utilities are required at configure time.  Add
a dependency on diffutils to ensure they are installed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kash Pande <kash@tripleback.net>
Closes #5205 
Closes #8428
2019-02-20 10:04:05 -08:00
Serapheim Dimitropoulos 928e8ad47d Introduce auxiliary metaslab histograms
This patch introduces 3 new histograms per metaslab. These
histograms track segments that have made it to the metaslab's
space map histogram (and are part of the spacemap) but have
not yet reached the ms_allocatable tree on loaded metaslabs
because these metaslabs are currently syncing and haven't
gone through metaslab_sync_done() yet.

The histograms help when we decide whether to load an unloaded
metaslab in order to allocate from it. When calculating the
weight of an unloaded metaslab traditionally, we look at the
highest bucket of its spacemap's histogram.  The problem is
that we are not guaranteed to be able to allocate that
segment when we load the metaslab because it may still be in
the freeing, freed, or defer trees. The new histograms are
used when we try to calculate an unloaded metaslab's weight
to deal with this issue by removing segments that would
not be in the allocatable tree at runtime. Note, that this
method of dealing with this is not completely accurate as
adjacent segments are not always consolidated in the space
map histogram of a metaslab.
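
The idea can be sketched with plain lists standing in for the histograms.
The bucket layout, the helper names, and the clamp at zero are assumptions
made for illustration; the real weight calculation is more involved.

```python
def adjusted_histogram(space_map_hist, *aux_hists):
    """Subtract, bucket by bucket, segments that are not yet allocatable."""
    adjusted = []
    for i, count in enumerate(space_map_hist):
        pending = sum(hist[i] for hist in aux_hists)
        adjusted.append(max(count - pending, 0))
    return adjusted


def highest_bucket(hist):
    """Index of the largest non-empty bucket, or None if all are empty."""
    for i in range(len(hist) - 1, -1, -1):
        if hist[i] > 0:
            return i
    return None


space_map = [4, 2, 0, 1]   # segments per size bucket, as seen on disk
freeing = [0, 0, 0, 1]     # the one large segment is still being freed
freed = [1, 0, 0, 0]
defer = [0, 1, 0, 0]

usable = adjusted_histogram(space_map, freeing, freed, defer)
print(highest_bucket(space_map), highest_bucket(usable))
```

In this example the on-disk histogram suggests a segment in bucket 3 is
available, while the adjusted view shows bucket 1 is the best that could
actually be allocated after a load.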

In addition and to make things deterministic, we always reset
the weight of unloaded metaslabs based on their space map
weight (instead of doing that on a need basis). Thus, every
time a metaslab is loaded and its weight is reset again (from
the weight based on its space map to the one based on its
allocatable range tree) we expect (and assert) that this
change in weight can only get better if it doesn't stay the
same.

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8358
2019-02-20 09:59:56 -08:00
loli10K bb1be77a35 Prevent user accounting on readonly pool
Trying to mount a dataset from a readonly pool could inadvertently start
the user accounting upgrade task, leading to the following failure:

VERIFY3(tx->tx_threads == 2) failed (0 == 2)
PANIC at txg.c:680:txg_wait_synced()
Showing stack for process 2541
CPU: 2 PID: 2541 Comm: z_upgrade Tainted: P           O  3.16.0-4-amd64 #1 Debian 3.16.51-3
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
Call Trace:
 [<0>] ? dump_stack+0x5d/0x78
 [<0>] ? spl_panic+0xc9/0x110 [spl]
 [<0>] ? dnode_next_offset+0x1d4/0x2c0 [zfs]
 [<0>] ? dmu_object_next+0x77/0x130 [zfs]
 [<0>] ? dnode_rele_and_unlock+0x4d/0x120 [zfs]
 [<0>] ? txg_wait_synced+0x91/0x220 [zfs]
 [<0>] ? dmu_objset_id_quota_upgrade_cb+0x10f/0x140 [zfs]
 [<0>] ? dmu_objset_upgrade_task_cb+0xe3/0x170 [zfs]
 [<0>] ? taskq_thread+0x2cc/0x5d0 [spl]
 [<0>] ? wake_up_state+0x10/0x10
 [<0>] ? taskq_thread_should_stop.part.3+0x70/0x70 [spl]
 [<0>] ? kthread+0xbd/0xe0
 [<0>] ? kthread_create_on_node+0x180/0x180
 [<0>] ? ret_from_fork+0x58/0x90
 [<0>] ? kthread_create_on_node+0x180/0x180

This patch updates both functions responsible for checking if we can
perform user accounting to verify the pool is not readonly.

Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8424
2019-02-19 18:41:18 -08:00
Ned Bass 75d6b7ddca Add missing copyright notice to large_dnode tests
Missing copyright notices were noticed during the Illumos
RTI process. Add LLNS 2016 copyright based on original merge
date.

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #8435
2019-02-19 18:39:10 -08:00
Igor K 790c880e8c Fix zdb crash
We have to use umem_free() instead of free() if we are using
umem_zalloc()

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes #8402
2019-02-19 11:15:22 -08:00
John Wren Kennedy 435637d1ed ZTS: user_property_002_pos fails to destroy volume
During the cleanup function of this test, an attempt to destroy a volume
can fail because the volume is busy. This leaves the system with
unexpected datasets which in turn causes subsequent failures.

Reviewed-by: bunder2015 <omfgbunder@gmail.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Kennedy <john.kennedy@delphix.com>
Closes #8422
2019-02-19 11:12:47 -08:00
kpande 11f6127aba zfs mount man page should document legacy behaviour
Document legacy mount behavior.

Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: bunder2015 <omfgbunder@gmail.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Kash Pande <kash@tripleback.net>
Closes #2900
Closes #8414
2019-02-19 11:10:57 -08:00
Sara Hartse f545b6ae00 Delay injection can cause indefinitely hung zios
If we hit the (NSEC_TO_TICK(diff) == 0) condition in
zio_delay_interrupt, zio_interrupt is never called and the
zio does not progress.
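
The rounding problem can be sketched as below. The tick rate of 100 Hz, the
helper names, and the returned strings are hypothetical; the sketch only
shows that a sub-tick delay rounds down to zero ticks and therefore needs
its own branch.

```python
NSEC_PER_SEC = 1_000_000_000


def nsec_to_tick(nsec, hz=100):
    """Convert nanoseconds to whole scheduler ticks, rounding down."""
    return nsec * hz // NSEC_PER_SEC


def schedule_delayed_interrupt(diff_nsec, hz=100):
    """Decide how a delayed zio should be completed (toy model)."""
    ticks = nsec_to_tick(diff_nsec, hz)
    if ticks == 0:
        # Less than one tick remains: complete now instead of dropping it.
        return "interrupt_now"
    return f"timer({ticks})"


print(schedule_delayed_interrupt(5_000_000))    # 5 ms at 100 Hz is 0 ticks
print(schedule_delayed_interrupt(25_000_000))   # 25 ms at 100 Hz is 2 ticks
```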

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: sara hartse <sara.hartse@delphix.com>
Closes #8404
2019-02-15 14:44:56 -08:00
Don Brady a28c1a58fe ZFS mounted NFSv3 shares fail lock reclaims
ZFS NFS shares mounted on a client with NFSv3 and with open 
locks will fail to reclaim those locks after a server reboot. 

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #8398
2019-02-15 14:40:16 -08:00
John Wren Kennedy 07237a7bc1 ZTS: clone_001_pos fails in cleanup on busy dataset
The "cleanup_all" function in this test calls "zfs destroy" which
fails approximately 30% of the time in our environment due to the
dataset being busy. Since the failure happens during cleanup, the
error is propagated to subsequent tests.

Tested by running the snapshot test group in a loop without seeing
any failures.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Kennedy <john.kennedy@delphix.com>
Closes #8409
2019-02-15 12:45:46 -08:00
Tim Chase 638dd5f44e zio_deadman_impl() fix and enhancement
Add the zio_deadman_log_all tunable to print all zios in
zio_deadman_impl().  Also, in all cases, display the depth of the
zio relative to the original parent zio.  This is meant to be used by
developers to gain diagnostic information for hangs which don't involve
fully set-up zio trees or are otherwise stuck or hung in an early stage.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #8362
2019-02-15 12:44:24 -08:00
Paul Zuchowski 9c5e88b1de zfs should optionally send holds
Add -h switch to zfs send command to send dataset holds. If
holds are present in the stream, zfs receive will create them
on the target dataset, unless the zfs receive -h option is used
to skip receive of holds.

Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #7513
2019-02-15 12:41:38 -08:00
Tony Hutter e73ab1b38c Linux 4.20 compat: Fix VERIFY(RW_READ_HELD(&hash->mh_contents))
The 4.20 kernel changed the meaning of the rw_semaphore.owner bits,
causing an assertion when loading the module under the 4.20 kernel.
This patch fixes the issue.

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #8360 
Closes #8389
2019-02-15 12:37:20 -08:00
Tomohiro Kusumi 2d76ab9e42 Fix obsolete comment on rangelock
5d43cc9a59 renamed it to rangelock_enter().

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #8408
2019-02-14 15:13:58 -08:00
Igor K cf89a4ec9d zdb: replace label_t with zdb_label_t to reduce collisions
When building on illumos-based platforms we can hit a build failure
because label_t has been redefined.

To reduce build issues on other platforms, rename
label_t to zdb_label_t.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Igor Kozhukhov <igor@dilos.org>
Closes #8397
2019-02-13 11:28:36 -08:00
Alek P 65282ee9e0 Freeing throttle should account for holes
Deletion throttle currently does not account for holes in a file.
This means that it can activate when it shouldn't.
To fix it we switch the throttle to be based on the number of
L1 blocks we will have to dirty when freeing
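
A sketch of why counting L1 blocks handles holes better than counting the
byte range: only L1 indirect blocks that cover an allocated L0 block need
to be dirtied. The figure of 1024 block pointers per L1 block and both
helper names are assumptions for illustration.

```python
def l1_blocks_spanned(nblocks, ptrs_per_l1=1024):
    """L1 blocks needed to cover nblocks L0 blocks if the file had no holes."""
    return -(-nblocks // ptrs_per_l1)   # ceiling division


def l1_blocks_to_dirty(allocated_l0, ptrs_per_l1=1024):
    """Distinct L1 blocks that cover the L0 block ids actually allocated."""
    return len({blkid // ptrs_per_l1 for blkid in allocated_l0})


# A sparse file: two blocks at the start and one far into the file.
sparse = [0, 1, 5_000_000]
print(l1_blocks_spanned(5_000_001))   # estimate that ignores holes
print(l1_blocks_to_dirty(sparse))     # estimate that accounts for holes
```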

Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #7725 
Closes #7888
2019-02-12 12:01:08 -08:00
Alek P dcec0a12c8 port async unlinked drain from illumos-nexenta
This patch is an async implementation of the existing sync
zfs_unlinked_drain() function. This function is called at mount time and
is responsible for freeing znodes that we didn't get to freeing before.
We don't have to hold mounting of the dataset until the unlinked list is
fully drained as is done now. Since we can process the unlinked set
asynchronously this results in a better user experience when mounting a
dataset with entries in the unlinked set.

Reviewed by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #8142
2019-02-12 10:41:15 -08:00
Serapheim Dimitropoulos 425d3237ee Get rid of space_map_update() for ms_synced_length
Initially, metaslabs and space maps used to be the same thing
in ZFS. Later, we started differentiating them by referring
to the space map as the on-disk state of the metaslab, making
the metaslab a higher-level concept that is metadata that deals
with space accounting. Today we've managed to split that code
further, with the space map being its own on-disk data
structure used in areas of ZFS besides metaslabs (e.g. the
vdev-wide space maps used for zpool checkpoint or vdev removal
features).

This patch refactors the space map code to further split the
space map code from the metaslab code. It does so by getting
rid of the idea that the space map can have a different in-core
and on-disk length (sm_length vs smp_length) which is something
that is only used for the metaslab code, and other consumers
of space maps just have to deal with. Instead, this patch
introduces changes that move the old in-core length of the
metaslab's space map to the metaslab structure itself (see
ms_synced_length field) while making the space map code only
care about the actual space map's length on-disk.

The result of this is that space map consumers no longer have
to deal with syncing two different lengths for the same
structure (e.g. space_map_update() goes away) while metaslab
specific behavior stays within the metaslab code. Specifically,
the ms_synced_length field keeps track of the amount of data
metaslab_load() can read from the metaslab's space map while
working concurrently with metaslab_sync() that may be
appending to that same space map.

As a side note, the patch also adds a few comments around
the metaslab code documenting some assumptions and expected
behavior.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8328
2019-02-12 10:38:11 -08:00
loli10K d8d418ff0c ZVOLs should not be allowed to have children
zfs create, receive and rename can bypass this hierarchy rule. Update
both userland and kernel module to prevent this issue and use pyzfs
unit tests to exercise the ioctls directly.

Note: this commit slightly changes the zfs_ioc_create() ABI. This allows us to
differentiate a generic error (EINVAL) from the specific case where we
tried to create a dataset below a ZVOL (ZFS_ERR_WRONG_PARENT).

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
2019-02-08 15:44:15 -08:00
loli10K 4417096956 Pool allocation classes misplacing small file blocks
Due to an off-by-one condition in spa_preferred_class() we are picking
the "normal" allocation class instead of the "special" one for file
blocks with size equal to the special_small_blocks property value.
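
The boundary condition can be sketched as a single comparison. This models
only the file-block case described above; `preferred_class` is a
hypothetical name, and treating a value of 0 as "disabled" is an assumption.

```python
def preferred_class(block_size, special_small_blocks):
    """Pick an allocation class for a file block (toy model).

    Blocks up to and including special_small_blocks go to "special";
    using a strict less-than here would be the off-by-one described above.
    """
    if special_small_blocks and block_size <= special_small_blocks:
        return "special"
    return "normal"


print(preferred_class(32768, 32768))   # equal to the threshold
print(preferred_class(65536, 32768))   # larger than the threshold
```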

This change fixes the small code issue and updates the ZFS Test Suite and the
zfs(8) man page.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8351
Closes #8361
2019-02-08 12:32:12 -08:00
Tim Chase 0902c4577f Fix ARC stats for embedded blkptrs
Re-factor arc_read() to better account for embedded data blkptrs.
Previously, reading the payload from an embedded blkptr would cause
arcstats such as demand_metadata_misses to be bumped when there was
actually no cache "miss" because the data are already available in
the blkptr.

The following test procedure was used to demonstrate the problem:

   zpool create tank ...
   zfs create -o compression=lz4 tank/fs
   echo blah > /tank/fs/blah
   stat /tank/fs/blah
   grep 'meta.*mis' /proc/spl/kstat/zfs/arcstats

and repeating the last two steps to watch the metadata miss counter
increment.  This can also be demonstrated via the zfs_arc_miss DTRACE4
probe in arc_read().

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #8319
2019-02-04 09:33:30 -08:00
Ahmed Ghanem 9634299657 OpenZFS 9185 - Enable testing over NFS in ZFS performance tests
This change makes additions to the ZFS test suite that allow the
performance tests to run over NFS. The test is run and performance data
collected from the server side, while IO is generated on the NFS client.

This has been tested with Linux and illumos NFS clients.

Authored by: Ahmed Ghanem <ahmedg@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Kevin Greene <kevin.greene@delphix.com>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: John Kennedy <john.kennedy@delphix.com>

OpenZFS-issue: https://www.illumos.org/issues/9185
Closes #8367
2019-02-04 09:27:37 -08:00
loli10K 1a745ef62e zstreamdump: -d option is not documented in manpage
This change simply documents the missing -d (dump contents) option in
zstreamdump(8).

Reviewed-by: bunder2015 <omfgbunder@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8369
2019-02-04 09:13:00 -08:00
bunder2015 bf6ca0a631 shellcheck pass
note: which is non-standard. Use builtin 'command -v' instead. [SC2230]
note: Use -n instead of ! -z. [SC2236]

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #8367
2019-02-04 09:07:19 -08:00
bunder2015 cca14128c9 flake8 pass
F632 use ==/!= to compare str, bytes, and int literals
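
As a minimal illustration of what F632 asks for (the function name and the
pool name below are hypothetical, not taken from the changed scripts),
literals should be compared by value rather than by identity:

```python
def is_default_pool(name):
    # Compare by value. Writing `name is "tank"` would test object identity
    # instead, which is the pattern flake8 reports as F632.
    return name == "tank"
```

Identity comparison against a literal may happen to work when the
interpreter reuses the same object, which is why the mistake is easy to miss.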

Reviewed-by: Håkan Johansson <f96hajo@chalmers.se>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #8368
2019-02-04 09:02:46 -08:00
Tony Hutter 57dc41de96 Fix zpool iostat -w header names
The zpool iostat latency histograms (-w) have column names
'sync_queue' and 'async_queue', which do not match the man page, nor
the equivalent columns in average latency.  Change the column
names to be 'syncq_wait' and 'asyncq_wait' to be consistent.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #8338
2019-01-31 10:51:18 -08:00
Serapheim Dimitropoulos 6c926f426a Simplify log vdev removal code
Get rid of the majority of the metaslab metadata when removing log vdevs
in spa_vdev_remove_log() with a call to metaslab_fini() instead
of duplicating a lot of that in vdev_remove_empty_log().

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8347
2019-01-31 09:17:52 -08:00
Serapheim Dimitropoulos 7558997d2f vs_alloc can underflow in L2ARC vdevs
The current L2 ARC device code consistently uses psize to
increment vs_alloc but varies between psize and lsize when
decrementing it. The result of this behavior is that
vs_alloc can be decremented more than it is incremented
and underflow. This patch changes the code so asize is
used everywhere.

In addition, it ensures that vs_alloc gets incremented by
the L2 ARC device code as buffers are written and not at
the end of the l2arc_write_buffers() routine. The latter
(and old) way would temporarily underflow vs_alloc as
buffers that were just written would be destroyed while
l2arc_write_buffers() was still looping.
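
The underflow can be modeled in a few lines of Python. This is only a sketch
under the assumption that vs_alloc behaves as an unsigned 64-bit counter;
the class and function names are hypothetical and the sizes are made up.

```python
U64_MASK = (1 << 64) - 1  # assumed width of the counter


class VdevStat:
    """Toy stand-in for the statistic being tracked."""

    def __init__(self):
        self.vs_alloc = 0

    def adjust(self, delta):
        # Wrap like unsigned 64-bit arithmetic instead of going negative.
        self.vs_alloc = (self.vs_alloc + delta) & U64_MASK


def write_then_evict(inc_size, dec_size):
    """Account a buffer with inc_size on write and dec_size on eviction."""
    stat = VdevStat()
    stat.adjust(inc_size)
    stat.adjust(-dec_size)
    return stat.vs_alloc
```

When the same size is used on both sides the counter returns to zero; when
the decrement is larger than the increment it wraps to a huge value.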

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8298
2019-01-31 09:16:39 -08:00
Sara Hartse 2747f599ff Don't acquire zthr_request_lock in zthr_wakeup
Address a deadlock caused by simultaneous wakeup and cancel on a zthr
by removing the hold of zthr_request_lock from zthr_wakeup. This
allows zthr_wakeup to not block a thread that is in the process of
being cancelled.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Sara Hartse <sara.hartse@delphix.com>
Closes #8333
2019-01-30 12:31:16 -08:00
Serapheim Dimitropoulos 21e7cf5da8 zdb -L should skip leak detection altogether
Currently the point of the -L option in zdb is to disable leak
tracing and the loading of space maps because they are expensive,
yet still do leak detection in terms of space. Unfortunately,
there is a scenario where this is a lie. If we are using zdb -L
on a pool where a vdev is being removed, zdb_claim_removing()
will open the metaslab space maps of that device.

This patch makes it so zdb -L skips leak detection altogether
and ensures that no space maps are loaded.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8335
2019-01-30 09:54:27 -08:00
Tony Hutter 466f55334a Exclude test-runner.py from the rpmbuild shebang check
Exclude test-runner.py from the rpmbuild shebang check to allow it to
run under Python 2 and 3.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #8331
2019-01-28 10:11:45 -08:00
Tony Hutter caacc6e4c4 GCC 9.0: Fix ztest "directive argument is not a nul-terminated string"
GCC 9.0 is complaining because we're trying to print strings that
are defined like this:

.zo_pool = { 'z', 't', 'e', 's', 't', '\0' },

Fix them by making them actual strings.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #8330
2019-01-28 10:11:45 -08:00
Brian Behlendorf 26a856594f Linux 5.0 compat: Fix bio_set_dev()
The Linux 5.0 kernel updated the bio_set_dev() macro so it calls the
GPL-only bio_associate_blkg() symbol thus inadvertently converting
the entire macro to GPL-only.  Provide a minimal version which always assigns the
request queue's root_blkg to the bio.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8287
2019-01-28 10:11:45 -08:00
Tony Hutter 0c593296e9 Linux 5.0 compat: Disable vector instructions on 5.0+ kernels
The 5.0 kernel no longer exports the functions we need to do vector
(SSE/SSE2/SSE3/AVX...) instructions.  Disable vector-based checksum
algorithms when building against those kernels.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #8259
2019-01-28 10:11:45 -08:00
Tony Hutter ed158b19b1 Linux 5.0 compat: Fix SUBDIRs
SUBDIRs has been deprecated for a long time, and was finally removed in
the 5.0 kernel.  Use "M=" instead.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #8257
2019-01-28 10:11:45 -08:00
Tony Hutter 05805494dd Linux 5.0 compat: Convert MS_* macros to SB_*
In the 5.0 kernel, only the mount namespace code should use the MS_*
macros. Filesystems should use the SB_* ones.

https://patchwork.kernel.org/patch/10552493/

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #8264
2019-01-28 10:11:39 -08:00
Tony Hutter 031cea17a3 Linux 5.0 compat: Use totalram_pages()
totalram_pages was converted to an atomic variable in 5.0:

https://patchwork.kernel.org/patch/10652795/

Its value should now be read through the totalram_pages() helper
function.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #8263
2019-01-28 10:11:14 -08:00
Tony Hutter 77e50c3070 Linux 5.0 compat: access_ok() drops 'type' parameter
access_ok no longer needs a 'type' parameter in the 5.0 kernel.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #8261
2019-01-28 10:11:10 -08:00
Tony Hutter 5cb46f6a66 Linux 4.18 compat: Use ktime_get_coarse_real_ts64()
Newer kernels remove current_kernel_time64().  Use
ktime_get_coarse_real_ts64() in its place.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #8258
2019-01-28 10:11:03 -08:00
Serapheim Dimitropoulos c853f382db Change target size of metaslabs from 256GB to 16GB
= Old behavior

For vdev sizes 100GB to 50TB we keep ~200 metaslabs per
vdev and the metaslab size grows from 512MB to 256GB.
For vdevs bigger than that we start increasing the
number of metaslabs until we hit the 128K limit.

= New Behavior

For vdev sizes 100GB to 3TB we keep ~200 metaslabs per
vdev and the metaslab size grows from 512MB to 16GB.
For vdevs bigger than that we start increasing the
number of metaslabs until we hit the 128K limit.

= Reasoning

The old behavior makes metaslabs grow in size when
the vdev range is between 3TB (ms_size 16GB) and
32PB (ms_size 256GB). Even though keeping the number
of metaslabs is good in terms of potential number of
I/Os per TXG, these bigger metaslabs take longer
to be loaded and after they are loaded they can
take up a lot of memory because of their range trees.

This change tries to put a boundary in memory and
loading time for the specific range of vdev sizes.
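
The sizing rule described above can be sketched as follows. This is a
simplified model built only from the numbers in this message (about 200
metaslabs, a 512MB floor, a 16GB or 256GB ceiling, and the 128K count
limit); the function name is hypothetical, and the real code may round
sizes differently and treat small vdevs specially.

```python
MIB = 1 << 20
GIB = 1 << 30
TIB = 1 << 40

TARGET_COUNT = 200          # "~200 metaslabs per vdev"
MIN_MS_SIZE = 512 * MIB     # smallest metaslab size mentioned above
NEW_MAX_MS_SIZE = 16 * GIB  # new target ceiling
OLD_MAX_MS_SIZE = 256 * GIB # old target ceiling
COUNT_LIMIT = 128 * 1024    # "the 128K limit"


def metaslab_layout(vdev_size, max_ms_size=NEW_MAX_MS_SIZE):
    """Return (metaslab_size, metaslab_count) for a vdev of vdev_size bytes."""
    # Aim for the target count, then clamp the size to the allowed range.
    ms_size = vdev_size // TARGET_COUNT
    ms_size = max(MIN_MS_SIZE, min(ms_size, max_ms_size))
    count = vdev_size // ms_size
    if count > COUNT_LIMIT:
        # Past the count limit the metaslab size has to grow again.
        ms_size = -(-vdev_size // COUNT_LIMIT)  # ceiling division
        count = vdev_size // ms_size
    return ms_size, count
```

In this model a 10TB vdev gets 640 metaslabs of 16GB with the new ceiling,
while the old ceiling keeps it at 200 larger metaslabs.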

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8324
2019-01-25 16:38:27 -08:00
Serapheim Dimitropoulos df72b8bebe Rename range_tree_verify to range_tree_verify_not_present
The range_tree_verify function looks for a segment in a
range tree and panics if the segment is present on the
tree. This patch gives the function a more descriptive
name.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8327
2019-01-25 09:51:24 -08:00
Tim Chase 107dd2b174 Use proper tag for spa config refcounts in mmp_write_uberblock()
This allows the spa config refcounts to use tracking in debug builds
without triggering the "No such hold %p on refcount" panic.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #8326
2019-01-25 09:50:06 -08:00
loli10K 7646af20ad zfs userspace dumps core when used on ZVOLs
If you try to get the userspace, groupspace or projectspace on a ZVOL,
the generated error results in passing EINVAL to
zfs_standard_error_fmt() when we should return a specific error to
inform the user that those properties aren't available on volumes.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8279
2019-01-25 09:47:52 -08:00
Damian Wojsław 8fccfa8e17 zpool iostat should print headers when terminal fills
When `zpool iostat` fills the terminal the headers should be
printed again.  `zpool iostat -n` can be used to suppress this.

If the command is not attached to a tty, headers will not be
printed so as to not break existing scripts.

Reviewed-by: Joshua M. Clulow <josh@sysmgr.org>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Damian Wojsław <damian@wojslaw.pl>
Closes #8235
Closes #8262
2019-01-23 13:29:49 -08:00
Tom Caputi b5d693581d Fix bad kmem_free() in zvol_rename_minors_impl()
Currently, zvol_rename_minors_impl() calls kmem_asprintf()
to allocate and initialize a string. This function is a thin
wrapper around the kernel's kvasprintf() and does not call
into the SPL's kmem tracking code when it is enabled. However,
this function frees the string with the tracked kmem_free()
instead of the untracked strfree(), which causes the SPL
kmem tracking code to believe that the function is attempting
to free memory it never allocated, triggering an ASSERT. This
patch simply corrects this issue.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8307
2019-01-23 11:38:05 -08:00
loli10K 0a10863194 ztest: creates partially initialized root dataset
Since d8fdfc2 was integrated dsl_pool_create() does not call
dmu_objset_create_impl() for the root dataset when running in
userland (ztest): this creates a pool with a partially initialized
root dataset. Trying to import and use this pool results in both
zpool and zfs executables dumping core.

Fix this by adopting an alternative change suggested in OpenZFS 8607
code review.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Original-patch-by: Robert Mustacchi <rm@joyent.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8277
2019-01-18 11:14:01 -08:00
Brian Behlendorf ad63507135 Remove zfs_sync() panicking kernel check
This check provides no real additional protection and unnecessarily
introduces a dependency on the "oops_in_progress" kernel symbol.
Remove the check; if there are special circumstances on other
platforms which make this a requirement it can be reintroduced
for all relevant call paths in a more portable comprehensive manner.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8297
2019-01-18 11:11:47 -08:00
Serapheim Dimitropoulos b194fab0fb Factor metaslab_load_wait() in metaslab_load()
Most callers that need to operate on a loaded metaslab, always
call metaslab_load_wait() before loading the metaslab just in
case someone else is already doing the work.

Factoring metaslab_load_wait() within metaslab_load() makes the
latter more robust, as callers won't have to do the load-wait
check explicitly every time they need to load a metaslab.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8290
2019-01-18 11:10:32 -08:00
Tom Caputi 960347d3a6 Fix 0 byte memory leak in zfs receive
Currently, when a DRR_OBJECT record is read into memory in
receive_read_record(), memory is allocated for the bonus buffer.
However, if the object doesn't have a bonus buffer the code will
still "allocate" the zero bytes, but the memory will not be passed
to the processing thread for cleanup later. This causes the spl
kmem tracking code to report a leak. This patch simply changes the
code so that it only allocates this memory if it has a non-zero
length.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8266
2019-01-18 11:06:48 -08:00
Serapheim Dimitropoulos 1a759200e5 Document guidelines for usage of zfs_dbgmsg
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8299
2019-01-18 10:16:56 -08:00
Neal Gompa (ニール・ゴンパ) e45c1734a6 dkms: Enable debuginfo option to be set with zfs sysconfig file
On some Linux distributions, the kernel module build will not
default to building with debuginfo symbols, which can make it
difficult for debugging and testing.

For this case, we provide a flag to override the build to force
debuginfo to be produced for the kernel module build.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Neal Gompa <ngompa@datto.com>
Co-authored-by: Simon Watson <swatson@datto.com>
Signed-off-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Simon Watson <swatson@datto.com>
Closes #8304
2019-01-18 10:10:24 -08:00
loli10K 60b0a963f5 Off-by-one in zap_leaf_array_create()
Trying to set user properties with their length 1 byte shorter than the
maximum size triggers an assertion failure in zap_leaf_array_create():

  panic[cpu0]/thread=ffffff000a092c40:
  assertion failed: num_integers * integer_size < (8<<10) (0x2000 < 0x2000), file: ../../common/fs/zfs/zap_leaf.c, line: 233

  ffffff000a092500 genunix:process_type+167c35 ()
  ffffff000a0925a0 zfs:zap_leaf_array_create+1d2 ()
  ffffff000a092650 zfs:zap_entry_create+1be ()
  ffffff000a092720 zfs:fzap_update+ed ()
  ffffff000a0927d0 zfs:zap_update+1a5 ()
  ffffff000a0928d0 zfs:dsl_prop_set_sync_impl+5c6 ()
  ffffff000a092970 zfs:dsl_props_set_sync_impl+fc ()
  ffffff000a0929b0 zfs:dsl_props_set_sync+79 ()
  ffffff000a0929f0 zfs:dsl_sync_task_sync+10a ()
  ffffff000a092a80 zfs:dsl_pool_sync+3a3 ()
  ffffff000a092b50 zfs:spa_sync+4e6 ()
  ffffff000a092c20 zfs:txg_sync_thread+297 ()
  ffffff000a092c30 unix:thread_start+8 ()

This patch simply corrects the assertion.
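
The failing comparison from the panic above can be reproduced with plain
arithmetic. The sketch below assumes the correction makes the bound
inclusive, which this message does not state; the constant and function
names are hypothetical.

```python
MAX_ARRAY_BYTES = 8 << 10  # the (8<<10) bound from the assertion, i.e. 0x2000


def fits_strict(num_integers, integer_size):
    # Shape of the original assertion: rejects a value of exactly 0x2000.
    return num_integers * integer_size < MAX_ARRAY_BYTES


def fits_inclusive(num_integers, integer_size):
    # Assumed shape of the corrected assertion: the boundary is allowed.
    return num_integers * integer_size <= MAX_ARRAY_BYTES
```

An array of exactly 0x2000 bytes passes only the inclusive check, matching
the "0x2000 < 0x2000" failure shown in the panic message.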

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8278
2019-01-18 09:58:46 -08:00
Serapheim Dimitropoulos 8dc2197b7b Simplify spa_sync by breaking it up to smaller functions
The point of this refactoring is to break the high-level conceptual
steps of spa_sync() into their own helper functions. In general large
functions can enhance readability if structured well, but in this
case the number of conceptual steps taken could use the help of
helper functions.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8293
2019-01-18 09:50:16 -08:00
Brian Behlendorf ce5fb2a7c6 ztest: scrub verification
By design ztest will never inject non-repairable damage in to the
pool.  Update the ztest_scrub() test case such that it waits for
the scrub to complete and verifies the pool is always repairable.

After enabling scrub verification two scenarios were encountered
which are the result of how ztest manages failure injection.

The first case is straightforward and pertains to detaching a
mirror vdev.  In this case, the pool must always be scrubbed prior
to the detach.  Failure to do so can potentially lock in previously
repairable data corruption by removing all good copies of a block
leaving only damaged ones.

The second is a little more subtle.  The child/offset selection
logic in ztest_fault_inject() depends on the calculated number of
leaves always remaining constant between injection passes.  This
is true within a single execution of ztest, but when using zloop.sh
random values are selected for each restart.  Therefore, when ztest
imports an existing pool it must be scrubbed before failure injection
can be safely enabled.  Otherwise it is possible that it will inject
non-repairable damage.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8269
2019-01-18 09:47:55 -08:00
Tom Caputi 305781da4b Fix error handling in callers of dbuf_hold_level()
Currently, the functions dbuf_prefetch_indirect_done() and
dmu_assign_arcbuf_by_dnode() assume that dbuf_hold_level() cannot
fail. In the event of an error the former will cause a NULL pointer
dereference and the latter will trigger a VERIFY. This patch adds
error handling to these functions and their callers where necessary.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8291
2019-01-17 15:47:08 -08:00
Serapheim Dimitropoulos 75058f3303 Remove unused vdev_t fields
The following fields from the vdev_t struct are not used anywhere.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8285
2019-01-17 15:41:12 -08:00
Brian Behlendorf 52b684236d ztest: scrub ddt repair
The ztest_ddt_repair() test is designed to inflict damage to the
ddt which can be repairable by a scrub.  Unfortunately, this
repair logic was broken at some point and it went undetected.
This issue is not specific to ztest, but thankfully this extra
redundancy is rarely enabled and even more rarely needed.

The root cause was identified to be the ddt_bp_create()
function called by dsl_scan_ddt_entry() which did not set the
dedup bit of the generated block pointer.

The consequence of this was that the ZIO_DDT_READ_PIPELINE was
never enabled for the block pointer during the scrub, and the
dedup ditto repair logic was never run.  Note that for demand
reads which don't rely on ddt_bp_create() the required pipeline
stages would be enabled and the repair performed.

This was resolved by unconditionally setting the dedup bit in
ddt_bp_create().  This way all code paths which may need to
perform a repair from a block pointer generated from the ddt
entry will be able to.  The only exception is that the dedup
bit is cleared in ddt_phys_free() which is required to avoid
leaking space.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8270
2019-01-17 15:25:00 -08:00
Serapheim Dimitropoulos 419ba59145 Update vdev_is_spacemap_addressable() for new spacemap encoding
Since the new spacemap encoding was ported to ZoL that's no longer
a limitation. This patch updates vdev_is_spacemap_addressable()
that was performing that check.

It also updates the appropriate test to ensure that the same
functionality is tested.  The test does so by creating pools that
don't have the new spacemap encoding enabled - just the checkpoint
feature. This patch also reorganizes that same test in order to
cut its memory consumption in half.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8286
2019-01-16 15:06:20 -08:00
Brian Behlendorf 64bdf63f5c ztest: split block reconstruction
Increase the default allowed number of reconstruction attempts.
There's not an exact right number for this setting.  It needs
to be set large enough to cover any realistic failure scenarios
and small enough to avoid stalling the IO pipeline and invoking
the dead man detection.

The current value of 256 was empirically determined to be too
low based on multi-day runs of ztest.  The fault injection code
would inject more damage than could be reconstructed given the
relatively small number of attempts.  However, in all observed
cases the block could be reconstructed using a slightly higher
limit.

Based on local testing increasing the default value to 4096 was
determined to strike the best balance.  Checking all combinations
takes less than 10s in the worst case, and has so far eliminated
the vast majority of false positives detected by ztest.  This
delay is roughly on par with how long retries may be performed
to a misbehaving HDD and was deemed to be reasonable.  Better to
err on the side of a brief delay rather than fail to reconstruct
the data.

Lastly, the -Y flag has been added to zdb to make it easy to try all
possible combinations when performing split block reconstruction.
For badly damaged blocks with 18 splits, they can be fully enumerated
within a few minutes.  This has been done to ensure permanent errors
are never incorrectly reported when ztest verifies the pool with zdb.
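
To give a feel for why the attempt limit matters, the sketch below
enumerates candidate combinations for a split block, assuming each split
segment has a known number of candidate copies. The function name and the
per-split copy counts are hypothetical, and simply truncating at the limit
is a simplification rather than a description of how ZFS chooses attempts.

```python
from itertools import islice, product


def split_combinations(copies_per_split, max_attempts=4096):
    """Return (total_combinations, combinations_tried_within_the_limit)."""
    total = 1
    for n in copies_per_split:
        total *= n
    # Lazily enumerate one copy index per split, stopping at the limit.
    combos = product(*(range(n) for n in copies_per_split))
    return total, list(islice(combos, max_attempts))
```

The total grows multiplicatively with the number of splits, so a block
with many splits can exceed any fixed limit unless every combination is
tried explicitly.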

Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8271
2019-01-16 14:10:02 -08:00
Serapheim Dimitropoulos db587941c5 Make zdb results for checkpoint tests consistent
This patch exports and re-imports the pool when these tests are
analyzed with zdb to get consistent results.

Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8292
2019-01-16 10:41:47 -08:00
Brian Behlendorf 6e91a72fe3 Disable 'zfs remap' command
The implementation of 'zfs remap' has proven to be problematic since
it modifies the objset (but not its logical contents) by dirtying
metadata without owning it.  The consequence of which is that
dmu_objset_remap_indirects() is vulnerable to certain races.

For example, if we are in the middle of receiving into the filesystem
while it is being remapped.  Then it is possible we could evict the
objset when the receive completes (see dsl_dataset_clone_swap_sync_impl,
or dmu_recv_end_sync), but dmu_objset_remap_indirects() may be still
using the objset.  The result of which would be a panic.

Extended runs of ztest(8) have exposed other possible races which
can occur when using 'zfs remap'.  Several of these have been fixed
but there may be others which have not yet been encountered and
diagnosed.

Furthermore, the ability to manually remap a filesystem is no longer
particularly useful now that the removal code can map large chunks.
In addition, explaining what this command does and why it may be
useful requires a detailed understanding of the internals of device
removal.  These are details users should not be bothered with.

Therefore, the 'zfs remap' command is being disabled but not entirely
removed.  It may be removed in the future or potentially reworked
to address the issues described above.  Since 'zfs remap' has never
been part of a tagged release its removal is expected to have
minimal impact.

The ZTS tests have been updated to continue to exercise the command
to prevent atrophy, but it has been removed entirely from ztest(8).

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8238
2019-01-15 15:46:58 -08:00
Tom Caputi 5e7f3ace58 Fix zio leak in dbuf_read()
Currently, dbuf_read() may decide to create a zio_root which is
used as a parent for any child zios created in dbuf_read_impl().
However, if there is an error in dbuf_read_impl(), this zio is
never executed and ends up leaked. This patch simply ensures
that we always execute the root zio, even if it has no real work
to do.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8267
2019-01-15 12:23:40 -08:00
loli10K 7b02fae7a6 Verify .gitignore entries
This change adds a make target 'vcscheck' which scans the git workspace
for new, untracked files missing from the .gitignore configuration; this
is done to help prevent adding unwanted build artifacts to the source
tree during development.
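
The kind of check described above can be approximated by looking for
untracked entries in `git status --porcelain` output, where untracked paths
that are not covered by .gitignore are listed with a `??` prefix. The helper
below is a hypothetical illustration, not the make target itself.

```python
def untracked_paths(porcelain_output):
    """Extract untracked paths from `git status --porcelain` output."""
    return [
        line[3:]
        for line in porcelain_output.splitlines()
        if line.startswith("?? ")
    ]
```

In a real checkout the input would come from running the git command; a
non-empty result means a file is missing from the .gitignore configuration.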

Reviewed-by: Neal Gompa <ngompa@datto.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8281
2019-01-15 11:56:29 -08:00
Brian Behlendorf 9b626c126e Tag 0.8.0-rc3
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2019-01-14 12:40:42 -08:00
Brian Behlendorf d611989fdc Minor spelling corrections
Some minor spelling mistakes and typos.  No functional changes.

Reviewed-by: Neal Gompa <ngompa@datto.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: bunder2015 <omfgbunder@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8272
2019-01-13 10:11:52 -08:00
Serapheim Dimitropoulos 61c3391acc Serialize ZTHR operations to eliminate races
Adds a new lock for serializing operations on zthrs.
The commit also includes some code cleanup and
refactoring.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8229
2019-01-13 10:09:46 -08:00
Paul Zuchowski 83c796c5e9 zfs filesystem skipped by df -h
On a full pool where the pool root filesystem references very few bytes,
the f_blocks value returned to statvfs is 0 but should be at least 1.

Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #8253
Closes #8254
2019-01-13 10:06:13 -08:00
Tom Caputi a13392a6a1 Add contrib/pyzfs/setup.py to .gitignore
As of 9ef798b77, setup.py is now generated from setup.py.in, but
this file was never moved to the .gitignore. This patch simply
corrects this issue.

Reviewed-by: Neal Gompa <ngompa@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8268
2019-01-13 10:04:38 -08:00
Brian Behlendorf e34cd80d79 Add pyzfs BuildRequires for mock(1)
When building pyzfs under mock the python-cffi and python-setuptools
packages need to be installed and have been added to the BuildRequires.

Reviewed-by: Neal Gompa <ngompa@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8265
2019-01-13 10:02:33 -08:00
Brian Behlendorf 99b0b5bc3f ZTS: zpool_resilver_restart
Since the vdev initialize feature was integrated the ZTS
zpool_resilver_restart test has been hitting its internal
timeout more frequently.  This happens most often on
the coverage builder but not exclusively.  Increasing the
timeout for this test case prevents any false positives.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8273
2019-01-13 10:01:31 -08:00
Brian Behlendorf 6955b40138 Provide more flexible object allocation interface
Object allocation performance can be improved for complex operations
by providing an interface which returns the newly allocated dnode.
This allows the caller to immediately use the dnode without incurring
the expense of looking up the dnode by object number.

The functions dmu_object_alloc_hold(), zap_create_hold(), and
dmu_bonus_hold_by_dnode() were added for this purpose.

The zap_create_* functions have been updated to take advantage of
this new functionality.  The dmu_bonus_hold_impl() function should
really have never been included in sys/dmu.h and was removed.
Its sole caller was converted to use dmu_bonus_hold_by_dnode().

The new symbols have been exported for use by Lustre.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8015
2019-01-10 14:37:43 -08:00
Tom Caputi 58769a4ebd Don't allow dnode allocation if dn_holds != 0
This patch simply fixes a small bug where dnode_hold_impl() could
attempt to allocate a dnode that was in the process of being freed,
but which still had active references. This patch simply adds the
required check.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8249
2019-01-10 14:36:23 -08:00
Gregor Kopka 8bd2a2866c Removed suggestion to use root dataset as bootfs
The dracut howto proposed to boot from the root dataset of a pool.
Apart from this giving problems when booting (as the code seems to
expect a child dataset and creates an illegal dataset name when using
the root dataset) the technical limitations of the root dataset
(among others the inability to rename or destroy through the `zfs`
command) resulted in the general consensus to only use it as a
container for the datasets in the pool - not as a filesystem itself.

Removed the idea to boot from the root dataset.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: bunder2015 <omfgbunder@gmail.com>
Signed-off-by: Gregor Kopka <gregor@kopka.net>
Closes #8247
2019-01-08 16:15:30 -08:00
Neal Gompa (ニール・ゴンパ) 9ef798b771 Use ZFS version for pyzfs & drop unused reqs file
Now that 'pyzfs' is part of the ZFS codebase, it should be
versioned the same as the rest of the source tree. This eliminates
confusion on what version of the bindings are being used, especially
for dependent Python projects that may use the Python dist metadata
to identify compatible versions of pyzfs to work from.

In addition, a trivial change to drop the unused requirements.txt
file is included, simply because it's unused and a leftover from
before it was imported into the ZFS codebase and wired into the
autotools build scripts.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Neal Gompa <ngompa@datto.com>
Closes #8243
2019-01-08 15:56:42 -08:00
loli10K 0f5f23869a zfs receive and rollback can skew filesystem_count
This commit fixes a small issue which causes both zfs receive and
rollback operations to incorrectly increase the "filesystem_count"
property value.

This change also adds a new test group "limits" to the ZFS Test Suite
to exercise both filesystem_count/limit and snapshot_count/limit
functionality.

Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8232
2019-01-08 10:17:46 -08:00
asomers f384c045d8 OpenZFS 8473 - scrub does not detect errors on active spares
Scrubbing is supposed to detect and repair all errors in the pool.
However, it wrongly ignores active spare devices. The problem can
easily be reproduced in OpenZFS at git rev 0ef125d with these
commands:

    truncate -s 64m /tmp/a /tmp/b /tmp/c
    sudo zpool create testpool mirror /tmp/a /tmp/b spare /tmp/c
    sudo zpool replace testpool /tmp/a /tmp/c
    /bin/dd if=/dev/zero bs=1024k count=63 oseek=1 conv=notrunc of=/tmp/c
    sync
    sudo zpool scrub testpool
    zpool status testpool # Will show 0 errors, which is wrong
    sudo zpool offline testpool /tmp/a
    sudo zpool scrub testpool
    zpool status testpool # Will show errors on /tmp/c,
                          # which should've already been fixed

FreeBSD head is partially affected: the first scrub will detect
some errors, but the second scrub will detect more.  This same
test was run on Linux before applying the fix and the FreeBSD
head behavior was observed.

Authored by: asomers <asomers@FreeBSD.org>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Sponsored by: Spectra Logic Corp

OpenZFS-issue: https://www.illumos.org/issues/8473
FreeBSD-commit: https://github.com/freebsd/freebsd/commit/e20ec8879
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/554675ee
Closes #8251
2019-01-08 09:51:30 -08:00
Neal Gompa (ニール・ゴンパ) 53b5fcd365 Include third party licenses in dist tarballs
Since the merge of the Linux Solaris Porting Layer source tree into
the ZFS codebase, ZFS is now a double-licensed codebase, with the
former SPL codebase retaining its license (GPLv2+) within the ZFS
source tree.

However, the license files for SPL were not being included in the
tarballs generated by autotools. This change corrects that.

In addition, all the other third party licenses in the codebase are
now properly declared to be included in the dist tarballs.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Neal Gompa <ngompa@datto.com>
Closes #8242
2019-01-08 09:29:34 -08:00
Tony Hutter 21e000ad3f Fix missing dkms modules after upgrades
If you were upgrading from, say, fc28->fc29, on ZFS version X, the RPM
macros would get called like this:

%post X.fc29
   - This is the step where fc29 gets built by dkms.
     As part of the build, dkms automatically removes the previous
     modules before building the new ones.  It then builds the new
     modules.
%preun X.fc28
   - Right before this step, X.fc29 is built and installed, but
     since it has the same X, its files get inadvertently removed
     by fc28's uninstall.
%postun X.fc28

This patch updates %preun X.fc28 to see if we're upgrading or
uninstalling.  If we're uninstalling, then remove our files. If we're
upgrading then do nothing, since we know dkms will have already
removed our files in %post X.fc29.

Note that since this fixes the %preun step, its effect isn't going
to be noticed immediately.  It will only be seen when packages
with this fix are upgraded to a newer version.

Reviewed-by: Ralf Ertzinger <ralf@skytale.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #6902
Closes #8216
2019-01-08 09:26:45 -08:00
Neal Gompa (ニール・ゴンパ) 4efb48eecb Bump commit subject length to 72 characters
There's not really a reason to keep the subject length so short,
since the reason to make it this short was for making nice renders
of a summary list of the git log. With 72 characters, this still
works out fine, so let's just raise it to that so that it's easier
to give slightly more descriptive change summaries.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Neal Gompa <ngompa@datto.com>
Closes #8250
2019-01-08 09:23:05 -08:00
Benjamin Gentil 22448f0894 zfs.8 uses wrong snapshot names in Example 15
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: bunder2015 <omfgbunder@gmail.com>
Signed-off-by: Benjamin Gentil <benjamin@gentil.io>
Closes #8241
2019-01-07 11:08:54 -08:00
Brian Behlendorf a769fb53a1 Add 'zpool status -i' option
Only display the full details of the vdev initialization state
in 'zpool status' output when requested with the -i option.
By default display '(initializing)' after vdevs when they are
being actively initialized.  This is consistent with the
established precedent of appending '(resilvering), etc' and
fits within the default 80 column terminal width making it
easy to read.

Additionally, updated the 'zpool initialize' documentation to
make it clear the options are mutually exclusive, but allow
duplicate options like all other zfs/zpool commands.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8230
2019-01-07 11:03:18 -08:00
George Wilson c10d37dd9f zfs initialize performance enhancements
PROBLEM
========

When invoking "zpool initialize" on a pool the command will
create a thread to initialize each disk. Unfortunately, it does
this serially across many transaction groups which can result
in commands taking a long time to return to the user and may
appear hung. The same thing is true when trying to suspend/cancel
the operation.

SOLUTION
=========

This change refactors the way we invoke the initialize interface
to ensure we can start or stop the initialization in just a few
transaction groups.

When stopping or cancelling a vdev initialization perform it
in two phases.  First signal each vdev initialization thread
that it should exit, then after all threads have been signaled
wait for them to exit.

On a pool with 40 leaf vdevs this reduces the vdev initialize
stop/cancel time from ~10 minutes to under a second.  The reason
for this is spa_vdev_initialize() no longer needs to wait on
multiple full TXGs per leaf vdev being stopped.
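
The two-phase stop described above can be sketched as follows. This
is only an illustrative model in Python, not the kernel code: the
InitWorker class, its exit_wanted event, and stop_all() are
hypothetical names standing in for the per-vdev initialization
threads and the stop path.

```python
import threading
import time


class InitWorker:
    """Hypothetical stand-in for one per-vdev initialization thread."""

    def __init__(self, name):
        self.name = name
        self.exit_wanted = threading.Event()
        self.thread = threading.Thread(target=self._run)

    def _run(self):
        # Pretend to initialize ranges until asked to exit.
        while not self.exit_wanted.is_set():
            time.sleep(0.001)

    def start(self):
        self.thread.start()


def stop_all(workers):
    # Phase 1: signal every worker without waiting on any of them.
    for w in workers:
        w.exit_wanted.set()
    # Phase 2: wait for all of them; they wind down in parallel.
    for w in workers:
        w.thread.join()


workers = [InitWorker("vdev%d" % i) for i in range(40)]
for w in workers:
    w.start()
stop_all(workers)
still_running = [w.name for w in workers if w.thread.is_alive()]
```

Signalling first means the total wait is bounded by the slowest
worker rather than by the sum of all workers, which is the effect the
change is after.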

This commit additionally adds some missing checks for the passed
"initialize_vdevs" input nvlist.  The contents of the user provided
input "initialize_vdevs" nvlist must be validated to ensure all
values are uint64s.  This is done in zfs_ioc_pool_initialize() in
order to keep all of these checks in a single location.

Updated the innvl and outnvl comments to match the formatting used
for all other new style ioctls.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes #8230
2019-01-07 11:03:08 -08:00
George Wilson 619f097693 OpenZFS 9102 - zfs should be able to initialize storage devices
PROBLEM
========

The first access to a block incurs a performance penalty on some platforms
(e.g. AWS's EBS, VMware VMDKs). Therefore we recommend that volumes are
"thick provisioned", where supported by the platform (VMware). This can
create a large delay in getting new virtual machines up and running (or
adding storage to an existing Engine). If the thick provision step is
omitted, write performance will be suboptimal until all blocks on the LUN
have been written.

SOLUTION
=========

This feature introduces a way to 'initialize' the disks at install or in the
background to make sure we don't incur this first read penalty.

When an entire LUN is added to ZFS, we make all space available immediately,
and allow ZFS to find unallocated space and zero it out. This works with
concurrent writes to arbitrary offsets, ensuring that we don't zero out
something that has been (or is in the middle of being) written. This scheme
can also be applied to existing pools (affecting only free regions on the
vdev). Detailed design:
        - new subcommand: zpool initialize [-cs] <pool> [<vdev> ...]
                - start, suspend, or cancel initialization
        - Creates new open-context thread for each vdev
        - Thread iterates through all metaslabs in this vdev
        - Each metaslab:
                - select a metaslab
                - load the metaslab
                - mark the metaslab as being zeroed
                - walk all free ranges within that metaslab and translate
                  them to ranges on the leaf vdev
                - issue a "zeroing" I/O on the leaf vdev that corresponds to
                  a free range on the metaslab we're working on
                - continue until all free ranges for this metaslab have been
                  "zeroed"
                - reset/unmark the metaslab being zeroed
                - if more metaslabs exist, then repeat above tasks.
                - if no more metaslabs, then we're done.

        - progress for the initialization is stored on-disk in the vdev’s
          leaf zap object. The following information is stored:
                - the last offset that has been initialized
                - the state of the initialization process (i.e. active,
                  suspended, or canceled)
                - the start time for the initialization

        - progress is reported via the zpool status command and shows
          information for each of the vdevs that are initializing
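
The per-vdev loop and the persisted "last offset" can be modelled
roughly as below. This is a simplified Python sketch under stated
assumptions: initialize_vdev(), write_pattern, and the list-of-ranges
representation are hypothetical, and the real implementation also
loads and marks metaslabs, translates ranges to the leaf vdev, and
coordinates with concurrent allocations.

```python
def initialize_vdev(metaslabs, write_pattern, last_offset=0):
    """Walk free ranges in offset order and write the pattern to each.

    metaslabs:     list of metaslabs, each a list of (start, size)
                   free ranges sorted by offset (hypothetical layout).
    write_pattern: callable(offset, size) standing in for the
                   "zeroing" I/O issued to the leaf vdev.
    last_offset:   offset already initialized before a suspend.
    Returns the new last initialized offset.
    """
    for free_ranges in metaslabs:
        for start, size in free_ranges:
            end = start + size
            if end <= last_offset:
                continue  # finished before the suspend, skip it
            begin = max(start, last_offset)
            write_pattern(begin, end - begin)
            last_offset = end
    return last_offset


written = []
resume_point = initialize_vdev(
    [[(0, 4), (8, 4)], [(16, 8)]],
    lambda offset, size: written.append((offset, size)),
    last_offset=10,
)
```

Resuming from offset 10 skips the first range entirely, finishes the
tail of the second, and then covers the third, which is why only the
last initialized offset needs to be stored on disk.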

Porting notes:
- Added zfs_initialize_value module parameter to set the pattern
  written by "zpool initialize".
- Added zfs_vdev_{initializing,removal}_{min,max}_active module options.

Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://www.illumos.org/issues/9102
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c3963210eb
Closes #8230
2019-01-07 10:37:26 -08:00
Brian Behlendorf c87db59196 Python 2 and 3 compatibility
With Python 2 (slowly) approaching EOL and its removal from distributions
already being planned (Fedora), the existing Python 2 code needs to be
transitioned to Python 3.  This patch stack updates the Python code to
be compatible with Python 2.7, 3.4, 3.5, 3.6, and 3.7.

Reviewed-by: John Ramsden <johnramsden@riseup.net>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed-by: Antonio Russo <antonio.e.russo@gmail.com>
Closes #8096
2019-01-06 10:54:12 -08:00
John Wren Kennedy dffce3c282 test-runner: python3 support
Updated to be compatible with Python 2.6, 2.7, 3.5 or newer.

Reviewed-by: John Ramsden <johnramsden@riseup.net>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Wren Kennedy <john.kennedy@delphix.com>
Closes #8096
2019-01-06 10:39:41 -08:00
Brian Behlendorf 530248d1aa arc_summary: consolidate test case
Since we're only installing one version of arc_summary we only
need one test case.  Update the test to determine which version
is available and then test its supported flags.

Remove files for misc tests which should have been cleaned up.

Reviewed-by: John Ramsden <johnramsden@riseup.net>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8096
2019-01-06 10:39:41 -08:00
Brian Behlendorf 6e72a5b9b6 pyzfs: python3 support (build system)
Almost all of the Python code in the repository has been updated
to be compatible with Python 2.6, Python 3.4, or newer.  The only
exceptions are arc_summary3.py which requires Python 3, and pyzfs
which requires at least Python 2.7.  This allows us to maintain a
single version of the code and support most default versions of
python.  This change does the following:

* Sets the default shebang for all Python scripts to python3.  If
  only Python 2 is available, then at install time scripts which
  are compatible with Python 2 will have their shebangs replaced
  with /usr/bin/python.  This is done for compatibility until
  Python 2 goes end of life.  Since only the installed versions
  are changed this means Python 3 must be installed on the system
  for test-runner when testing in-tree.

* Added --with-python=<2|3|3.4,etc> configure option which sets
  the PYTHON environment variable to target a specific python
  version.  By default the newest installed version of Python
  will be used or the preferred distribution version when
  creating packages.

* Fixed --enable-pyzfs configure checks so they are run when
  --enable-pyzfs=check and --enable-pyzfs=yes.

* Enabled pyzfs for Python 3.4 and newer, which is now supported.

* Renamed pyzfs package to python<VERSION>-pyzfs and updated to
  install in the appropriate site location.  For example, when
  building with --with-python=3.4 a python34-pyzfs will be
  created which installs in /usr/lib/python3.4/site-packages/.

* Renamed the following python scripts according to the Fedora
  guidance for packaging utilities in /bin

  - dbufstat.py     -> dbufstat
  - arcstat.py      -> arcstat
  - arc_summary.py  -> arc_summary
  - arc_summary3.py -> arc_summary3

* Updated python-cffi package name.  On CentOS 6, CentOS 7, and
  Amazon Linux it's called python-cffi, not python2-cffi.  For
  Python3 it's called python3-cffi or python3x-cffi.

* Install one version of arc_summary.  Depending on the version
  of Python available install either arc_summary2 or arc_summary3
  as arc_summary.  The user output is only slightly different.

Reviewed-by: John Ramsden <johnramsden@riseup.net>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8096
2019-01-06 10:39:41 -08:00
Brian Behlendorf 4b1c4062d0 pyzfs: python3 support (unit tests)
* Updated unit tests to be compatible with python 2 or 3.  In most
  cases all that was required was to add the 'b' prefix to existing
  strings to convert them to type bytes for python 3 compatibility.

* There were several places where the python version needed to be
  checked to remain compatible with python 2 and 3.  Someone
  more seasoned with Python may be able to find a way to rewrite
  these statements in a compatible fashion.
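
As a rough illustration of the two points above, the snippet below
shows bytes literals and an explicit version check.  It is a sketch
only: to_bytes() and the PY3 flag are hypothetical helpers, not names
taken from the pyzfs test suite.

```python
import sys

# Explicit version check, of the kind the second bullet mentions.
PY3 = sys.version_info[0] >= 3


def to_bytes(name):
    """Hypothetical helper: return name as bytes on Python 2 and 3."""
    if isinstance(name, bytes):
        # Already bytes (this includes plain str on Python 2).
        return name
    return name.encode("utf-8")


# The 'b' prefix yields bytes on Python 3 and is accepted on Python 2.
pool = b"testpool"
dataset = pool + b"/fs1"
snapshot = to_bytes("testpool/fs1") + b"@snap"
```

On Python 2 the isinstance() check returns early because str and
bytes are the same type there, so the helper needs no version branch.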

Reviewed-by: John Ramsden <johnramsden@riseup.net>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: John Wren Kennedy <john.kennedy@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8096
2019-01-06 10:39:41 -08:00
Brian Behlendorf e5fb1dc586 pyzfs: python3 support (library 2/2)
* All pool, dataset, and nvlist keys must be of type bytes.

Reviewed-by: John Ramsden <johnramsden@riseup.net>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8096
2019-01-06 10:39:41 -08:00
Antonio Russo 9de8c0cd7f pyzfs: python3 support (library 1/2)
These changes are efficient and valid in python 2 and 3. For the
most part, they are also pythonic.

* 2to3 conversion
* add __future__ imports
* iterator changes
* integer division
* relative import fixes
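
A few of these idioms, sketched in a form that behaves the same on
Python 2 and 3.  The variable names and values are made up for
illustration, and the __future__ imports themselves are not shown
because they must be the first statements in a module.

```python
props = {"compression": "lz4", "atime": "off"}

# Iterator changes: items() is a list on Python 2 and a view on
# Python 3; sorted() accepts either and always returns a list.
pairs = sorted(props.items())

# The built-in next() replaces the Python 2-only iterator.next().
first = next(iter(pairs))

# Integer division: // floors on both versions, whereas / on two
# ints differs between Python 2 and Python 3.
total_bytes = 10240
block_size = 4096
full_blocks = total_bytes // block_size
```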

Reviewed-by: John Ramsden <johnramsden@riseup.net>
Reviewed-by: Neal Gompa <ngompa@datto.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com>
Closes #8096
2019-01-06 10:39:41 -08:00
Brian Behlendorf 0b8e4418b6 Add zfs module feature and property compatibility
This change is required to ease the transition when upgrading
from 0.7.x to 0.8.x.  It allows 0.8.x user space utilities to
remain compatible with 0.7.x and older kernel modules.

Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8231
2019-01-03 15:19:37 -08:00
bunder2015 5365b0747a Add missing MMP status code to libzfs_status
When MMP was merged the status codes in libzfs_status were not
updated to add the status code for ZPOOL_STATUS_IO_FAILURE_MMP.  This
commit corrects this and adds comments to help keep track of which
code is used for which status.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #8148
Closes #8222
2019-01-03 12:15:46 -08:00
Brian Behlendorf 65ca2c1eb9 Fix 'zpool remap' freeing race
The dmu_objset_remap_indirects_impl() logic depends on dnode_hold()
returning ENOENT for dnodes which will be freed and should be skipped.

This behavior can only be relied upon when taking a new hold and
while the caller has an open transaction.  This ensures that the
open txg cannot advance and that a concurrent free will end up
in the same txg (which is critical).  Relying on an existing hold
will not prevent dnode_free() from succeeding.

The solution is to take an additional dnode_hold() after assigning
the transaction.  This ensures the remap will never dirty the dnode
if it was freed while we were waiting in dmu_tx_assign(, TXG_WAIT).

Randomly set zfs_object_remap_one_indirect_delay_ms in ztest.  This
increases the likelihood of an operation racing with the remap.
Converted from ticks to milliseconds.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8215
2019-01-02 11:46:04 -08:00
Richard Laager 06f3fc2a4b Minor tweaks to zfs.8 man page for POSIX ACLs
* Capitalize POSIX in POSIX ACLs.  This change makes the POSIX 
  in POSIX ACLs all caps, which is both correct and consistent with
  the rest of the man page.

* Slightly reword part of zfs.8.  I tweaked a sentence to add a 
  missing comma, and as long as I was editing, removed a couple
  unnecessary words.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: bunder2015 <omfgbunder@gmail.com>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #8220
2018-12-26 13:50:14 -08:00
Brad Lewis 3ec34e5527 OpenZFS 9284 - arc_reclaim_thread has 2 jobs
Following the fix for 9018 (Replace kmem_cache_reap_now() with
kmem_cache_reap_soon), the arc_reclaim_thread() no longer blocks
while reaping.  However, the code is still confusing and error-prone,
because this thread has two responsibilities.  We should instead
separate this into two threads each with their own responsibility:

 1. keep `arc_size` under `arc_c`, by calling `arc_adjust()`, which
    improves `arc_is_overflowing()`

 2. keep enough free memory in the system, by calling
    `arc_kmem_reap_now()` plus `arc_shrink()`, which improves
    `arc_available_memory()`.

Furthermore, we can use the zthr infrastructure to separate the
"should we do something" from "do it" parts of the logic, and
normalize the start up / shut down of the threads.

Authored by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan McDonald <danmcd@joyent.com>
Reviewed by: Tim Kordas <tim.kordas@joyent.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by:  Brad Lewis <brad.lewis@delphix.com>
Signed-off-by: Brad Lewis <brad.lewis@delphix.com>

OpenZFS-issue: https://www.illumos.org/issues/9284
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/de753e34f9
Closes #8165
2018-12-26 13:22:28 -08:00
Tom Caputi 00f198de6b Fix zfs_dirty_data_sync_percent documentation
In dfbe2675 zfs_dirty_data_sync was changed to a new tunable named
zfs_dirty_data_sync_percent. Unfortunately, the module parameter
documentation in the code was not updated accordingly. This patch
simply corrects that.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8212
2018-12-18 14:47:33 -08:00
Tony Hutter c66401fae0 Add enclosure_symlinks option to vdev_id
Add an 'enclosure_symlinks' option to vdev_id.conf.  This creates
consistently named symlinks to the enclosure devices (/dev/sg*) based
off the configuration in vdev_id.conf.  The enclosure symlinks show
up in /dev/by-enclosure/<prefix>-<channel><num>.  The links make it
easy to run sg_ses on a particular enclosure device.  The
enclosure links are created in addition to the normal
/dev/disk/by-vdev links.

'enclosure_symlinks' is only valid in sas_direct configurations.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Simon Guest <simon.guest@tesujimath.org>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #8194
2018-12-14 17:27:49 -08:00
Tom Caputi 7c46894081 ZTS: fix wait_scrubbed()
Currently, wait_scrubbed() is the only function of its kind that
accepts a timeout, which is 10s by default. This timeout is pretty
short for a scrub and causes test failures if we run too long. This
patch removes the timeout, instead leaning on the global test suite
timeout to ensure the tests keep moving.

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8210
2018-12-14 10:06:49 -08:00
Tom Caputi 2a6078450d Fix zap_update() ASSERT from ztest
This patch simply removes an invalid assert from the zap_update()
function. The ASSERT is invalid because it does not hold the zap
lock from the time it fetches the old value to the time it confirms
that it is what it should be.

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8209
2018-12-14 10:04:11 -08:00
Brian Behlendorf 0dd6b6bfcb ztest: ENOSPC in ztest_objset_destroy_cb()
While unlikely it is possible for dsl_destroy_head() to return
ENOSPC in the ztest_objset_destroy_cb().  This can occur even
when ZFS_SPACE_CHECK_DESTROY is used with the dsl_sync_task().
Both the existence of a checkpoint and pending deferred frees
can cause this.

Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8206
2018-12-14 10:03:05 -08:00
Paul Dagnelie 98d07d5798 OpenZFS 9559 - zfs diff handles files on delete queue in fromsnap poorly
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Joshua M. Clulow <josh@sysmgr.org>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9559
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d7e45412
Closes #8211
2018-12-14 09:50:49 -08:00
Andriy Gapon dc1c630b8a OpenZFS 9630 - add lzc_rename and lzc_destroy to libzfs_core
Porting Notes:
* Additional changes to recv_rename_impl() were required due to
  encryption code not being merged in OpenZFS yet.
* libzfs_core python bindings (pyzfs) were updated to fully support
  both lzc_rename() and lzc_destroy()

Authored by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: loli10K <ezomori.nozomu@gmail.com>

OpenZFS-issue: https://www.illumos.org/issues/9630
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/049ba63
Closes #8207
2018-12-14 09:49:45 -08:00
Ben Cordero eff7d78f8a Add cut binary to the initramfs
Since the `cut -b` command is used by `parse-zfs.sh`,
ensure that it is copied to the initramfs.

Fix spl_hostid when set by cmdline. This follows a
similar logic from the `zgenhostid` script, using `echo`
instead of `printf`.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ben Cordero <bencord0@condi.me>
Closes #8197
2018-12-13 15:48:46 -08:00
Tom Caputi 5aa95ba0d3 Fix resilver writes in vdev_indirect_io_start
This patch addresses an issue found in ztest where resilver
write zios that were passed to an indirect vdev would end up
being handled as though they were resilver read zios. This
caused issues where the zio->io_abd would be both read to
and written from at the same time, causing asserts to fail.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8193
2018-12-13 14:18:48 -08:00
Brian Behlendorf 4b70290163 Check for strlcat and strlcpy
This partially reverts commit 8005ca4 by moving the strlcat()
and strlcpy() compatibility implementations back to their original
location.

In addition, these two functions were added to the AC_CHECK_FUNCS
macro. When these functions are available from the C library,
HAVE_STRLCAT and HAVE_STRLCPY will be defined and the library version
is used. Otherwise the compatibility version is built.

Reviewed-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8157
Closes #8202
2018-12-11 16:01:41 -08:00
Richard Elling a48cd034c8 Seeing negative values for wlentime and rlentime
Linux kstat IO and TIMER printed values as signed. However the counters
only increment. Thus humans looking at the data can be confused when
the counters roll over.

Note: The recommended use of these values is to monitor the derivative,
which doesn't really care about the sign. See explanations related to
non-negative derivatives in the various time-series databases.
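
For readers unfamiliar with the idea, a non-negative derivative over
a 64-bit counter can be computed as sketched below.  This is a
generic Python illustration, not code from any particular
time-series database; as_unsigned() and non_negative_rate() are
hypothetical names.

```python
U64 = 1 << 64


def as_unsigned(value):
    """Reinterpret a value printed as signed 64-bit as unsigned."""
    return value % U64


def non_negative_rate(prev, curr, seconds):
    """Per-second rate between two samples of a monotonic counter.

    Taking the difference modulo 2**64 keeps the result non-negative
    whether the samples were printed signed or unsigned, and across
    a single wrap of the counter.
    """
    delta = (as_unsigned(curr) - as_unsigned(prev)) % U64
    return delta / float(seconds)


# Samples printed as signed: the counter wrapped from -16 to 4.
wrap_rate = non_negative_rate(-16, 4, 10)

# Samples that cross the signed boundary at 2**63.
sign_rate = non_negative_rate(2**63 - 5, -(2**63) + 5, 10)
```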

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com>
Closes #8131
Closes #8198
2018-12-11 13:56:54 -08:00
Olaf Faaland fa61e72340 Rename macro ZFS_MINOR due to Lustre conflict
Macro ZFS_MINOR, introduced in commit a6cc9756 to record the chosen
static minor number for /dev/zfs, conflicts with an existing macro
in Lustre.  The lustre macro (along with _MAJOR, _PATCH, _FIX) is
used to record the zfsonlinux version Lustre is being built against.

Since the Lustre macro came first, and is used in past versions of
lustre at least going back to 2.10, it makes sense to rename the
macro in ZFS instead of doing so in Lustre which would require
backporting the patch.

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #8195
2018-12-10 16:52:49 -08:00
Prakash Surya 900d09b285 OpenZFS 9962 - zil_commit should omit cache thrash
As a result of the changes made in 8585, it's possible for an excessive
amount of vdev flush commands to be issued under some workloads.

Specifically, when the workload consists of mostly async write activity,
interspersed with some sync write and/or fsync activity, we can end up
issuing more flush commands to the underlying storage than is actually
necessary. As a result of these flush commands, the write latency and
overall throughput of the pool can be poorly impacted (latency
increases, throughput decreases).

Currently, any time an lwb completes, the vdev(s) written to as a result
of that lwb will be issued a flush command. The intention is so the data
written to that vdev is on stable storage, prior to communicating to any
waiting threads that their data is safe on disk.

The problem with this scheme, is that sometimes an lwb will not have any
threads waiting for it to complete. This can occur when there's async
activity that gets "converted" to sync requests, as a result of calling
the zil_async_to_sync() function via zil_commit_impl(). When this
occurs, the current code may issue many lwbs that don't have waiters
associated with them, resulting in many flush commands, potentially to
the same vdev(s).

For example, given a pool with a single vdev, and a single fsync() call
that results in 10 lwbs being written out (e.g. due to other async
writes), that will result in 10 flush commands to that single vdev (a
flush issued after each lwb write completes). Ideally, we'd only issue a
single flush command to that vdev, after all 10 lwb writes completed.

Further, and most important as it pertains to this change, since the
flush commands are often very impactful to the performance of the pool's
underlying storage, unnecessarily issuing these flush commands can
poorly impact the performance of the lwb writes themselves. Thus, we
need to avoid issuing flush commands when possible, in order to achieve
the best possible performance out of the pool's underlying storage.

This change attempts to address this problem by changing the ZIL's logic
to only issue a vdev flush command when it detects an lwb that has a
thread waiting for it to complete. When an lwb does not have threads
waiting for it, the responsibility of issuing the flush command to the
vdevs involved with that lwb's write is passed on to the "next" lwb.
It's only once a write for an lwb with waiters completes, do we issue
the vdev flush command(s). As a result, now when we issue the flush(s),
we will issue them to the vdevs involved with that specific lwb's write,
but potentially also to vdevs involved with "previous" lwb writes (i.e.
if the previous lwbs did not have waiters associated with them).

Thus, in our prior example with 10 lwbs, only once the last lwb
completes (which will be the lwb containing the waiter for the thread
that called fsync) will we issue the vdev flush command; all of the
other lwbs will find they have no waiters, so they'll pass the
responsibility of the flush to the "next" lwb (until reaching the last
lwb that has the waiter).
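
The deferral described above can be sketched in plain C. This is an
illustration only, not the ZIL code: the lwb_sketch_t type and the
count_flushes() helper are hypothetical names, and each lwb is reduced
to a flag saying whether a thread is waiting on it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for an lwb: only tracks whether it has waiters. */
typedef struct lwb_sketch {
	bool has_waiter;
} lwb_sketch_t;

/*
 * Count the flush commands issued as the lwb writes complete in order.
 * With defer == false every completed lwb issues a flush (old behavior).
 * With defer == true an lwb without waiters passes the responsibility
 * to the next lwb, so only lwbs with waiters issue a flush.
 */
size_t
count_flushes(const lwb_sketch_t *lwbs, size_t n, bool defer)
{
	size_t flushes = 0;

	assert(lwbs != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!defer || lwbs[i].has_waiter)
			flushes++;
	}
	return (flushes);
}
```

For the 10-lwb example, where only the last lwb carries the fsync()
waiter, the sketch issues 10 flushes without deferral and 1 with it.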

Porting Notes:
* Reconciled conflicts with the fastwrite feature.

Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9962
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/545190c6
Closes #8188
2018-12-07 11:09:42 -08:00
Prakash Surya 53b1f5eac6 OpenZFS 9963 - Separate tunable for disabling ZIL vdev flush
Porting Notes:
* Add options to zfs-module-parameters(5) man page.
* zfs_nocacheflush move to vdev.c instead of vdev_disk.c, since
  the latter doesn't get built for user space.

Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9963
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f8fdf68125
Closes #8186
2018-12-07 11:06:29 -08:00
George Wilson 18b14b17c8 OpenZFS 9993 - zil writes can get delayed in zio pipeline
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9993
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/2258ad0b
Closes #8185
2018-12-07 11:05:35 -08:00
Andy Fiddaman e63ac16d25 OpenZFS 9880 - Race in ZFS parallel mount
Porting Notes:
* Not required for Linux since the zone is always global.  But
  we'll want this change if we start using the zones code.

Authored by: Andy Fiddaman <omnios@citrus-it.co.uk>
Reviewed by: Jason King <jason.king@joyent.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9880
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/bc4c0ff134
Closes #8189
2018-12-07 11:02:23 -08:00
Tom Caputi 4b611761bd Fix error message when zfs module is not loaded
This patch corrects a small issue where the wrong error message
was being displayed when the zfs kernel module was not loaded.
This also avoids waiting for the (by default) 10s timeout to see
if the /dev/zfs device appears.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8187
2018-12-07 10:54:38 -08:00
Tony Nguyen ef57371a92 Do not enable stack tracer for ZFS performance test
Linux ZFS test suite runs with /proc/sys/kernel/stack_tracer_enabled=1,
via the zfs.sh script, which has a negative performance impact of up to 40%.

Since large stack is a rare issue now, preferred behavior would be:
- making stack tracer an opt-in feature for zfs.sh
- zfs-test.sh enables stack tracer only when requested

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Tony Nguyen <tony.nguyen@delphix.com>
#8173
2018-12-07 10:51:42 -08:00
Tom Caputi d6496040d9 Ensure dsl scan prefetch queue is emptied
This patch simply ensures that scn->scn_prefetch_queue is emptied
before the kernel module is unloaded and when scanning completes.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8178
2018-12-06 09:47:23 -08:00
loli10K b53cb02d92 Fix 'zfs receive -F' message when destination has snapshots
When receiving a send stream with forced rollback on a dataset with
snapshots zfs suggests said snapshots must be removed to successfully
receive the stream; however the message is misleading because it
prints the dataset name instead of one of its snapshots.

   $ sudo zfs snap pp/recvfs@snap-orig
   $ sudo zfs recv -F pp/recvfs < sendstream
   cannot receive new filesystem stream: destination has snapshots (eg. pp/recvfs)
   must destroy them to overwrite it

This change simply restores the snapshot name in the error message.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8167
2018-12-05 09:33:52 -08:00
Ben Wolsieffer 2aa398f3aa Use autoconf variable for C preprocessor
This fixes the build when cross-compiling, where the preprocessor might
be prefixed.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Ben Wolsieffer <benwolsieffer@gmail.com>
Closes #8180
2018-12-05 09:31:44 -08:00
Tom Caputi e3c85c0938 Move assert in dump_dir() in zdb
This one line patch moves an assert in the function dump_dir()
below an error check that ensures it ran correctly. This ensures
zdb dumps the error that actually caused the problem, as opposed
to one of its symptoms.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8171
2018-12-05 09:30:28 -08:00
Brian Behlendorf 78e2139467 Fix dnode_hold() freeing dnode behavior
Commit 4c5b89f59 refactored dnode_hold() and in the process
accidentally introduced a slight change in behavior which was
not intended.  The required behavior is that once the ZPL,
or other consumer, declares its intent to free a dnode then
dnode_hold() should immediately start failing.  This updated
code wouldn't return the failure until after it was freed.

When DNODE_MUST_BE_ALLOCATED is set it must return ENOENT, and
when DNODE_MUST_BE_FREE is set it must return EEXIST.
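
A minimal sketch of that rule for a dnode that is in the process of
being freed. The SKETCH_* flag values and the sketch_hold_freeing()
helper are hypothetical; only the flag-to-errno mapping is taken from
the description above.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical flag values; the real DNODE_MUST_BE_* constants may differ. */
#define	SKETCH_MUST_BE_ALLOCATED	0x1
#define	SKETCH_MUST_BE_FREE		0x2

/*
 * Return the error a hold on a freeing dnode should produce for a
 * caller that passed exactly one of the two flags.
 */
int
sketch_hold_freeing(int flag)
{
	assert(flag == SKETCH_MUST_BE_ALLOCATED ||
	    flag == SKETCH_MUST_BE_FREE);
	return (flag == SKETCH_MUST_BE_ALLOCATED ? ENOENT : EEXIST);
}
```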

This issue was uncovered by ztest_remap() which attempted
to remap a freeing object which should have been skipped as
described by the comment in dmu_objset_remap_indirects_impl().

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8172
2018-12-05 09:29:33 -08:00
Brian Behlendorf c5eea0ab9c Fix 'zpool list -v' alignment
The verbose output of 'zpool list' was not correctly aligned due
to differences in the vdev name lengths.  Minimally update the
code to correct the alignment using the same strategy employed
by 'zpool status'.

Missing dashes were added for the empty defaults columns, and
the vdev state is now printed for all vdevs.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7308 
Closes #8147
2018-12-04 10:17:54 -08:00
TerraTech a0cc3726ed zfs-functions.in: is_mounted() always returns 1
The 'while read line; ...; done' loop is run in a piped subshell 
therefore the 'return 0' would not cause a return from the 
is_mounted() function.  In all cases, this function will 
always return 1.

The fix is to 'return 1' from the subshell on a successful match
(no match == return 0), and then negate the final return value.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: TerraTech <TerraTech@users.noreply.github.com>
Closes #8151
2018-12-04 09:57:29 -08:00
Tom Caputi fedef6dd59 Fix ztest deadlock in spa_vdev_remove()
This patch corrects an issue where spa_vdev_remove() would
call spa_history_log_internal() while holding the spa config
lock. This function may decide to block until the next txg if
the current one seems too full. However, since the thread is
holding the config lock, the txg sync thread cannot progress
and the system ends up deadlocked. This patch simply moves
all calls to spa_history_log_internal() outside of the config
lock.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8162
2018-12-04 09:54:05 -08:00
Tom Caputi 0b606cb33f Fix ztest deadlock in ztest_zil_remount()
This patch fixes a small race condition in ztest_zil_remount()
that could result in a deadlock. ztest_device_removal() calls
spa_vdev_remove() which may eventually call spa_reset_logs().
If ztest_zil_remount() attempts to call zil_close() while this
is happening, it may fail when it asserts !zilog_is_dirty(zilog).
This patch simply adds locking to correct the issue.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8154
2018-12-04 09:43:31 -08:00
LOLi bdbd5477bc Fix ASSERT in zfs_receive_one()
This commit fixes the following ASSERT in zfs_receive_one() when
receiving a send stream from a root dataset with the "-e" option:

    $ sudo zfs snap source@snap
    $ sudo zfs send source@snap | sudo zfs recv -e destination/recv
    chopprefix > drrb->drr_toname
    ASSERT at libzfs_sendrecv.c:3804:zfs_receive_one()

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8121
2018-12-04 09:38:55 -08:00
Brian Behlendorf 7c9a42921e Detect IO errors during device removal
* Detect IO errors during device removal

While device removal cannot verify the checksums of individual
blocks during device removal, it can reasonably detect hard IO
errors from the leaf vdevs.  Failure to perform this error
checking can result in device removal completing successfully,
but moving no data which will permanently corrupt the pool.

Situation 1: faulted/degraded vdevs

In the configuration shown below, the removal of mirror-0 will
permanently corrupt the pool.  Device removal will preferentially
copy data from 'vdev1 -> vdev3' and from 'vdev2 -> vdev4', which
in this case will result in nothing being copied since one vdev
in each of those groups is unavailable.  However, device removal
will complete successfully since all IO errors are ignored.

  tank                DEGRADED     0     0     0
    mirror-0          DEGRADED     0     0     0
      /var/tmp/vdev1  FAULTED      0     0     0  external fault
      /var/tmp/vdev2  ONLINE       0     0     0
    mirror-1          DEGRADED     0     0     0
      /var/tmp/vdev3  ONLINE       0     0     0
      /var/tmp/vdev4  FAULTED      0     0     0  external fault

This issue is resolved by updating the source child selection
logic to exclude unreadable leaf vdevs.  Additionally, unwritable
destination child vdevs which can never succeed are skipped to
prevent generating a large number of write IO errors.
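
The source child selection can be sketched as follows. This is an
illustration under stated assumptions, not the removal code itself:
leaf_sketch_t and pick_source_child() are hypothetical names, and a
leaf is modeled only by whether it can be read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical leaf vdev model: only the readable state matters here. */
typedef struct leaf_sketch {
	const char *name;
	bool readable;
} leaf_sketch_t;

/*
 * Pick a source child for the copy, starting at the preferred index and
 * skipping unreadable leaves. Returns -1 when no child can be read, so
 * the caller can fail the removal instead of silently copying nothing.
 */
int
pick_source_child(const leaf_sketch_t *children, size_t n, size_t preferred)
{
	assert(children != NULL && n > 0 && preferred < n);
	for (size_t i = 0; i < n; i++) {
		size_t c = (preferred + i) % n;

		if (children[c].readable)
			return ((int)c);
	}
	return (-1);
}
```

With mirror-0 from the configuration above (vdev1 faulted, vdev2
online), a preference for child 0 falls through to child 1.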

Situation 2: individual hard IO errors

During removal, if an unexpected hard IO error is encountered when
either reading or writing the child vdev, the entire removal
operation is cancelled.  While it may be possible to reconstruct
the data after removal, that cannot be guaranteed.  The only
strictly safe thing to do is to cancel the removal.

As a future improvement we may want to instead suspend the removal
process and allow the damaged region to be retried.  But that work
is left for another time, since hard IO errors during the removal process
are expected to be exceptionally rare.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #6900
Closes #8161
2018-12-04 09:37:37 -08:00
Tom Caputi c40a1124e1 Fix consistency of ztest_device_removal_active
ztest currently uses the boolean flag ztest_device_removal_active
to protect some tests that may not run successfully if they occur
at the same time as ztest_device_removal(). Unfortunately, in the
event that ztest is in the middle of a device removal when it
decides to issue a SIGKILL, the device removal will be
automatically restarted (without setting the flag) when the pool
is re-imported on the next run. This patch corrects this by
ensuring that any in-progress removals are completed before running
further tests after the re-import.

This patch also makes a few small changes to prevent race conditions
involving the creation and destruction of spa->spa_vdev_removal,
since this field is not protected by any locks. Some checks that
may run concurrently with setting / unsetting this field have been
updated to check spa->spa_removing_phys.sr_state instead. The most
significant change here is that spa_removal_get_stats() no longer
accounts for in-flight work done, since that could result in a NULL
pointer dereference.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8105
2018-11-28 20:47:09 -08:00
LOLi c71c8c715b zfs_dbgmsg() is not safe from every context
This commit reverts to using printk() instead of zfs_dbgmsg() to log
messages in vdev_disk_error(): this is necessary because the latter can
be called from interrupt context where we are not allowed to sleep.
Unfortunately zfs_dbgmsg() performs its allocations calling kmalloc()
with the KM_SLEEP flag which may result in the following oops:

   BUG: scheduling while atomic: swapper/4/0/0x10000100
	Call Trace:
	<IRQ>  [<0>] dump_stack+0x19/0x1b
	...
	[<0>] spl_kmem_alloc+0xdf/0x140 [spl] <-- kmem_alloc(size, KM_SLEEP)
	[<0>] __dprintf+0x69/0x150 [zfs]
	[<0>] ? kmem_cache_free+0x1e2/0x200
	[<0>] vdev_disk_error.part.15+0x5f/0x70 [zfs]
	[<0>] vdev_disk_io_flush_completion+0x48/0x70 [zfs]
	[<0>] bio_endio+0x67/0xb0
	[<0>] blk_update_request+0x90/0x360
	...
	[<0>] scsi_finish_command+0xdc/0x140
	[<0>] scsi_softirq_done+0x132/0x160
	[<0>] blk_done_softirq+0x96/0xc0
	[<0>] __do_softirq+0xf5/0x280
	[<0>] call_softirq+0x1c/0x30
	[<0>] do_softirq+0x65/0xa0
	[<0>] irq_exit+0x105/0x110
	[<0>] do_IRQ+0x56/0xf0
	[<0>] common_interrupt+0x162/0x162
	<EOI>  [<0>] ? cpuidle_enter_state+0x54/0xd0
	[<0>] cpuidle_idle_call+0xde/0x230
	[<0>] arch_cpu_idle+0xe/0xb0
	[<0>] cpu_startup_entry+0x14a/0x1e0
	[<0>] start_secondary+0x1f7/0x270
	[<0>] start_cpu+0x5/0x14

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8137 
Closes #8150
2018-11-28 11:29:57 -08:00
Tom Caputi cef48f14da Remove races from scrub / resilver tests
Currently, several tests in the ZFS Test Suite that attempt to
test scrub and resilver behavior occasionally fail. A big reason
for this is that these tests use a combination of zinject and
zfs_scan_vdev_limit to attempt to slow these operations enough
to verify their test commands. This method works most of the time,
but provides no guarantees and leads to flaky behavior. This patch
adds a new tunable, zfs_scan_suspend_progress, that ensures that
scans make no progress, guaranteeing that tests can be run without
racing.

This patch also changes zfs_remove_max_bytes_pause to match this
new tunable. This provides some consistency between these two
similar tunables and ensures that the tunable will not misbehave
on 32-bit systems.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8111
2018-11-28 10:12:08 -08:00
LOLi 00369f3338 ZTS: fix "not found" errors
This commit fixes several "not found" errors caused by calling undefined
or incorrect shell functions in the following ZFS Test Suite groups:

   * alloc_class
   * channel_program/lua_core
   * channel_program/synctask_core
   * cli_root/zpool_import
   * cli_user/misc

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: bunder2015 <omfgbunder@gmail.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8152
2018-11-27 09:39:37 -08:00
Rich Ercolani 62ee31adce Fix typo in update to zfs-module-parameters(5)
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: bunder2015 <omfgbunder@gmail.com>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #8153
2018-11-26 10:23:58 -08:00
Brian Behlendorf 8005ca4f74 Move strlcat, strlcpy, and strnlen
Move strlcat() and strlcpy() from .c source files in to the libspl
string.h header.  By changing these compatibility functions to static
inline functions they can be included as needed without requiring linking
with the libspl.so library.

Remove strnlen() which is barely used in the source, and has been
provided by glibc since v2.10.

Finally, convert four instances of strncpy() to strlcpy() in
libzfs_input_check.c which were causing build warnings when compiling
with gcc 8.2.1.  For example:

  libzfs_input_check.c: In function ‘zfs_destroy’:
  libzfs_input_check.c:651:9: error: ‘strncpy’ specified bound \
      4096 equals destination size [-Werror=stringop-truncation]
    (void) strncpy(zc.zc_name, dataset, sizeof (zc.zc_name));
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
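
For reference, a strlcpy()-style copy behaves roughly as sketched
below. The function is named sketch_strlcpy() to make clear that it is
an illustration and not the libspl implementation: it always
NUL-terminates the destination and returns the source length, which is
what avoids the truncation warning shown above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Illustrative strlcpy-style copy: always NUL-terminates when
 * dstsize > 0 and returns strlen(src) so callers can detect truncation
 * by comparing the result against dstsize.
 */
size_t
sketch_strlcpy(char *dst, const char *src, size_t dstsize)
{
	size_t srclen = strlen(src);

	assert(dst != NULL || dstsize == 0);
	if (dstsize > 0) {
		size_t n = (srclen < dstsize - 1) ? srclen : dstsize - 1;

		memcpy(dst, src, n);
		dst[n] = '\0';
	}
	return (srclen);
}
```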

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8116
2018-11-20 10:37:49 -08:00
LOLi 0cd5c941d0 zpool: allow split with whole-disk devices
This change allows 'zpool split' to work with whole-disk devices and
updates the ZFS Test Suite with a new script to exercise this
functionality.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6643 
Closes #8133
2018-11-20 10:22:53 -08:00
Christian Schwarz bd9c195805 man/zfs.8: document 'received' property source
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Christian Schwarz <me@cschwarz.com>
Closes #8134
2018-11-20 09:59:43 -08:00
John Wren Kennedy 70621ff20e ZTS: Fix parsing of zpool status in checksum test
filetest_001_pos consumes the output using read -r, assigning each
field to a variable. The problem comes when a vdev is marked degraded,
which appends extra fields to the line. This causes the trailing text
to be treated as part of the `cksum` variable. Using awk instead of
read -r allows us to extract the checksum error count from the output
whether the vdev is degraded or not.
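
The test itself uses awk. The same field-based idea, shown here in C
purely as an illustration, is to take the fifth whitespace-separated
field and ignore whatever follows it. The parse_cksum() helper is
hypothetical, and the line layout (name, state, read, write, cksum) is
assumed from the usual 'zpool status' columns.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Extract the fifth whitespace-separated field (CKSUM) from one vdev
 * line of 'zpool status' output, ignoring any trailing text. Returns 0
 * on success and -1 if the line does not have five matching fields.
 */
int
parse_cksum(const char *line, unsigned long long *cksum)
{
	char name[256], state[32];
	unsigned long long rd, wr;

	assert(line != NULL && cksum != NULL);
	if (sscanf(line, "%255s %31s %llu %llu %llu",
	    name, state, &rd, &wr, cksum) != 5)
		return (-1);
	return (0);
}
```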

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Wren Kennedy <john.kennedy@delphix.com>
Closes #8136
2018-11-20 09:51:42 -08:00
LOLi ebb8735901 ZTS: "checksum" test group needs "lscpu"
This change adds "lscpu" to the list of commands used by the ZFS Test
Suite: this is required by the "checksum" test group to read the CPU
frequency which is used in EdonR, Skein and SHA2 performance tests.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8139
2018-11-20 09:47:58 -08:00
Sebastien Roy a10d50f999 OpenZFS 8115 - parallel zfs mount
Porting Notes:
* Use thread pools (tpool) API instead of introducing taskq interfaces
  to libzfs.
* Use pthread_mutex_t for locks as mutex_t isn't available.
* Ignore alternative libshare initialization since OpenZFS-7955 is
  not present on zfsonlinux.

Authored by: Sebastien Roy <seb@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Authored by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Matt Ahrens <mahrens@delphix.com>
Ported-by: Don Brady <don.brady@delphix.com>

OpenZFS-issue: https://www.illumos.org/issues/8115
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a3f0e2b569
Closes #8092
2018-11-15 11:33:58 -08:00
Brian Behlendorf af2e8411da Tag 0.8.0-rc2
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2018-11-12 11:57:15 -08:00
kpande eb1a0b6174 Allow spaces in pool names for cmdline argument
PR #8114 quoted the ${ENCRYPTIONROOT} parameter to ensure we don't
lose spaces when unlocking root filesystem in the off chance that 
it has a space in its name.

Unfortunately, dracut and initramfs-tools do not actually get the 
quotes from the cmdline. If we use root=ZFS="root pool/filesystem 
name" the script still only sees root=ZFS=root and no quotation 
marks.

Because + is a reserved character in ZFS, it's used as a 
placeholder for spaces in the kernel cmdline.  In this way,
root=ZFS=root+pool/filesystem+name will properly expand by 
replacing the character with sed (POSIX compliant method).
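
The scripts do this substitution with sed. The equivalent operation,
shown in C only to make the expansion concrete, is an in-place
replacement of every '+' with a space. The expand_plus() helper is a
hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Replace every '+' in a root=ZFS= dataset argument with a space. */
void
expand_plus(char *s)
{
	assert(s != NULL);
	for (char *p = strchr(s, '+'); p != NULL; p = strchr(p + 1, '+'))
		*p = ' ';
}
```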

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: bunder2015 <omfgbunder@gmail.com>
Signed-off-by: Kash Pande <kash@tripleback.net>
Issue #8114 
Closes #8117
2018-11-11 18:23:11 -08:00
LOLi c8fd652ce7 Fix coverity defects: CID 184285
CID 184285: Read from pointer after free (USE_AFTER_FREE)

This patch fixes a use-after-free in vdev_config_generate_stats()
by moving the kmem_free() call to the end of the function.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8120
2018-11-11 18:09:00 -08:00
Brian Behlendorf ecd3728b26 Fix systemd spec file macros
Ensure that the _unitdir, _presetdir, _modulesloaddir, and
_systemdgeneratordir macros are always defined.  If not set
them to the expected default values.  Pass all of these options
to ./configure and package the resulting files in those locations.

Additionally, set __brp_mangle_shebangs_exclude_from until the
conversion to Python 3 is complete so they may be built cleanly
under mock.

Reviewed-by: Neal Gompa <ngompa@datto.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7567
Closes #8119
2018-11-11 18:06:36 -08:00
Garrett Fields 0500bfd0b9 Make initramfs-tools script encryption aware
Changed decrypt_fs zfs command to "load-key"
Plymouth case code based on "contrib/dracut/90zfs/zfs-lib.sh.in"
Systemd case based on "contrib/dracut/90zfs/zfs-load-key.sh.in"
Cleaned up misspelling of "available" throughout

Code style fixes
Single quote for ${ENCRYPTIONROOT}
Changed "${DECRYPT_CMD}"  to "eval ${DECRYPT_CMD}"

Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Garrett Fields <ghfields@gmail.com>
Closes #8093
2018-11-09 11:30:09 -08:00
loli10K d48091de81 zed: detect and offline physically removed devices
This commit adds a new test case to the ZFS Test Suite to verify ZED
can detect when a device is physically removed from a running system:
the device will be offlined if a spare is not available in the pool.

We implement this by using the existing libudev functionality and
without relying solely on the FM kernel module capabilities which have
been observed to be unreliable with some kernels.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #1537
Closes #7926
2018-11-09 11:17:24 -08:00
kpande 13c59bb76b Add quotations for ${ENCRYPTIONROOT}
Add quotations for ${ENCRYPTIONROOT} to avoid breaking systems
with a space in the name.

Reviewed-by: bunder2015 <omfgbunder@gmail.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kash Pande <kash@tripleback.net>
Related-to: #8093 
Closes #8114
2018-11-09 09:32:01 -08:00
Tony Hutter ad796b8a3b Add zpool status -s (slow I/Os) and -p (parseable)
This patch adds a new slow I/Os (-s) column to zpool status to show the
number of VDEV slow I/Os. This is the number of I/Os that didn't
complete in zio_slow_io_ms milliseconds. It also adds a new parsable
(-p) flag to display exact values.

 	NAME         STATE     READ WRITE CKSUM  SLOW
 	testpool     ONLINE       0     0     0     -
	  mirror-0   ONLINE       0     0     0     -
 	    loop0    ONLINE       0     0     0    20
 	    loop1    ONLINE       0     0     0     0

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7756
Closes #6885
2018-11-08 16:47:24 -08:00
George Melikov 877d925a9e Update zfs_admin_snapshot value (disabled)
It's disabled by default, update code and tests to reflect
the documentation.

Minor cleanup in delegate_common.kshlib.

Reviewed-by: Gregor Kopka <gregor@kopka.net>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #7835 
Closes #8045
2018-11-08 16:17:12 -08:00
Tom Caputi d8244d34bd ZTS: Fix and reenable zfs_rename tests
zfs_rename_006_pos has been flaky in the past because it was
missing a call to block_device_wait to ensure the zvols it creates
are present before running dd. Whenever this happened,
zfs_rename_009_neg would also fail because the first test would
leak a zvol clone that it did not know how to clean up. This patch
fixes the root cause and reenables the test. It also fixes some
minor grammar errors.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #5647 
Closes #5648 
Closes #8088
2018-11-07 16:59:27 -08:00
Paul Zuchowski c2bcfa71f4 ZTS: Fix test zfs_mount_006_pos
For Linux, place a file in the mount point folder so it will be
considered "busy".  Fix the while loop so it doesn't rm in
directories above the testdir.  Add Linux-specific code to test
overlay on|off.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #4990 
Closes #8081
2018-11-07 16:54:08 -08:00
Tony Hutter d7bda38c76 Add BuildRequires gcc, make, elfutils-libelf-devel
This adds a BuildRequires for gcc, make, and elfutils-libelf-devel
into our spec files.  gcc has been a packaging requirement for
a while now:

https://fedoraproject.org/wiki/Packaging:C_and_C%2B%2B

These additional BuildRequires allow us to mock build in
Fedora 29.

Reviewed-by: Neal Gompa <ngompa@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:  Tony Hutter <hutter2@llnl.gov>
Closes #8095
Closes #8102
2018-11-07 15:48:24 -08:00
Tom Caputi a2d88f778a Fix !zilog_is_dirty() assert during ztest
ztest occasionally hits an assert that !zilog_is_dirty() during
zil_close(). This is caused by an interaction between 2 threads.
First, ztest_run() waits for each test thread to complete and
closes the associated dataset as soon as the thread joins. At
the same time, the ztest_vdev_add_remove() test may attempt to
remove the slog, which will open, dirty, and reset the logs on
every dataset in the pool (including those of other threads).
This patch simply ensures that we always join all of the test
threads before closing any datasets.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8094
2018-11-07 15:46:50 -08:00
Tom Caputi 20eb30d08e Fix divide by zero during indirect split damage
This patch simply ensures that vdev_indirect_splits_damage()
cannot hit a divide by zero exception if a split has no
children with valid data. The normal reconstruction code
path in vdev_indirect_reconstruct_io_done() already has this
check.
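
A minimal sketch of the guard, assuming the damage code picks a child
by taking a random value modulo the number of valid children. The
pick_child_to_damage() helper and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Choose which valid child of a split to damage. 'rnd' stands in for a
 * random number. Returns -1 instead of evaluating rnd % 0 when the
 * split has no children with valid data.
 */
int
pick_child_to_damage(size_t valid_children, unsigned int rnd)
{
	if (valid_children == 0)
		return (-1);
	return ((int)(rnd % valid_children));
}
```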

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8086
2018-11-07 15:44:56 -08:00
Tom Caputi fde25c0a87 Fix dirtying vdev config on with RO spa
This patch simply corrects an issue where vdev_dtl_reassess()
could attempt to dirty the vdev config even when the spa was
not eligible for writing.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8085
2018-11-07 15:44:14 -08:00
Tom Caputi f44ad9297d Replay logs before starting ztest workers
This patch ensures that logs are replayed on all datasets prior
to starting ztest workers. This ensures that the call to
vdev_offline() on a log device in ztest_fault_inject() will not fail
due to the log device being required for replay.

This patch also fixes a small issue found during testing where
spa_keystore_load_wkey() does not check that the dataset specified
is an encryption root. This check was present in libzfs, however.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8084
2018-11-07 15:40:24 -08:00
Tom Caputi ac53e50f79 Fix vdev removal finishing race
This patch fixes a race condition where the end of
vdev_remove_replace_with_indirect(), which holds
svr_lock, would race against spa_vdev_removal_destroy(),
which destroys the same lock and is called asynchronously
via dsl_sync_task_nowait().

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue #6900 
Closes #8083
2018-11-07 15:38:10 -08:00
Tom Caputi 4021ba4cfa Make vdev_set_deferred_resilver() recursive
vdev_clear() can call vdev_set_deferred_resilver() with a
non-leaf vdev to set up a deferred resilver. However, this
function is currently written to only handle leaf vdevs.
This bug was introduced with deferred resilvers in 80a91e74.
This patch makes this function recursive so that it can find
appropriate vdevs to resilver and set vdev_resilver_deferred
on them.
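
The recursion can be sketched on a reduced tree model. The
vdev_sketch_t type, its needs_resilver field, and the
set_deferred_resilver() helper are hypothetical; the real function
works on vdev_t and decides per leaf by its own criteria.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, much-reduced vdev: a tree node with per-leaf flags. */
typedef struct vdev_sketch {
	struct vdev_sketch **children;
	size_t nchildren;
	bool needs_resilver;	/* assumed per-leaf condition */
	bool resilver_deferred;
} vdev_sketch_t;

/*
 * Walk the tree from any vdev, leaf or interior, and mark each leaf
 * that needs a resilver. Returns the number of leaves marked.
 */
size_t
set_deferred_resilver(vdev_sketch_t *vd)
{
	size_t marked = 0;

	assert(vd != NULL);
	if (vd->nchildren == 0) {
		if (vd->needs_resilver) {
			vd->resilver_deferred = true;
			marked = 1;
		}
		return (marked);
	}
	for (size_t c = 0; c < vd->nchildren; c++)
		marked += set_deferred_resilver(vd->children[c]);
	return (marked);
}
```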

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue #7732
Closes #8082
2018-11-07 15:33:17 -08:00
Don Brady 95692927f2 Fix libudev dependency in libzutil
ZFS should be able to build without libudev installed. The recent
change for libzutil inadvertently broke that.  Make the libudev code
conditional in zutil_import.c to resolve the build failure.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #8097
2018-11-06 17:47:52 -08:00
LOLi f0f9786545 zpool: bogus error for invalid dedupditto value
When provided with an invalid 'dedupditto' value zpool prints
a misleading error message:

    $ sudo zpool set dedupditto=99 pp
    cannot set property for 'pp': property 'dedupditto'(14) not defined

Fix this by printing a meaningful error description for unsupported
'dedupditto' values.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #8079
2018-11-06 10:14:56 -08:00
Brian Behlendorf 09b85f2ded ztest: reduce gangblock creation
In order to validate the gang block code ztest is configured to
artificially force a fraction of large blocks to be written as
gang blocks.  The default setting chosen for this was to
write 25% of all blocks 32k or larger using gang blocks.

The confluence of an unrealistically large number of gang blocks,
the aggressive fault injection done by ztest, and the split
segment reconstruction logic introduced by device removal has
resulted in the following type of failure:

  zdb -bccsv -G -d ... exit code 3

Specifically, zdb was unable to open the pool because it was
unable to reconstruct a damaged block.  Manual investigation
of multiple failures clearly showed that the block could be
reconstructed.  However, due to the large number of damaged
segments (>35) it could not be done in the allotted time.

Furthermore, the large number of gang blocks was determined
to be the reason for the unrealistically large number of
damaged segments.  In order to make this situation less
likely, this change both increases the forced gang block
size to 64k and reduces the frequency to 3% of blocks.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8080
2018-11-05 11:53:49 -08:00
Don Brady e89f1295d4 Add libzutil for libzfs or libzpool consumers
Adds a libzutil for utility functions that are common to libzfs and
libzpool consumers (most of what was in libzfs_import.c).  This
removes the need for utilities to link against both libzpool and
libzfs.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #8050
2018-11-05 11:22:33 -08:00
Richard Elling 6644e5bb6e Update zfs-events.5 with info from PSARC 2009/497
Update zfs-events.5 with info from PSARC 2009/497 regarding ereport fields.
Also updates ZIO_STAGE_* and ZIO_FLAG_* descriptions to match current source.

Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com>
Closes #8057
2018-11-01 15:54:55 -07:00
Paul Zuchowski 04a88fc00c ZTS: Fix posix ACL tests that should pass
Make sure tests have proper include files.  Make sure underlying
"chmod" style permissions don't interfere with ACLs.

Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #8069
2018-10-31 18:58:43 -05:00
George Melikov 58aeb87a8f ZTS: change $(cat) to $(<) for speedup
It's better to use ksh/bash built in methods,
rather than spawn new processes every time.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Wren Kennedy <john.kennedy@delphix.com>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #8071
2018-10-31 12:00:06 -05:00
Matthew Ahrens 9553c533a6 bpobj_enqueue_subobj() should copy small subobj's
When we delete a snapshot, we consolidate some bpobj's together because
we no longer need to keep their entries in separate buckets.  This is
done in constant time by including the "sub" bpobj by reference in the
parent bpobj.

After many snapshots have been deleted, we may have many sub-bpobj's.
Usually, most sub-bpobj's don't contain many BP's.  Compared to this
small payload, the sub-bpobj is relatively heavyweight since it is an
object in the MOS.  A common scenario on a long-lived pool is for the
vast majority of MOS objects to be small sub-bpobj's.

To improve this situation, when consolidating bpobj's together,
bpobj_enqueue_subobj() can copy the contents of small bpobj's into the
parent, and then delete the enqueued bpobj, rather than including it by
reference.  Since this copying is limited in size (to one block), the
consolidation is still constant time, though with a larger constant due
to reading in the one block of the enqueued bpobj.

This idea and mechanism are similar to how we handle "sub-subobj's".
When including a sub-bpobj by reference, if the sub-bpobj itself has
less than a block of sub-sub-bpobj's, the list of sub-sub-bpobj's is
copied to the parent bpobj's list of sub-bpobj's.
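
A minimal sketch of the copy-versus-reference decision, in Python
and under stated assumptions: `Bpobj`, `enqueue_subobj`, and
`BLOCK_CAPACITY` are hypothetical names, the capacity of 4 is made
up, and skipping the copy when the sub-bpobj has its own sub-bpobj's
is a simplification of this sketch rather than something the patch
is known to do:

```python
from dataclasses import dataclass, field
from typing import List

BLOCK_CAPACITY = 4  # hypothetical: number of BPs that fit in one block


@dataclass
class Bpobj:
    bps: List[int] = field(default_factory=list)
    subobjs: List["Bpobj"] = field(default_factory=list)


def enqueue_subobj(parent: Bpobj, sub: Bpobj) -> bool:
    """Return True if sub was copied into parent, False if referenced."""
    if not sub.subobjs and len(sub.bps) <= BLOCK_CAPACITY:
        # Small payload: copy the entries, then drop the sub-bpobj.
        parent.bps.extend(sub.bps)
        sub.bps.clear()  # stands in for deleting the enqueued bpobj
        return True
    # Large payload: keep the constant-time include-by-reference path.
    parent.subobjs.append(sub)
    return False


parent = Bpobj([1])
small = Bpobj([2, 3])
big = Bpobj(list(range(10)))
copied_small = enqueue_subobj(parent, small)
copied_big = enqueue_subobj(parent, big)
```

Because the copy is bounded by one block of entries, the cost per
call stays bounded regardless of how many snapshots were deleted.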

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8053 
Issue #7908
2018-10-31 11:58:17 -05:00
Brian Behlendorf 82c0a050fc Linux 4.20 compat: current_kernel_time()
Commit torvalds/linux@976516404 removed the current_kernel_time()
function (and several others).  All callers are expected to use
current_kernel_time64().  Update the gethrestime_sec() wrapper
accordingly.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8074
2018-10-31 11:50:42 -05:00
Md Islam 9042f6033a Improve snapshot listing error message
Provide a hint in the error message if listing snapshots for a
single dataset fails.

Using -r is not needed to list all snapshots so requiring it when
listing snapshots for a single dataset makes it confusing. This
change will make the error message more clear.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Md Islam <mdnahian@outlook.com>
Closes #8047
2018-10-30 11:47:50 -05:00
Serapheim Dimitropoulos 0a544c174d zdb -k does not work on Linux when used with -e
This minor bug was introduced with the port of the feature from
OpenZFS to ZoL. This patch fixes the issue that was caused by
a minor re-ordering from the original code.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8001
2018-10-30 11:46:18 -05:00
Gregor Kopka 63a77ae3cf Added column definitions to arcstat.py
grow: ARC Grow enabled (!arc_no_grow)
free: ARC Free memory (arc_sys_free)
need: ARC Reclaim need (arc_need_free)

Fixed alignment issues (mread had wrong width).

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gregor Kopka <gregor@kopka.net>
Closes #8058
2018-10-29 18:18:20 -05:00
Brian Behlendorf bea7578356 ZTS: Fix auto_replace_001_pos test
The root cause of these failures is that udev can notify the
ZED of a newly created partition before its links are created.
Handle this by allowing an auto-replace to briefly wait until
udev confirms the links exist.

Distill this test case down to its essentials so it can be run
reliably.  What we need to check is that:

  1) A new disk, in the same physical location, is automatically
     brought online when added to the system,
  2) It completes the replacement process, and
  3) The pool is now ONLINE and healthy.

There is no need to remove the scsi_debug module.  After exporting
the pool the disk can be zeroed, removed, and then re-added to the
system as a new disk.

Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8051
2018-10-29 15:05:14 -05:00
Brian Behlendorf b74f48fe1b Fix flake8 "invalid escape sequence 'x'" warning
From, https://lintlyci.github.io/Flake8Rules/rules/W605.html

As of Python 3.6, a backslash-character pair that is not a valid
escape sequence now generates a DeprecationWarning. Although this
will eventually become a SyntaxError, that will not be for several
Python releases.

Note 'float_pobj' was simply removed from arcstat.py since it
was entirely unused.

Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8056
2018-10-24 23:26:08 -07:00
Brian Behlendorf b3d7725c94 Remove zfs_gitrev.h
This generated file was accidentally included in previous commit,
80a91e7, and should not be included in the repository.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed by: Don Brady <don.brady@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8054
2018-10-24 14:48:14 -07:00
Tom Caputi 8cb119e3dc Fix 2 small bugs with cached dsl_scan_phys_t
This patch corrects 2 small bugs where scn->scn_phys_cached was
not properly updated to match the primary copy when it needed to
be. The first resulted in the pause state not being properly
updated and the second resulted in the cached version being
completely zeroed even if the primary was not.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
2018-10-24 14:37:41 -07:00
Tom Caputi 7d658d29cf Fix waiting in ztest_device_removal()
spa->spa_vdev_removal is created in a sync task that is initiated
via dsl_sync_task_nowait(). Since the task may not run before
spa_vdev_remove() returns, we must wait at least 1 txg to ensure
that the removal struct has been created.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
2018-10-24 14:37:35 -07:00
Tom Caputi 7ab96299e5 Fix ENXIO from spa_ld_verify_logs() in ztest
This patch fixes a small issue where the zil_check_log_chain()
code path would hit an EBUSY error. This would occur when
2 threads attempted to call metaslab_activate() at the same time.
In this case, the "loser" would receive an error code which should
have been ignored, but was instead floated to the caller. This
ended up resulting in an ENXIO being returned from
spa_ld_verify_logs().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
2018-10-24 14:37:33 -07:00
Tom Caputi 4a7eb69a5a Fix ztest deadman panic with indirect vdev damage
This patch fixes an issue where ztest's deadman thread would
trigger a panic because reconstructing artificially damaged
blocks would take too long. This patch simply limits how often
ztest inflicts split-block damage and how many segments it can
damage when it does.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
2018-10-24 14:37:31 -07:00
Tom Caputi 5e0bd0ae05 Fix issue with scanning dedup blocks as scan ends
This patch fixes an issue discovered by ztest where
dsl_scan_ddt_entry() could add I/Os to the dsl scan queues
between when the scan had finished all required work and
when the scan was marked as complete. This caused the scan
to spin indefinitely without ending.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
2018-10-24 14:37:15 -07:00
Tom Caputi a783dd9684 Fix lock inversion in txg_sync_thread()
This patch fixes a lock inversion issue in txg_sync_thread() where
the code would attempt to hold the spa config lock as a reader while
holding tx->tx_sync_lock. This races with spa_vdev_remove() which
attempts to hold the tx->tx_sync_lock to assign a new tx (via
spa_history_log_internal()) while holding the spa config lock as a
writer.
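
The commit message does not spell out how the inversion was removed.
One general way to avoid this class of bug, shown here purely as an
assumption-laden Python sketch, is to make both code paths take the
two locks in the same order. The names are hypothetical stand-ins,
and a plain mutex replaces the reader/writer config lock:

```python
import threading

config_lock = threading.Lock()  # stand-in for the spa config lock
sync_lock = threading.Lock()    # stand-in for tx->tx_sync_lock
log = []


def sync_thread_step():
    # Take the config lock first, never while already holding sync_lock.
    with config_lock:
        with sync_lock:
            log.append("sync")


def vdev_remove_step():
    # Same order as above, so neither thread can wait on the other's
    # second lock while holding its first.
    with config_lock:
        with sync_lock:
            log.append("remove")


threads = [
    threading.Thread(target=step)
    for step in (sync_thread_step, vdev_remove_step)
    for _ in range(50)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With a single agreed order, every thread finishes; with opposite
orders, as in the bug described above, two threads can each hold one
lock and wait forever for the other.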

Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
2018-10-24 14:37:02 -07:00
Tom Caputi ab4c009e3d Fix dbgmsg printing in ztest and zdb
This patch resolves a problem where the -G option in both zdb and
ztest would cause the code to call __dprintf() to print zfs_dbgmsg
output. This function was not properly wired to add messages to the
dbgmsg log as it is in userspace and so the messages were simply
dropped. This patch also tries to add some degree of distinction to
dprintf() (which now prints directly to stdout) and zfs_dbgmsg()
(which adds messages to an internal list that can be dumped with
zfs_dbgmsg_print()).

In addition, this patch corrects an issue where ztest used a global
variable to decide whether to dump the dbgmsg buffer on a crash.
This did not work because ztest spins up more instances of itself
using execv(), which did not copy the global variable to the new
process. The option has been moved to the ztest_shared_opts_t
which already exists for interprocess communication.

This patch also changes zfs_dbgmsg_print() to use write() calls
instead of printf() so that it will not fail when used in a signal
handler.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
2018-10-24 14:36:50 -07:00
Tom Caputi c04812f964 Fix ASSERT in zil_create() during ztest
This patch corrects an ASSERT in zil_create() that will only be
true if the call to zio_alloc_zil() does not fail.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
2018-10-24 14:36:40 -07:00
Tom Caputi 9410257800 Fix random ztest_deadman_thread failures
The zloop test has been failing in buildbot for the last few weeks
with various failures in ztest_deadman_thread(). This is due to the
fact that this thread is not stopped when performing pool import /
export tests as it should be. This patch simply corrects this.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #8010
2018-10-24 14:36:21 -07:00
George Melikov e871a8f058 Allow use of pool GUID as root pool
This is helpful when several pools share the same name but only
one of them should be used.

The main use case is twin servers where some software
(e.g. Proxmox) requires the pools to have the same name.

Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Igor ‘guardian’ Lidin of Moscow, Russia
Closes #8052
2018-10-23 20:06:40 -07:00
Brian Behlendorf 3449243b6d ZTS: Update project quota tests
e2fsprogs v1.44.1, which provides lsattr, added a new attribute
for ext3 called "verity".  It is reported after the project quota
flag as a 'V' character in the `lsattr` output.

Update projectid_001_pos.ksh and projecttree_001_pos.ksh to use
a pattern which will match the expected output in both cases.

Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8043
2018-10-23 19:53:14 -07:00
Matthew Ahrens da4f331b41 Make gitrev more reliable
In some build methods, the gitrev is unnecessarily set to "unknown".
We can improve this by changing the gitrev to use
`git describe --always --long --dirty`.

This gets the revision even when no tag matches (--always).  It prints
the hash even when it exactly matches a tag (--long).  And if there are
uncommitted changes, it appends "-dirty", rather than failing (--dirty).

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Thode <prometheanfire@gentoo.org>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8034
2018-10-22 12:23:30 -07:00
Paul Dagnelie ae3d849142 OpenZFS 9688 - aggsum_fini leaks memory
Porting Notes:
- Most of these fixes were applied in the original 37fb3e43
  commit when this change was ported for Linux.

Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: George Melikov <mail@gmelikov.ru>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9688
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/29bf2d68be
Closes #8042
2018-10-19 12:08:03 -07:00
Serapheim Dimitropoulos 9b2266e3d8 OpenZFS 9682 - page fault in dsl_async_clone_destroy() while opening pool
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/9682
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ade2c82828
Closes #8037
2018-10-19 12:06:21 -07:00
Serapheim Dimitropoulos ee900344f2 OpenZFS 9690 - metaslab of vdev with no space maps was flushed during removal
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9690
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4e75ba6826
Closes #8039
2018-10-19 12:05:03 -07:00
Matthew Ahrens d637db98e1 OpenZFS 9681 - ztest failure in spa_history_log_internal due to spa_rename()
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9681
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/6aee0ad7
Closes #8041
2018-10-19 12:02:28 -07:00
Tom Caputi 80a91e7469 Defer new resilvers until the current one ends
Currently, if a resilver is triggered for any reason while an
existing one is running, zfs will immediately restart the existing
resilver from the beginning to include the new drive. This causes
problems for system administrators when a drive fails while another
is already resilvering. In this case, the optimal thing to do to
reduce risk of data loss is to wait for the current resilver to end
before immediately replacing the second failed drive, which allows
the system to operate with two incomplete drives for the minimum
amount of time.

This patch introduces the resilver_defer feature that essentially
does this for the admin without forcing them to wait and monitor
the resilver manually. The change requires an on-disk feature
since we must mark drives that are part of a deferred resilver in
the vdev config to ensure that we do not assume they are done
resilvering when an existing resilver completes.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: @mmaybee 
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7732
2018-10-18 21:06:18 -07:00
Allan Jude 9f438c5f94 OpenZFS 9862 - fix typo in comment in vdev_impl.h
Authored by: Allan Jude <allanjude@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Tony Hutter <hutter2@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/9862
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/84927f52
Closes #8036
2018-10-18 15:09:27 -07:00
Matthew Thode 8d43194003 Allow copy-builtin to work with modified sources
`scripts/make_gitrev.sh` had 'set -e' so if any command failed it would
fail and cause copy-builtin to fail (copy-builtin also has `set -e`).
This commit also simplifies scripts/make_gitrev.sh to always write a
file by using a cleanup function.  It also simplifies other areas of
the script as well (making it much shorter).

Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Thode <mthode@mthode.org>
Closes #8022 
Closes #8025
2018-10-17 12:06:05 -07:00
LOLi 2e55034471 zpool: allow sharing of spare device among pools
ZFS allows, by default, sharing of spare devices among different pools;
this commit simply restores this functionality for disk devices and
adds an additional test case to the ZFS Test Suite to prevent future
regression.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7999
2018-10-17 11:21:07 -07:00
Matthew Ahrens 49394a7708 Linux does not HAVE_SMB_SHARE
Since Linux does not have an in-kernel SMB server, we don't need the
code to manage it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8032
2018-10-17 10:31:38 -07:00
Matthew Ahrens 5fbf85c4e2 Linux does not HAVE_DNLC
Since Linux does not have the Directory Name Lookup Cache, we don't need
the code to manage it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8031
2018-10-17 10:30:08 -07:00
bunder2015 bfcb82cb54 Advise users to retain issue/PR templates
Occasionally we get issues and PRs from users who delete the
templates.  Advise users that their issues and PRs may be closed if
they do not fill out the templates as we really need this information.

Also updating PR template to drop unneeded approval toggle as we are
now using issue labels for status tracking.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #8029
2018-10-17 10:25:38 -07:00
Paul Dagnelie d52d80b700 Add types to featureflags in zfs
The boolean featureflags in use thus far in ZFS are extremely useful,
but because they take advantage of the zap layer, more interesting data
than just a true/false value can be stored in a featureflag. In redacted
send/receive, this is used to store the list of redaction snapshots for
a redacted dataset.

This change adds the ability for ZFS to store types other than a boolean
in a featureflag. The only other implemented type is a uint64_t array.
It also modifies the interfaces around dataset features to accommodate
the new capabilities, and adds a few new functions to increase
encapsulation.
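
To make the two value types concrete, here is a small Python sketch.
It is not the ZFS interface: `FeatureType` and
`validate_feature_value` are hypothetical names that only illustrate
a boolean flag next to a uint64_t array payload:

```python
from enum import Enum

UINT64_MAX = 2**64 - 1


class FeatureType(Enum):
    BOOLEAN = "boolean"
    UINT64_ARRAY = "uint64_array"


def validate_feature_value(ftype: FeatureType, value) -> bool:
    """Check that value is acceptable for a feature of the given type."""
    if ftype is FeatureType.BOOLEAN:
        return isinstance(value, bool)
    if ftype is FeatureType.UINT64_ARRAY:
        return isinstance(value, list) and all(
            # bool is a subclass of int in Python, so exclude it.
            isinstance(v, int)
            and not isinstance(v, bool)
            and 0 <= v <= UINT64_MAX
            for v in value
        )
    return False


ok_flag = validate_feature_value(FeatureType.BOOLEAN, True)
ok_list = validate_feature_value(FeatureType.UINT64_ARRAY, [1, UINT64_MAX])
bad_list = validate_feature_value(FeatureType.UINT64_ARRAY, [2**64])
```

A redacted dataset's list of redaction snapshots would be an example
of the array case described above.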

This functionality will be used by the Redacted Send/Receive feature.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7981
2018-10-16 11:15:04 -07:00
ilbsmart 779a6c0bf6 deadlock between mm_sem and tx assign in zfs_write() and page fault
The bug time sequence:
1. Thread #1: `zfs_write` assigns a txg "n".
2. In the same process, thread #2 takes an mmap page fault (which
   means `mm_sem` is held); `zfs_dirty_inode` fails to open a txg
   and waits for the previous txg "n" to complete.
3. Thread #1 calls `uiomove` to write, but a page fault occurs in
   `uiomove`, which means it needs `mm_sem`. Since `mm_sem` is held
   by thread #2, thread #1 is stuck and cannot complete, so txg "n"
   will not complete.

So thread #1 and thread #2 are deadlocked.

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Grady Wong <grady.w@xtaotech.com>
Closes #7939
2018-10-16 11:11:24 -07:00
Brian Behlendorf b2030e5d51 Add zts-report.py to python shebang exclusion
Include zts-report.py in the __brp_mangle_shebangs_exclude_from
to resolve build failures in Fedora 28 and newer.

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8020
Issue #7360
2018-10-15 16:59:28 -07:00
Matthew Ahrens 0aa5916a30 OpenZFS 9847 - leaking dd_clones (DMU_OT_DSL_CLONES) objects (#7979)
OpenZFS 9847 - leaking dd_clones (DMU_OT_DSL_CLONES) objects

We're leaking the dd_clones objects in dsl_dir_destroy_sync.  This bug
appears to have been around forever.  Thankfully the amount of space
typically involved is tiny.

In addition this adds a mechanism in ZDB to find objects in the MOS
which are leaked (not referenced anywhere).

Porting notes:
* Added dd_crypto_obj to ZDB MOS object leak tracking

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Matthew Ahrens <mahrens@delphix.com>

OpenZFS-issue: https://illumos.org/issues/9847
Closes #7979
2018-10-12 11:28:26 -07:00
Brian Behlendorf 27f80e85c2 Improved error handling for extreme rewinds
The vdev_checkpoint_sm_object(), vdev_obsolete_sm_object(), and
vdev_obsolete_counts_are_precise() functions assume that the
only way a zap_lookup() can fail is if the requested entry is
missing.  While this is the most common cause, it's not the only
cause.  Attempting to access a damaged ZAP will result in other
errors.

The most likely scenario for accessing a damaged ZAP is during
an extreme rewind pool import.  Under these conditions the pool
is expected to contain damaged objects and the import code was
updated to handle this gracefully.  Getting an ECKSUM error from
these ZAPs after the pool is imported is far less likely, therefore
the behavior for those call paths was not modified.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7809
Closes #7921
2018-10-12 11:24:04 -07:00
Brian Behlendorf d6c745830f Revert "Allow ECKSUM in vdev_checkpoint_sm_object()"
This reverts commit e927fc8a52.

Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7921
2018-10-12 11:24:04 -07:00
Tony Hutter 3c94dd7b7b Define timestruc_t for Lustre compatibility
Lustre 2.8 (and possibly other versions) are still using timestruc_t,
which was removed in spl-0.7.10 in favor of inode_timespec_t.  Add
in a backwards compatibility #define for timestruc_t so that Lustre
builds.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #8014
2018-10-12 11:13:34 -07:00
Matt Ahrens 5d43cc9a59 OpenZFS 9689 - zfs range lock code should not be zpl-specific
The ZFS range locking code in zfs_rlock.c/h depends on ZPL-specific
data structures, specifically znode_t.  However, it's also used by
the ZVOL code, which uses a "dummy" znode_t to pass to the range
locking code.

We should clean this up so that the range locking code is generic
and can be used equally by ZPL and ZVOL, and also can be used by
future consumers that may need to run in userland (libzpool) as
well as the kernel.

Porting notes:
* Added missing sys/avl.h include to sys/zfs_rlock.h.
* Removed 'dbuf is within the locked range' ASSERTs from dmu_sync().
  This was needed because ztest does not yet use a locked_range_t.
* Removed "Approved by:" tag requirement from OpenZFS commit
  check to prevent needless warnings when integrating changes
  which have not been merged to illumos.
* Reverted free_list range lock changes which were originally
  needed to defer the cv_destroy() which was called immediately
  after cv_broadcast().  With d2733258 this should be safe but
  if not we may need to reintroduce this logic.
* Reverts: The following two commits were reverted and squashed in
  to this change in order to make it easier to apply OpenZFS 9689.
  - d88895a0, which removed the dummy znode from zvol_state
  - e3a07cd0, which updated ztest to use range locks
* Preserved optimized rangelock comparison function.  Preserved the
  rangelock free list.  The cv_destroy() function will block waiting
  for all processes in cv_wait() to be scheduled and drop their
  reference.  This is done to ensure it's safe to free the condition
  variable.  However, blocking while holding the rl->rl_lock mutex
  can result in a deadlock on Linux.  A free list is introduced to
  defer the cv_destroy() and kmem_free() until after the mutex is
  released.
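
The free-list idea in the last porting note can be sketched in
Python as follows. This is an illustration under assumptions, not
the zfs_rlock.c code: `RangeLockTable`, `release`, and `_destroy`
are hypothetical names, and `_destroy` stands in for the
cv_destroy() plus kmem_free() pair:

```python
import threading


class RangeLockTable:
    def __init__(self):
        self.mutex = threading.Lock()  # stand-in for rl->rl_lock
        self.active = {}
        self.destroyed = []            # (entry, mutex_was_held) records

    def release(self, key) -> int:
        free_list = []
        with self.mutex:
            entry = self.active.pop(key, None)
            if entry is not None:
                # Only unlink under the mutex; defer the teardown.
                free_list.append(entry)
        # The mutex is released here, so a teardown that may block
        # cannot stall other threads that need the mutex.
        for entry in free_list:
            self._destroy(entry)
        return len(free_list)

    def _destroy(self, entry):
        # Record whether the mutex was held (single-threaded demo only).
        self.destroyed.append((entry, self.mutex.locked()))


table = RangeLockTable()
table.active["range-a"] = "lock-a"
freed = table.release("range-a")
freed_again = table.release("range-a")
```

The key point is that unlinking happens under the mutex while the
potentially blocking teardown happens only after it is dropped.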

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9689
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/680
External-issue: DLPX-58662
Closes #7980
2018-10-11 10:19:33 -07:00
Alek P 50a343d85c Fix changelist mounted-dataset iteration
Commit 0c6d093 caused a regression in the inherit codepath.
The fix is to restrict the changelist iteration on mountpoints and
add proper handling for 'legacy' mountpoints

Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #7988 
Closes #7991
2018-10-10 21:13:13 -07:00
Garrett Fields 5b3bfd86a4 Check scheduler for "noop" before setting "noop"
Originally the code only checked for the presence of
"/sys/block/$i/queue/scheduler".  "sh: write error: Invalid argument"
was produced when trying to set "noop" on certain devices (e.g. virtio)
where it isn't a listed option.  This modification continues to check
for the presence of "/sys/block/$i/queue/scheduler" and also checks that
it contains "noop" as an option before setting "noop".

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Garrett Fields <ghfields@gmail.com>
Closes #8004
2018-10-10 08:46:22 -07:00
Tony Hutter 2ef0f8c329 Print "(repairing)" in zpool status again
Historically, zpool status prints "(repairing)" for any drives that
have errors during a scrub:

        NAME            STATE     READ WRITE CKSUM
        mypool          ONLINE       0     0     0
          mirror-0      ONLINE       0     0     0
            /tmp/file1  ONLINE      13     0     0  (repairing)
            /tmp/file2  ONLINE       0     0     0
            /tmp/file3  ONLINE       0     0     0

This was accidentally broken in "OpenZFS 9166 - zfs storage pool
checkpoint" (d2734cc).  This patch adds it back in.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7779
Closes #7978
2018-10-09 20:30:32 -07:00
Paul Dagnelie 0391690583 Refactor dmu_recv into its own file
This change moves the bottom half of dmu_send.c (where the receive
logic is kept) into a new file, dmu_recv.c, and does similarly
for receive-related changes in header files.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7982
2018-10-09 14:05:13 -07:00
Brian Behlendorf 5e8ff25644 Fix arc_release() refcount
Update arc_release to use arc_buf_size().  This hunk was accidentally
dropped when porting compressed send/recv, 2aa34383b.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8000
2018-10-09 10:10:26 -07:00
Brian Behlendorf d7e4b30a67 Add zfs_refcount_transfer_ownership_many()
When debugging is enabled and a zfs_refcount_t contains multiple holders
using the same key, but different ref_counts, the wrong reference_t may
be transferred.  Add a zfs_refcount_transfer_ownership_many() function,
like the existing zfs_refcount_*_many() functions, to match and transfer
the correct refcount_t.

This issue may occur when using encryption with refcount debugging
enabled.  An arc_buf_hdr_t can have references for both the
hdr->b_l1hdr.b_pabd and hdr->b_crypt_hdr.b_rabd both of which use
the hdr as the reference holder.  When unsharing the buffer the
p_abd should be transferred.

This issue does not impact production builds because refcount holders
are not tracked.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7219
Closes #8000
2018-10-09 10:05:48 -07:00
Matthew Ahrens 4cbde2ecbf Create /proc/sys/kernel/spl/gitrev with git hash
The existing mechanisms for determining what code is running in the
kernel do not always correctly report the git hash.  The versions
reported there do not reflect changes made since `configure` was run
(i.e. incremental builds do not update the version) and they are
misleading if git tags are not set up properly.  This applies to
`modinfo zfs`, `dmesg`, and `/sys/module/zfs/version`.

There are complicated requirements on how the existing version is
generated.  Therefore we are leaving that alone, and adding a new
mechanism to record and retrieve the git hash:
`cat /proc/sys/kernel/spl/gitrev`

The gitrev is re-generated at compile time, when running `make`
(including for incremental builds).  The value is the output of `git
describe` (or "unknown" if not in a git repo or there are uncommitted
changes).

We're also removing /proc/sys/kernel/spl/version, which was never very
useful.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Tim Chase <tim@chase2k.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7931 
Closes #7965
2018-10-08 21:57:02 -07:00
Matthew Ahrens dfbe267503 OpenZFS 9617 - too-frequent TXG sync causes excessive write inflation
Porting notes:
* Renamed zfs_dirty_data_sync_pct to zfs_dirty_data_sync_percent and
  changed the type to be consistent with the other dirty module params.
* Updated zfs-module-parameters.5 accordingly.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9617
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7928f4ba
Closes #7976
2018-10-04 13:13:28 -07:00
Matthew Ahrens 58c0f374f1 Warn if checking programs are not installed
`make checkstyle` silently skips checks if the required programs are not
installed (e.g. shellcheck, mandoc).  Therefore developers may not
realize that they are not getting the full suite of code checks.  This
also applies to more specific targets like `make shellcheck`.

We should print a warning message when a check is skipped due to missing
tools.

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7984
2018-10-04 13:11:45 -07:00
Matthew Ahrens c23f8d4829 Add codecheck make target
We'd like to have tooling that verifies code style, while ignoring the
commit message.  For example, code does not need to be signed off in
order to be tested.  Current workarounds are to run `make checkstyle` and
ignore the commit message errors, or to run `make cstyle shellcheck
flake8 mancheck testscheck`, and make sure that list stays updated.

Solution is to add a new make target, `codecheck` which does all the
code checks.  `checkstyle` is now simply `codecheck` + `commitcheck`.

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7985
2018-10-04 13:10:10 -07:00
Paul Dagnelie 6e8b268875 Fix ASSERT macros to not over-expand
The code reuse in the definitions of the ASSERT and VERIFY macros results
in expansion of their arguments before they are stringified, which
produces ugly and undesirable output.
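
A small illustration of the preprocessor behavior involved.  The macro
names below are hypothetical and are not the ASSERT/VERIFY definitions
themselves; they only show that an argument passed through a helper
macro is expanded before it is stringified, while a direct `#x` keeps
the text as written.

```c
#include <string.h>

#define	LIMIT	42

/* Stringifies the argument exactly as written at the call site. */
#define	STR_DIRECT(x)	#x

/* Passing through a helper macro expands the argument first. */
#define	STR_VIA_HELPER(x)	STR_DIRECT(x)

/* Yields the source text of the expression. */
static const char *
direct_text(void)
{
	return (STR_DIRECT(n < LIMIT));
}

/* Yields the expression after LIMIT has been replaced by its value. */
static const char *
expanded_text(void)
{
	return (STR_VIA_HELPER(n < LIMIT));
}
```

With large macro arguments the expanded form can become very long,
which is the kind of output this change avoids.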

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7884
2018-10-03 20:16:45 -07:00
Paul Dagnelie 95542372e6 Add new fnvlist_lookup_* functions
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7977
2018-10-03 15:30:55 -07:00
Prakash Surya 54eb2c410e Verify 'zfs destroy' will unshare the dataset
This change adds a new test case to the zfs-test suite to verify that
when 'zfs destroy' is used on a shared dataset, the dataset will be
unshared after the destroy operation completes.

Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Closes #7941
2018-10-03 10:17:58 -07:00
Prakash Surya 1bf490ba93 Fix "zfs destroy" when "sharenfs=on" is used
When using "zfs destroy" on a dataset that is using "sharenfs=on" and
has been automatically exported (by libzfs), the dataset will not be
automatically unexported as it should be. This workflow appears to have
been broken by this commit: 3fd3e56cfd

In that change, the "zfs_unmount" function was modified to use the
"mnt.mnt_special" field when determining the mount point that is being
unmounted, rather than "mnt.mnt_mountp".

As a result, when "mntpt" is passed into "zfs_unshare_proto", its value
is now the dataset name rather than the mountpoint. Thus, when this
value is used with the "is_shared" function (via "zfs_unshare_proto") it
will not find a match (since that function assumes it'll be passed the
mountpoint) and incorrectly reports that the dataset is not shared.

This can be easily reproduced with the following commands:

    $ sudo zpool create tank xvdb
    $ sudo zfs create -o sharenfs=on tank/fish
    $ sudo zfs destroy tank/fish

    $ sudo zfs list -r tank
    NAME   USED  AVAIL  REFER  MOUNTPOINT
    tank  97.5K  7.27G    24K  /tank

    $ sudo exportfs
    /tank/fish      <world>
    $ sudo cat /etc/dfs/sharetab
    /tank/fish      -       nfs     rw,crossmnt

At this point, the "tank/fish" filesystem doesn't exist, but it's still
listed as exported when looking at "exportfs" and "/etc/dfs/sharetab".

Also note, this change brings us back in-sync with the illumos code, as
it pertains to this one line; on illumos, "mnt.mnt_mountp" is used.

Reviewed by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Issue #6143
Closes #7941
2018-10-03 10:17:58 -07:00
Brad Lewis c955398b52 OpenZFS 9677 - panic from zio_write_gang_block()
Panic from zio_write_gang_block() when creating dump device
on fragmented rpool.

Authored by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9677
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7341a7d
Closes #7975
2018-10-03 09:50:06 -07:00
Andrew Stormont 84ddd4b062 OpenZFS 9616 - Bogus error when attempting to set property on read-only pool
Authored by: Andrew Stormont <astormont@racktopsystems.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9616
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f62db44d
Closes #7974
2018-10-03 09:49:30 -07:00
Tom Caputi 52ce99dd61 Refcounted DSL Crypto Key Mappings
Since native ZFS encryption was merged, we have been fighting
against a series of bugs that come down to the same problem: Key
mappings (which must be present during all I/O operations) are
created and destroyed based on dataset ownership, but I/Os have
traditionally been allowed to "leak" into the next txg after
the dataset is disowned.

In the past we have attempted to solve this problem by trying to
ensure that datasets are disowned after all I/O is finished by
calling txg_wait_synced(), but we have repeatedly found edge cases
that need to be squashed and code paths that might incur a high
number of txg syncs. This patch attempts to resolve this issue
differently, by adding a reference to the key mapping for each txg
it is dirtied in. By doing so, we can remove many of the
unnecessary calls to txg_wait_synced() we have added in the past
and ensure we don't need to deal with this problem in the future.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7949
2018-10-03 09:47:11 -07:00
Jerry Jelinek f65fbee1e7 OpenZFS 9700 - ZFS resilvered mirror does not balance reads
Authored by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9700
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/82f63c3c
Closes #7973
2018-10-02 16:18:24 -07:00
Yuri Pankov cb110f254e OpenZFS 9763 - zfs(1M): broken formatting in allow/unallow description
Porting notes:
* Two of the three changes from the upstream patch were already
  applied for Linux.  Only the last one is required.

Authored by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9763
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8a702e55
Closes #7972
2018-10-02 16:12:54 -07:00
Alek P 0c6d09361d changelist should be able to iter on mounts
Modified changelist_gather()ing for the mountpoint property.
Now instead of iterating on all dataset descendants, we read
/proc/self/mounts and iterate on the mounted descendant datasets only.

Switched changelist implementation from a uu_list_* to uu_avl_* in
order to reduce the changelist code path's worst-case time complexity.

Reviewed by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #7967
2018-10-02 12:30:58 -07:00
Brian Behlendorf 838bd5ff35 ZTS: Fix snapshot_009_pos, snapshot_010_pos
Mitigate the likelihood of the newly created volumes being busy
when the 'zfs destroy -r' is issued by waiting for udev to settle.
Since this is not an ironclad fix, I've added the test case to
the known list of possible failures and referenced issue #7961.

Finally, fix the cleanup logic so that, if this test does fail,
subsequent tests won't incorrectly fail.

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7961 
Closes #7962
2018-10-01 17:15:57 -07:00
Tim Schumacher 424fd7c3e0 Prefix all refcount functions with zfs_
Recent changes in the Linux kernel made it necessary to prefix
the refcount_add() function with zfs_ due to a name collision.

To bring the other functions in line with that and to avoid future
collisions, prefix the other refcount functions as well.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Schumacher <timschumi@gmx.de>
Closes #7963
2018-10-01 10:42:05 -07:00
Matthew Ahrens fc23d59fa0 Remove duplicate macro in dsl_dir.h
The DD_FIELD_LAST_REMAP_TXG macro was added twice (with the same value).
This change removes one of them.

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7968
2018-10-01 10:40:11 -07:00
Brian Behlendorf 1258bd778e Refine split block reconstruction
Due to a flaw in 4589f3ae the number of unique combinations
could be calculated incorrectly.  This could result in the
random combinations reconstruction being used when it would
have been possible to check all combinations.

This change fixes the unique combinations calculation and
simplifies the reconstruction logic by maintaining a per-
segment list of unique copies.

The vdev_indirect_splits_damage() function was introduced
to validate both the enumeration and random reconstruction
logic with ztest.  It is implemented such that it will never
make a known recoverable block unrecoverable.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #6900 
Closes #7934
2018-10-01 10:36:34 -07:00
John Gallagher d12614521a Fixes for procfs files backed by linked lists
There are some issues with the way the seq_file interface is implemented
for kstats backed by linked lists (zfs_dbgmsgs and certain per-pool
debugging info):

* We don't account for the fact that seq_file sometimes visits a node
  multiple times, which results in missing messages when read through
  procfs.
* We don't keep separate state for each reader of a file, so concurrent
  readers will receive incorrect results.
* We don't account for the fact that entries may have been removed from
  the list between read syscalls, so reading from these files in procfs
  can cause the system to crash.

This change fixes these issues and adds procfs_list, a wrapper around a
linked list which abstracts away the details of implementing the
seq_file interface for a list and exposing the contents of the list
through procfs.

Reviewed by: Don Brady <don.brady@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
External-issue: LX-1211
Closes #7819
2018-09-26 11:08:12 -07:00
Gregor Kopka 3ed2fbcc1c Fix flake 8 style warnings
Ran zts-report.py and test-runner.py from ./tests/test-runner/bin/
through the 2to3 (https://docs.python.org/2/library/2to3.html).
Checked the result, fixed:
- 'maxint' -> 'maxsize' that 2to3 missed.
- 'cmp=' parameter for a 'sorted()' with a 'key=' version.
- try/except wrapping of configparser import as there are still
  python 2.7 systems that lack a compatibility shim

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gregor Kopka <gregor@kopka.net>
Closes #7925 
Closes #7952
2018-09-26 11:02:26 -07:00
Tim Schumacher c13060e478 Linux 4.19-rc3+ compat: Remove refcount_t compat
torvalds/linux@59b57717f ("blkcg: delay blkg destruction until
after writeback has finished") added a refcount_t to the blkcg
structure. Due to the refcount_t compatibility code, zfs_refcount_t
was used by mistake.

Resolve this by removing the compatibility code and replacing the
occurrences of refcount_t with zfs_refcount_t.

Reviewed-by: Franz Pletz <fpletz@fnordicwalking.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Schumacher <timschumi@gmx.de>
Closes #7885 
Closes #7932
2018-09-26 10:29:26 -07:00
Brian Behlendorf 7a23c81342 Fix small sysfs leak
When zfs_kobj_init() is called with an attr_cnt of 0 only the
kobj->zko_default_attrs is allocated.  It subsequently won't
get freed in zfs_kobj_release since the free is wrapped in
a kobj->zko_attr_count != 0 conditional.

Split the block in zfs_kobj_release() to make sure the
kobj->zko_default_attrs are freed in this case.

Additionally, fix a minor spelling mistake and typo in
zfs_kobj_init() which could also cause a leak but in practice
is almost certain not to fail.

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7957
2018-09-26 09:50:58 -07:00
Gregor Kopka b954e36e51 Zpool iostat: remove latency/queue scaling
Bandwidth and iops are average per second while *_wait are averages
per request for latency or, for queue depths, an instantaneous
measurement at the end of an interval (according to man zpool).

When calculating the first two it makes sense to do
x/interval_duration (x being the increase in total bytes or number of
requests over the duration of the interval, interval_duration in
seconds) to 'scale' from amount/interval_duration to amount/second.

But applying the same math for the latter (*_wait latencies/queue) is
wrong as there is no interval_duration component in the values (these
are time/requests to get to average_time/request or already an
absolute number).

This bug leads to the only correct continuous *_wait figures for both
latencies and queue depths from 'zpool iostat -l/q' being with
duration=1 as then the wrong math cancels itself (x/1 is a nop).

This removes temporal scaling from latency and queue depth figures.
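
The distinction can be sketched as follows.  The function names and
parameters are hypothetical and do not come from the zpool source; the
sketch only shows that counters are divided by the interval length to
obtain a per-second rate, while a per-request average has no time
component to divide out.

```c
#include <stdint.h>

/*
 * Bandwidth and IOPS: the counter increase over the interval is scaled
 * by the interval length to obtain an amount per second.
 */
static double
per_second(uint64_t delta, double interval_seconds)
{
	return ((double)delta / interval_seconds);
}

/*
 * Latency: total wait time divided by the number of requests.  The
 * interval length does not appear, so no temporal scaling is applied.
 */
static double
avg_wait_per_request(uint64_t delta_wait_ns, uint64_t delta_requests)
{
	if (delta_requests == 0)
		return (0.0);
	return ((double)delta_wait_ns / (double)delta_requests);
}
```

Dividing the second result by the interval again would shrink the
reported latency by the interval length, which only goes unnoticed when
the interval is 1 second.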

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gregor Kopka <gregor@kopka.net>
Closes #7945 
Closes #7694
2018-09-25 16:29:16 -07:00
Brian Behlendorf a7165d7255 Revert "Fix flake 8 style warnings"
This reverts commit b8fd4310c5 which
accidentally introduced a regression for some versions of python.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #7929
2018-09-24 17:20:42 -07:00
Brian Behlendorf e897a23eb1 Fix statfs(2) for 32-bit user space
When handling a 32-bit statfs() system call the returned fields,
although 64-bit in the kernel, must be limited to 32-bits or an
EOVERFLOW error will be returned.

This is less of an issue for block counts since the default
reported block size is 128KiB. But since it is possible to
set a smaller block size, these values will be scaled as
needed to fit in a 32-bit unsigned long.

Unlike most other filesystems the total possible file counts
are more likely to overflow because they are calculated based
on the available free space in the pool. In order to prevent
this the reported value must be capped at 2^32-1. This is
only for statfs(2) reporting, there are no changes to the
internal ZFS limits.
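
A rough sketch of the two adjustments described above, assuming the
block count is scaled by doubling the reported block size until the
count fits.  The type and function names are hypothetical and the
actual scaling used by the patch may differ.

```c
#include <stdint.h>

/* Hypothetical pair of values as they would be reported to a caller. */
typedef struct fs_size {
	uint64_t bsize;		/* reported block size in bytes */
	uint64_t blocks;	/* block count in units of bsize */
} fs_size_t;

/*
 * Halve the block count and double the block size until the count fits
 * in 32 bits.  Odd counts round down, so the total may shrink slightly.
 */
static fs_size_t
scale_blocks_32(uint64_t bsize, uint64_t blocks)
{
	fs_size_t out = { bsize, blocks };

	while (out.blocks > UINT32_MAX) {
		out.blocks >>= 1;
		out.bsize <<= 1;
	}
	return (out);
}

/* File counts cannot be rescaled, so they are capped at 2^32-1. */
static uint64_t
cap_files_32(uint64_t files)
{
	return (files > UINT32_MAX ? UINT32_MAX : files);
}
```

For example, 2^33 blocks of 512 bytes would be reported as 2^31 blocks
of 2048 bytes, which describes the same total size.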

Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #7927 
Closes #7122 
Closes #7937
2018-09-24 17:11:25 -07:00
LOLi 36e369ecb8 ZTS: Fix removal_resume_export
This change simplifies the test case by removing part of the logic which
was introducing a race condition and thus causing spurious failures: we use
attempt_during_removal() from removal.kshlib instead which has been
observed to be more stable.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7894 
Closes #7913
2018-09-24 12:58:16 -07:00
Gregor Kopka b8fd4310c5 Fix flake 8 style warnings
Ran zts-report.py and test-runner.py from ./tests/test-runner/bin/
through the 2to3 (https://docs.python.org/2/library/2to3.html).
Checked the result, fixed:
- 'maxint' -> 'maxsize' that 2to3 missed.
- 'cmp=' parameter for a 'sorted()' with a 'key=' version.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: John Wren Kennedy <jwk404@gmail.com>
Signed-off-by: Gregor Kopka <gregor@kopka.net>
Closes #7925 
Closes #7929
2018-09-24 10:12:59 -07:00
LOLi dda5500853 vdev_disk_error() prints ASCII SOH to debug log
Currently vdev_disk_error() prepends its messages sent to the internal
ZFS debug log with KERN_WARNING, which is currently defined as follows:

   #define KERN_SOH      "\001"
   #define KERN_WARNING  KERN_SOH "4"

Since "\001" (ASCII Start Of Header) is not printable this results in
weird characters displayed when inspecting the debug log. This commit
simply removes this superfluous prefix passed to zfs_dbgmsg().
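
To illustrate why the prefix shows up as a stray character, the sketch
below repeats the two definitions quoted above and relies on adjacent
string literal concatenation.  The helper functions and the message
text are hypothetical.

```c
#define	KERN_SOH	"\001"
#define	KERN_WARNING	KERN_SOH "4"

/* Message as it looked with the log-level prefix prepended. */
static const char *
with_prefix(void)
{
	return (KERN_WARNING "disk error");
}

/* Message without the prefix, as passed after this change. */
static const char *
without_prefix(void)
{
	return ("disk error");
}
```

The first string begins with byte 0x01 followed by '4'.  printk()
consumes those two bytes as a log level, but a plain debug log stores
them verbatim.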

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7936
2018-09-21 09:42:42 -07:00
DeHackEd 9d489ab3a8 Fix reference to zpool-features(5)
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: DHE <git@dehacked.net>
Closes #7938
2018-09-21 09:41:08 -07:00
LOLi 7522a26077 Add limits to spa_slop_shift tunable
This change adds limits to the possible spa_slop_shift values set via
the sysfs interface. Accepted values are from a minimum of 1 to a
maximum of 31 (inclusive): these limits are based on the following
values observed on a 128PB file-vdev test pool:

spa_slop_shift=1, spa_get_slop_space=63.5PiB
spa_slop_shift=2, spa_get_slop_space=31.8PiB
spa_slop_shift=3, spa_get_slop_space=15.9PiB
spa_slop_shift=4, spa_get_slop_space=7.9PiB
spa_slop_shift=5, spa_get_slop_space=4PiB
spa_slop_shift=6, spa_get_slop_space=2PiB
...
spa_slop_shift=25, spa_get_slop_space=4GiB
spa_slop_shift=26, spa_get_slop_space=2GiB
spa_slop_shift=27, spa_get_slop_space=1016MiB
spa_slop_shift=28, spa_get_slop_space=508MiB
spa_slop_shift=29, spa_get_slop_space=254MiB
spa_slop_shift=30, spa_get_slop_space=128MiB
spa_slop_shift=31, spa_get_slop_space=128MiB
spa_slop_shift=32, spa_get_slop_space=128MiB
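
The values above are consistent with a right shift of the pool's usable
space with a 128MiB floor.  The sketch below reproduces that arithmetic
under two assumptions inferred from the table rather than from the
source: roughly 127PiB of usable space and a fixed 128MiB minimum.  The
function name is hypothetical.

```c
#include <stdint.h>

/* Assumed minimum slop space, inferred from the table: 128MiB. */
#define	MIN_SLOP	(128ULL << 20)

/* Slop space as a fraction (1 / 2^shift) of the space, with a floor. */
static uint64_t
slop_space(uint64_t space, int shift)
{
	uint64_t slop = space >> shift;

	return (slop > MIN_SLOP ? slop : MIN_SLOP);
}
```

With 127PiB of space, a shift of 27 gives 1016MiB, and shifts of 30 and
above all land on the 128MiB floor, which is why larger values are not
useful.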

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7876 
Closes #7900
2018-09-20 21:10:12 -07:00
Richard Yao 145c88fb7b Add NEWS file
I received a request for a NEWS file. That needs to be handled by Tony
and Brian, but for now, we can at least provide one that provides a link
to github so that users who expect NEWS files from their packages will
know where to go for release information.

Reviewed-by: Neal Gompa <ngompa@datto.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #7918
2018-09-18 12:03:47 -07:00
LOLi e0b7ff46c9 zstreamdump dumps core printing truncated nvlist
This change prevents zstreamdump from crashing when trying to print
invalid nvlist data (DRR_BEGIN record) from a truncated send stream.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7917
2018-09-18 09:43:09 -07:00
DeHackEd 81155b296d Fix allocation_classes GUID in zpool-features(5)
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: DHE <git@dehacked.net>
Closes #7920
2018-09-18 08:59:47 -07:00
bunder2015 0ff15c2753 Add new wiki page to CONTRIBUTING
A new wiki page has been added for users who may be new to Git or
GitHub.  Adding link to CONTRIBUTING.

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Gregor Kopka <gregor@kopka.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #7923 
Closes #7915
2018-09-18 08:58:15 -07:00
Gregor Kopka 48b0b649fd Man page fixes - zpool/zfs optional parameters
The man pages for zpool and zfs (get command)
listed the pool/dataset parameter as required,
but these are optional. Fixed that.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gregor Kopka <gregor@kopka.net>
Closes #7916
2018-09-18 08:55:33 -07:00
Brian Behlendorf 2ced3cf0b2 Clarify 'zpool remove' restrictions
Update zpool(8) to clarify what type of vdevs may be safely
removed and that the existence of any top-level raidz device
which is part of the primary pool will prevent device removal.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7880 
Closes #7893
2018-09-17 17:28:18 -07:00
LOLi 5140a58f3b zpool should detect invalid fs property on create
This change improves the handling of invalid filesystem properties when
specified at pool creation: this is useful when 'zpool create -n'
(dry run) is executed to detect invalid fs-level options (-O) before
the actual command is run.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7620 
Closes #7878
2018-09-13 13:37:42 -07:00
Brian Behlendorf 92b432139d Add removal_resume_export to zts-report.py
Add the removal_resume_export test case to the possible failure
section of the zts-report.py and reference the Github issue.  In
the CI environment this test has proven to be unreliable due to
the way it detects the removal thread.  This is a flaw in the test
and not device removal so update the result summary accordingly.

Additionally, increase the allowed timeout in an effort to reduce
the observed rate of false positives.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7895
Issue #7894
2018-09-13 13:35:09 -07:00
Roman Strashkin 733b5722b4 zpool split can create a corrupted pool
Added vdev_resilver_needed() check to verify VDEVs are fully
synced, so that after split the new pool will not be corrupted.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Roman Strashkin <roman.strashkin@nexenta.com>
Closes #7865
Closes #7881
2018-09-12 18:14:42 -07:00
Brian Behlendorf b8a90418f3 Tag 0.8.0-rc1
Major new features:
- Native encryption
- Device removal
- Allocation classes
- Pool checkpoints
- Sequential scrub and resilver
- Project quota
- Channel programs
- Direct IO

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2018-09-07 09:35:09 -07:00
Don Brady 73a5ec30bf Fix in-kernel sysfs entries
The recent sysfs zfs properties feature breaks the in-kernel
builds of zfs (sans module).  When not built as a module add
the sysfs entries under /sys/fs/zfs/.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #7868 
Closes #7872
2018-09-06 21:44:52 -07:00
Don Brady e7b677aa5d Fix zfs_sysfs_live test failure
The ZTS zfs_sysfs_live test fails occasionally due to an uninitialized
string on an error path.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #7869
2018-09-06 17:36:00 -07:00
LOLi 0238a9755b Fix 'zfs allow' for create time permissions
When no permission set is defined for a dataset the create time
permissions are incorrectly shown as if they were a permission set.
This change simply corrects how allow permissions are displayed.

This commit also fixes a small manpage formatting issue and adds the
"zfs_allow_003_pos" test case to the ZFS Test Suite.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7519 
Closes #7860
2018-09-06 13:11:21 -07:00
Don Brady cc99f275a2 Pool allocation classes
Allocation Classes add the ability to have allocation classes in a
pool that are dedicated to serving specific block categories, such
as DDT data, metadata, and small file blocks. A pool can opt-in to
this feature by adding a 'special' or 'dedup' top-level VDEV.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Håkan Johansson <f96hajo@chalmers.se>
Reviewed-by: Andreas Dilger <andreas.dilger@chamcloud.com>
Reviewed-by: DHE <git@dehacked.net>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Gregor Kopka <gregor@kopka.net>
Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #5182
2018-09-05 18:33:36 -07:00
Chris Siebenmann cfa37548eb Correctly handle errors from kern_path
As a regular kernel function, kern_path() returns errors as negative
errnos, such as -ELOOP. zfsctl_snapdir_vget() must convert these into
the positive errnos used throughout the ZFS code when it returns them
to other ZFS functions so that the ZFS code properly sees them as
errors.
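
The sign convention described above can be illustrated with a minimal
user-space sketch. The helper names are hypothetical and only stand in
for kern_path() and its caller; this is not the code changed by the
commit.

```c
#include <errno.h>

/*
 * Hypothetical stand-in for a Linux kernel helper that reports failure
 * as a negative errno (for example -ELOOP) and success as 0.
 */
static int
sketch_kernel_lookup(int simulate_failure)
{
	return (simulate_failure ? -ELOOP : 0);
}

/*
 * Wrapper following the convention described above: ZFS-style callers
 * expect 0 on success and a positive errno on failure.
 */
static int
sketch_lookup_positive_errno(int simulate_failure)
{
	int error = sketch_kernel_lookup(simulate_failure);

	if (error < 0)
		error = -error;	/* -ELOOP becomes ELOOP */
	return (error);
}
```

A caller that tests the result against ELOOP, or simply for a positive
value, then sees the failure as intended.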

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Siebenmann <cks.git01@cs.toronto.edu>
Closes #7764
Closes #7864
2018-09-04 22:26:56 -07:00
Rich Ercolani 0405eeea6a Added recalculation of ARC stats mid-eviction
Re-adds a recalculation step for the ARC stats after the MRU
eviction so that we don't pathologically attempt to evict the MFU.

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Authored-by: Mark Johnston <markj@freebsd.org>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #7855
2018-09-04 22:15:14 -07:00
Brian Behlendorf 27ca030fa6 Revert "Update zfs_admin_snapshot default value (disabled)"
This reverts commit a6214a0ae9.
Disabling zfs_admin_snapshot by default results in multiple ZTS
tests failing which depend on this functionality.  Revert this
change until the relevant test cases can be updated.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #7838
2018-09-04 22:00:21 -07:00
George Melikov a6214a0ae9 Update zfs_admin_snapshot default value (disabled)
It's disabled by default, update code to reflect
the documentation.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Gregor Kopka <gregor@kopka.net>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #7835 
Closes #7838
2018-09-04 17:21:24 -07:00
mav c197a77c3c OpenZFS 9751 - Allocation throttling misplacing ditto blocks
Relax allocation throttling for ditto blocks.  Due to random imbalances
in allocation, throttling tends to push block copies to the one vdev that
looks slightly better at the moment.  A slightly less strict policy
improves both data security and, surprisingly, write performance, since we
don't need to touch extra metaslabs on each vdev to respect the min distance.

Sponsored by:	iXsystems, Inc.

Authored by: mav <mav@FreeBSD.org>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9751
FreeBSD-commit: https://github.com/freebsd/freebsd/commit/8253837ac3
Closes #7857
2018-09-02 12:22:45 -07:00
mav e38afd34c3 OpenZFS 9738 - Fix third block copy allocations, broken at 9112.
Use METASLAB_WEIGHT_CLAIM weight to allocate tertiary blocks.
The previous use of METASLAB_WEIGHT_SECONDARY for this caused errors
later in the metaslab_activate_allocator() call, leading to massive
loading of unneeded metaslabs and write freezes.

Authored by: mav <mav@FreeBSD.org>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9738
FreeBSD-commit: https://github.com/freebsd/freebsd/commit/63e7138
Closes #7858
2018-09-02 12:21:54 -07:00
Don Brady b83a0e2dc1 Add basic zfs ioc input nvpair validation
We want newer versions of libzfs_core to run against an existing
zfs kernel module (i.e. a deferred reboot or module reload after
an update).

Programmatically document, via a zfs_ioc_key_t, the valid arguments 
for the ioc commands that rely on nvpair input arguments (i.e. non 
legacy commands from libzfs_core). Automatically verify the expected 
pairs before dispatching a command.

This initial phase focuses on the non-legacy ioctls. A follow-on 
change can address the legacy ioctl input from the zfs_cmd_t.

The zfs_ioc_key_t for zfs_keys_channel_program looks like:

static const zfs_ioc_key_t zfs_keys_channel_program[] = {
       {"program",     DATA_TYPE_STRING,               0},
       {"arg",         DATA_TYPE_UNKNOWN,              0},
       {"sync",        DATA_TYPE_BOOLEAN_VALUE,        ZK_OPTIONAL},
       {"instrlimit",  DATA_TYPE_UINT64,               ZK_OPTIONAL},
       {"memlimit",    DATA_TYPE_UINT64,               ZK_OPTIONAL},
};

Introduce four input errors to identify specific input failures
(in addition to generic argument value errors like EINVAL, ERANGE, 
EBADF, and E2BIG).

ZFS_ERR_IOC_CMD_UNAVAIL the ioctl number is not supported by kernel
ZFS_ERR_IOC_ARG_UNAVAIL an input argument is not supported by kernel
ZFS_ERR_IOC_ARG_REQUIRED a required input argument is missing
ZFS_ERR_IOC_ARG_BADTYPE an input argument has an invalid type
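
As a rough illustration of the table-driven check described above, the
following sketch validates a list of (name, type) pairs against a key
table. All sk_* names are hypothetical, the pair list is a simplified
stand-in for an nvlist, and treating the "unknown" type as "any type is
accepted" is an assumption of this sketch. The return codes only mirror
the meaning of the ZFS_ERR_IOC_ARG_* errors; they are not their values.

```c
#include <stddef.h>
#include <string.h>

/* Simplified stand-ins for nvpair data types (hypothetical names). */
typedef enum {
	SK_TYPE_UNKNOWN,	/* in this sketch: any type is accepted */
	SK_TYPE_STRING,
	SK_TYPE_UINT64,
	SK_TYPE_BOOLEAN_VALUE
} sk_type_t;

#define	SK_OPTIONAL	0x1

typedef struct {
	const char *name;
	sk_type_t type;
	int flags;
} sk_key_t;

typedef struct {
	const char *name;
	sk_type_t type;
} sk_pair_t;

typedef enum {
	SK_OK,
	SK_ARG_UNAVAIL,		/* input argument not in the table */
	SK_ARG_REQUIRED,	/* required argument missing */
	SK_ARG_BADTYPE		/* argument present with the wrong type */
} sk_err_t;

/* Mirrors the shape of the zfs_keys_channel_program table above. */
static const sk_key_t sk_keys[] = {
	{"program",	SK_TYPE_STRING,		0},
	{"arg",		SK_TYPE_UNKNOWN,	0},
	{"sync",	SK_TYPE_BOOLEAN_VALUE,	SK_OPTIONAL},
	{"instrlimit",	SK_TYPE_UINT64,		SK_OPTIONAL},
	{"memlimit",	SK_TYPE_UINT64,		SK_OPTIONAL},
};
#define	SK_NKEYS	(sizeof (sk_keys) / sizeof (sk_keys[0]))

static const sk_key_t *
sk_find_key(const sk_key_t *keys, size_t nkeys, const char *name)
{
	for (size_t i = 0; i < nkeys; i++) {
		if (strcmp(keys[i].name, name) == 0)
			return (&keys[i]);
	}
	return (NULL);
}

static sk_err_t
sk_check_input(const sk_key_t *keys, size_t nkeys,
    const sk_pair_t *in, size_t nin)
{
	/* Every supplied pair must be known and have the expected type. */
	for (size_t i = 0; i < nin; i++) {
		const sk_key_t *k = sk_find_key(keys, nkeys, in[i].name);

		if (k == NULL)
			return (SK_ARG_UNAVAIL);
		if (k->type != SK_TYPE_UNKNOWN && k->type != in[i].type)
			return (SK_ARG_BADTYPE);
	}

	/* Every key not marked optional must have been supplied. */
	for (size_t i = 0; i < nkeys; i++) {
		int found = 0;

		if (keys[i].flags & SK_OPTIONAL)
			continue;
		for (size_t j = 0; j < nin; j++) {
			if (strcmp(keys[i].name, in[j].name) == 0)
				found = 1;
		}
		if (!found)
			return (SK_ARG_REQUIRED);
	}
	return (SK_OK);
}
```

Running the check before dispatch lets a mismatch between a newer
library and an older module surface as a specific error rather than a
generic EINVAL.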

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #7780
2018-09-02 12:14:01 -07:00
Don Brady e8bcb693d6 Add zfs module feature and property info to sysfs
This extends our sysfs '/sys/module/zfs' entry to include feature 
and property attributes. The primary consumer of this information 
is user processes, like the zfs CLI, that need to know what the 
current loaded ZFS module supports. The libzfs binary will consult 
this information when instantiating the zfs and zpool property 
tables and the pool features table.

This introduces 4 kernel objects (dirs) into '/sys/module/zfs'
with corresponding attributes (files):
  features.runtime
  features.pool
  properties.dataset
  properties.pool

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #7706
2018-09-02 12:09:53 -07:00
Brian Behlendorf bb91178e60 ZTS: Fix EBUSY volume destroy failures
It's possible for an unrelated process, like blkid, to have the
volume open when 'zfs destroy' is run.  Switch the cleanup functions
to the destroy_dataset() helper which handles this case by retrying
the destroy when the dataset is busy.  This was done not only for
volumes but also for file systems for consistency.

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7854
2018-08-31 15:30:44 -07:00
Brian Behlendorf e927fc8a52 Allow ECKSUM in vdev_checkpoint_sm_object()
The checkpoint space map object may not be accessible from the
vdev's ZAP when it has been damaged.  This may be the case when
performing an extreme rewind when importing the pool.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7809
Closes #7853
2018-08-31 14:20:34 -07:00
Matthew Ahrens adb726eb0e clean up __dbuf_hold_impl
We can simplify the dbuf_hold code by allocating dbuf_hold_arg_t's on
demand, rather than allocating a big array of them up front.  While this
can occasionally increase the number of allocations, typically only one
allocation is needed since the indirect block is already cached.

The performance test suite gets the same results with this change.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7841
2018-08-31 10:16:54 -07:00
bunder2015 9e7fb6c171 ZTS: pool_checkpoint path cleanup
Removing hardcoded paths in pool_checkpoint.kshlib

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #7840
2018-08-30 14:45:16 -07:00
Richard Elling 6fa1e1e73a ZTS: Fix DEV_DSKDIR trim from disk
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com>
Closes #7848
2018-08-30 14:43:37 -07:00
bunder2015 de61daa597 ZTS: zvol_swap_003 path cleanup
Removing hardcoded paths in zvol_swap_003

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #7839
2018-08-30 13:53:06 -07:00
bernie1995 0fe7c953b3 ZTS: path cleanup
Removing hardcoded paths in many scripts.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bernie1995 <bernie.pikes@gmail.com>
Issue #7507 
Closes #7843
2018-08-30 13:46:55 -07:00
Brian Behlendorf 6c6949acae ZTS: Fix zfs_create_013_pos
It's possible for an unrelated process, like blkid, to have the
volume open when 'zfs destroy' is run.  Switch the cleanup function
to the destroy_dataset() helper which handles this case by retrying
the destroy when the dataset is busy.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7847
2018-08-30 13:38:09 -07:00
Tom Caputi c3bd3fb4ac OpenZFS 9403 - assertion failed in arc_buf_destroy()
Assertion failed in arc_buf_destroy() when concurrently reading
block with checksum error.

Porting notes:
* The ability to zinject decompression errors has been added, but
  this only works at the zio_decompress() level, where we have all
  of the info we need to match against the user's zinject options.
* The decompress_fault test has been added to test the new zinject
  functionality
* We attempted to set zio_decompress_fail_fraction to (1 << 18) in
  ztest for further test coverage. Although this did uncover a few
  low priority issues, this unfortunately also causes ztest to
  ASSERT in many locations where the code is working correctly since
  it is designed to fail on IO errors. Developers can manually set
  this variable with the '-o' option to find and debug issues.

Authored by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Matt Ahrens <mahrens@delphix.com>
Ported-by: Tom Caputi <tcaputi@datto.com>

OpenZFS-issue: https://illumos.org/issues/9403
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/fa98e487a9
Closes #7822
2018-08-29 11:33:33 -07:00
Tom Caputi 47ab01a18f Always wait for txg sync when umounting dataset
Currently, when unmounting a filesystem, ZFS will only wait for
a txg sync if the dataset is dirty and not readonly. However, this
can be problematic in cases where a dataset is remounted readonly
immediately before being unmounted, which often happens when the
system is being shut down. Since encrypted datasets require that
all I/O is completed before the dataset is disowned, this issue
causes problems when write I/Os leak into the txgs after the
dataset is disowned, which can happen when sync=disabled.

While looking into fixes for this issue, it was discovered that
dsl_dataset_is_dirty() does not return B_TRUE when the dataset has
been removed from the txg dirty datasets list, but has not actually
been processed yet. Furthermore, the implementation is completely
different from dmu_objset_is_dirty(), adding to the confusion.
Rather than relying on this function, this patch forces the umount
code path (and the remount readonly code path) to always perform a
txg sync on read-write datasets and removes the function altogether.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7753
Closes #7795
2018-08-27 10:16:28 -07:00
Tom Caputi 8c4fb36a24 Small rework of txg_list code
This patch simply adds some missing locking to the txg_list
functions and refactors txg_verify() so that it is only compiled
in for debug builds.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7795
2018-08-27 10:16:01 -07:00
Brian Behlendorf a584ef2605 Direct IO support
Direct IO via the O_DIRECT flag was originally introduced in XFS by
IRIX for database workloads. Its purpose was to allow the database
to bypass the page and buffer caches to prevent unnecessary IO
operations (e.g.  readahead) while preventing contention for system
memory between the database and kernel caches.

On Illumos, there is a library function called directio(3C) that
allows user space to provide a hint to the file system that Direct IO
is useful, but the file system is free to ignore it. The semantics
are also entirely a file system decision. Those that do not
implement it return ENOTTY.

Since the semantics were never defined in any standard, O_DIRECT is
implemented such that it conforms to the behavior described in the
Linux open(2) man page as follows.

    1.  Minimize cache effects of the I/O.

    By design the ARC is already scan-resistant which helps mitigate
    the need for special O_DIRECT handling.  Data which is only
    accessed once will be the first to be evicted from the cache.
    This behavior is consistent with Illumos and FreeBSD.

    Future performance work may wish to investigate the benefits of
    immediately evicting data from the cache which has been read or
    written with the O_DIRECT flag.  Functionally this behavior is
    very similar to applying the 'primarycache=metadata' property
    per open file.

    2. O_DIRECT _MAY_ impose restrictions on IO alignment and length.

    No additional alignment or length restrictions are imposed.

    3. O_DIRECT _MAY_ perform unbuffered IO operations directly
       between user memory and block device.

    No unbuffered IO operations are currently supported.  In order
    to support features such as transparent compression, encryption,
    and checksumming a copy must be made to transform the data.

    4. O_DIRECT _MAY_ imply O_DSYNC (XFS).

    O_DIRECT does not imply O_DSYNC for ZFS.  Callers must provide
    O_DSYNC to request synchronous semantics.

    5. O_DIRECT _MAY_ disable file locking that serializes IO
       operations.  Applications should avoid mixing O_DIRECT
       and normal IO or mmap(2) IO to the same file.  This is
       particularly true for overlapping regions.

    All I/O in ZFS is locked for correctness and this locking is not
    disabled by O_DIRECT.  However, concurrently mixing O_DIRECT,
    mmap(2), and normal I/O on the same file is not recommended.
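
From an application's point of view, item 4 above means a caller that
wants synchronous semantics has to ask for them explicitly. A minimal
sketch, assuming a Linux-style <fcntl.h>; the helper names are
hypothetical, and the fallback define only keeps the sketch compiling
where O_DIRECT is not exposed:

```c
#define	_GNU_SOURCE	/* glibc exposes O_DIRECT only with this set */
#include <fcntl.h>

#ifndef O_DIRECT
#define	O_DIRECT	0	/* not exposed here; falls back to buffered IO */
#endif

/* Flags for a writer that wants both Direct IO and synchronous writes. */
static int
sketch_direct_sync_flags(void)
{
	return (O_WRONLY | O_CREAT | O_DIRECT | O_DSYNC);
}

/* Open path for writing; returns a file descriptor or -1 with errno set. */
static int
sketch_open_direct_sync(const char *path)
{
	return (open(path, sketch_direct_sync_flags(), 0644));
}
```

Dropping O_DSYNC from the flags keeps the O_DIRECT hint but, per the
description above, leaves the writes asynchronous.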

This change is implemented by layering the aops->direct_IO operations
on the existing AIO operations.  Code already existed in ZFS on Linux
for bypassing the page cache when O_DIRECT is specified.

References:
  * http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch02s09.html
  * https://blogs.oracle.com/roch/entry/zfs_and_directio
  * https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics
  * https://illumos.org/man/3c/directio

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #224 
Closes #7823
2018-08-27 10:04:21 -07:00
Brian Behlendorf 5097b4e425 Remove %changelog from spec file
Remove the %changelog section from the spec files since it does
not get updated in the master branch.  Not only does this mean
the information is stale, but it can result in 'make deb' failing
to build packages, issue #7825.  This section should be updated
for tagged releases.

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7825
Closes #7827
2018-08-26 12:59:33 -07:00
Joao Carlos Mendes Luis 5d6ad2442b Fedora 28: Fix misc bounds check compiler warnings
Fix a bunch of truncation compiler warnings that show up
on Fedora 28 (GCC 8.0.1).

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #7368 
Closes #7826 
Closes #7830
2018-08-26 12:55:44 -07:00
LOLi 644e01a268 Fix libaio-devel requirement for Debian-based distributions
BuildRequires tags for "-devel" packages in the RPM spec file do not
work when building on Debian-based distributions.

Fix this issue by making this requirement conditional on RPM-based
distributions.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7829 
Closes #7831
2018-08-26 12:43:27 -07:00
Brian Behlendorf 55972a6724 Add libaio-devel BuildRequires
The zfs-test package needs a build requirement on the libaio-devel
package.  Without it ./configure will correctly determine that
mmap_libaio cannot be built and it will be skipped.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7821 
Closes #7824
2018-08-23 09:34:34 -07:00
LOLi c434d8806c Stack overflow when destroying deeply nested clones
Destroy operations on deeply nested chains of clones can overflow
the stack:

        Depth    Size   Location    (221 entries)
        -----    ----   --------
  0)    15664      48   mutex_lock+0x5/0x30
  1)    15616       8   mutex_lock+0x5/0x30
...
 26)    13576      72   dsl_dataset_remove_clones_key.isra.4+0x124/0x1e0 [zfs]
 27)    13504      72   dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs]
 28)    13432      72   dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs]
...
185)     2128      72   dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs]
186)     2056      72   dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs]
187)     1984      72   dsl_dataset_remove_clones_key.isra.4+0x18a/0x1e0 [zfs]
188)     1912     136   dsl_destroy_snapshot_sync_impl+0x4e0/0x1090 [zfs]
189)     1776      16   dsl_destroy_snapshot_check+0x0/0x90 [zfs]
...
218)      304     128   kthread+0xdf/0x100
219)      176      48   ret_from_fork+0x22/0x40
220)      128     128   kthread+0x0/0x100

Fix this issue by converting dsl_dataset_remove_clones_key() from
recursive to iterative.
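
The general technique can be sketched as follows: instead of recursing
once per level of nesting, pending nodes are kept on a heap-allocated
worklist so that call-stack usage stays constant. The node layout and
sk_* names are hypothetical, and this is not necessarily how
dsl_dataset_remove_clones_key() itself was restructured.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical tree node: a dataset with a list of clones. */
typedef struct sk_node {
	struct sk_node *first_child;
	struct sk_node *next_sibling;
} sk_node_t;

/*
 * Visit every node reachable from root without recursion.
 * Returns the number of nodes visited, or 0 on allocation failure.
 */
static size_t
sk_visit_iterative(const sk_node_t *root)
{
	size_t cap = 16, top = 0, visited = 0;
	const sk_node_t **stack = malloc(cap * sizeof (*stack));

	if (stack == NULL)
		return (0);
	stack[top++] = root;

	while (top > 0) {
		const sk_node_t *n = stack[--top];

		visited++;	/* per-node work would go here */
		for (const sk_node_t *c = n->first_child; c != NULL;
		    c = c->next_sibling) {
			if (top == cap) {
				/* Grow the worklist on the heap. */
				const sk_node_t **grown = realloc(stack,
				    2 * cap * sizeof (*stack));
				if (grown == NULL) {
					free(stack);
					return (0);
				}
				stack = grown;
				cap *= 2;
			}
			stack[top++] = c;
		}
	}
	free(stack);
	return (visited);
}
```

With this shape, a chain of clones that is thousands of levels deep
costs heap memory proportional to the pending work rather than one
stack frame per level.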

Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7279 
Closes #7810
2018-08-22 11:03:31 -07:00
Rich Ercolani e8a8208eef Added metadata/dnode cache info to arc_summary
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #7815
2018-08-22 09:35:20 -07:00
Tim Chase 2711b1d05f s/VERIFY/VERIFY3S in vdev_checkpoint_sm_object
Using VERIFY3S allows the unexpected error value to be viewed in the
system log.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #7809 
Closes #7818
2018-08-21 16:08:14 -07:00
Tom Caputi 149ce888bb Fix issues with raw receive_write_byref()
This patch fixes 2 issues with raw, deduplicated send streams. The
first is that datasets that had been completely received earlier in
the stream were not still marked as raw receives. This caused
problems when newly received datasets attempted to fetch raw data
from these datasets without this flag set.

The second problem was that the arc freeze checksum code was not
consistent about which locks needed to be held while performing
its asserts. The proper locking needed to run these asserts is
actually fairly nuanced, since the asserts touch the linked list
of buffers (requiring the header lock), the arc_state (requiring
the b_evict_lock), and the b_freeze_cksum (requiring the
b_freeze_lock). This seems like a large performance sacrifice and
a lot of unneeded complexity to verify that this relatively small
debug feature is working as intended, so this patch simply removes
these asserts instead.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7701
2018-08-20 11:03:56 -07:00
LOLi c962fd6c4e pyzfs: add missing libzfs_core functions
This change adds the following libzfs_core functions to pyzfs:
lzc_remap, lzc_pool_checkpoint, lzc_pool_checkpoint_discard

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7793 
Closes #7800
2018-08-20 10:11:52 -07:00
Olaf Faaland 34fe773e30 Skip import activity test in more zdb code paths
Since zdb opens the pools read-only, it cannot damage the pool in the
event the pool is already imported either on the same host or on
another one.

If the pool vdev structure is changing while zdb is importing the
pool, it may cause zdb to crash.  However this is unlikely, and in any
case it's a user space process and can simply be run again.

For this reason, zdb should disable the multihost activity test on
import that is normally run.

This commit fixes a few zdb code paths where that had been overlooked.
It also adds tests to ensure that several common use cases handle this
properly in the future.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Gu Zheng <guzheng2331314@163.com>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7797 
Closes #7801
2018-08-20 10:05:23 -07:00
DeHackEd edc05fdb34 Don't modify argv[] in user tools
argv[] gets modified during string parsing for input arguments. This
is reflected in the live process listing. Don't do that.
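
One way to avoid writing to argv[] is to parse a private copy of the
argument. A minimal sketch, assuming a "name=value" style argument; the
helper name is hypothetical and not taken from the patch:

```c
#include <stdlib.h>
#include <string.h>

/*
 * Split "name=value" without writing to arg.  On success, *name and
 * *value point into one heap copy that the caller releases with
 * free(*name).  Returns 0 on success, -1 on error.
 */
static int
sk_split_pair(const char *arg, char **name, char **value)
{
	size_t len = strlen(arg) + 1;
	char *copy = malloc(len);
	char *eq;

	if (copy == NULL)
		return (-1);
	memcpy(copy, arg, len);

	eq = strchr(copy, '=');
	if (eq == NULL) {
		free(copy);
		return (-1);
	}
	*eq = '\0';		/* terminate the name in the copy only */
	*name = copy;
	*value = eq + 1;
	return (0);
}
```

Because the terminator is written into the copy, the original string,
and therefore the process listing, is left untouched.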

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes #7760
2018-08-20 09:55:18 -07:00
Serapheim Dimitropoulos a448a2557e Introduce read/write kstats per dataset
The following patch introduces a few statistics on reads and writes
grouped by dataset. These statistics are implemented as kstats
(backed by aggregate sums for performance) and can be retrieved by
using the dataset objset ID number. The motivation for this change is
to provide some preliminary analytics on dataset usage/performance.

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #7705
2018-08-20 09:52:37 -07:00
bunder2015 fa84714abb ZTS: events path cleanup
Removing hardcoded paths in events.cfg

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #7805
2018-08-18 21:19:41 -07:00
bunder2015 80d45e089c ZTS: largest_pool_001 path cleanup
Removing hardcoded paths in largest_pool_001

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #7804
2018-08-18 21:18:31 -07:00
bunder2015 5468ee7a2f ZTS: privilege group path cleanup
Removing hardcoded paths in privilege group tests

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #7803
2018-08-18 21:17:22 -07:00
Brian Behlendorf 089b16f48d ZTS: Fix import_cache_device_replaced
Allow the 'zpool replace' to run slowly without overwhelming the vdev
queues by setting zfs_scan_vdev_limit=128k.  This limits the number of
concurrent slow IOs which need to be handled.  The net effect is the
test case runs approximately 3x faster putting it well under the 10
minute per-test time limit.

Rename import_cache* test cases to import_cachefile*.  Originally
these were renamed due to a maximum tar name limit; this limit was
removed by commit 1dfde3d9b.

Replaced instances of /var/tmp in zpool_import.cfg with $TEST_BASE_DIR.

Reviewed-by: bunder2015 <omfgbunder@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7765 
Closes #7802
2018-08-18 21:16:12 -07:00
LOLi a9d6270acb 'zfs holds' scripted mode is not documented
This change simply documents the existing "scripted mode" option in
both command help and man page.

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7798
2018-08-18 15:47:41 -07:00
LOLi 1f87313ac8 Fix arcstat.py handling of unsupported options
This change allows the arcstat.py script to handle unsupported options
gracefully and print both error and usage messages when one such option
is provided.

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7799
2018-08-18 13:10:36 -07:00
Brian Behlendorf 802715b74a ZTS: Fix reservation_001_pos
It's possible for an unrelated process, like blkid, to have the
volume open when 'zfs destroy' is run.  Switch the cleanup function
to the destroy_dataset() helper which handles this case by retrying
the destroy when the dataset is busy.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7796
2018-08-17 10:01:47 -07:00
Brian Behlendorf 4338c5c06f Fix traverse_impl() kmem leak
The error path must free the memory allocated by this function or
it will be leaked.  In practice, this would leak only a few bytes
of memory under rare circumstances and thus is unlikely to have
caused any real problems.  This issue was caught by kmemleak.

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7791
2018-08-15 09:53:44 -07:00
Brian Behlendorf 1dfde3d9b2 Use posix format for dist tarballs
Traditionally Automake has defaulted to the V7 tar format when
creating tarballs for distributions.  One of the many limitations
of this format is a 99 character maximum path + file name limit.
This can cause problems when adding new test cases to the ZTS
due to the depth of the sub-tree and descriptive test names.

This change switches the build system to the posix (aliased as
pax) tar format which conforms to the POSIX.1-2001 specification.
This format does not suffer from the V7 limitations, was designed
to be compatible, and will become the default format in future
versions of GNU tar.

https://www.gnu.org/software/tar/manual/html_chapter/tar_8.html

As part of this change the blockfiles directories which were
originally removed due to this limit have been readded.

Reviewed by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7767
2018-08-15 09:52:28 -07:00
Tom Caputi 1fff937a4c Check encrypted dataset + embedded recv earlier
This patch fixes a bug where attempting to receive a send stream
with embedded data into an encrypted dataset would not clean up
that dataset when the error was reached. The check was moved into
dmu_recv_begin_check(), preventing this issue.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7650
2018-08-15 09:49:19 -07:00
Tom Caputi d9c460a0b6 Added encryption support for zfs recv -o / -x
One small integration that was absent from b52563 was
support for zfs recv -o / -x with regards to encryption
parameters. The main use cases of this are as follows:

* Receiving an unencrypted stream as encrypted without
  needing to create a "dummy" encrypted parent so that
  encryption can be inherited.

* Allowing users to change their keylocation on receive,
  so long as the receiving dataset is an encryption root.

* Allowing users to explicitly exclude or override the
  encryption property from an unencrypted properties stream,
  allowing it to be received as encrypted.

* Receiving a recursive hierarchy of unencrypted datasets,
  encrypting the top-level one and forcing all children to
  inherit the encryption.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7650
2018-08-15 09:48:49 -07:00
Tomohiro Kusumi fe8a7982ca Fix comment on calculating blkid
Fix comment on calculating blkid at level n within dnode's blkptrs.
"(2^(level*(indblkshift - SPA_BLKPTRSHIFT)" is part of divisor
in this division.
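
As a worked illustration of that division, the sketch below computes a
level-n block ID as the offset divided by the data block size times
2^(level * (indblkshift - SPA_BLKPTRSHIFT)). It assumes power-of-two
block sizes expressed as shifts and a block pointer shift of 7
(128-byte block pointers); the function and macro names are
hypothetical.

```c
#include <stdint.h>

#define	SK_BLKPTRSHIFT	7	/* assumption: 128-byte block pointers */

/*
 * Block ID covering byte offset at the given indirection level.
 * Dividing by (1 << datablkshift) * 2^(level * (indblkshift - 7))
 * is done as a single right shift.
 */
static uint64_t
sk_blkid_at_level(uint64_t offset, int datablkshift, int indblkshift,
    int level)
{
	int shift = datablkshift + level * (indblkshift - SK_BLKPTRSHIFT);

	return (shift >= 64 ? 0 : offset >> shift);
}
```

For example, with 128K data and indirect blocks (both shifts 17), an
offset of 1 GiB falls in level-0 block 8192 and level-1 block 8, since
each level-1 block covers 1024 level-0 blocks.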

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #7768
2018-08-13 13:33:47 -07:00
bunder2015 64e96969a8 ZTS: delegate group path cleanup
Removing hardcoded paths in delegate group tests

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #7778
2018-08-13 08:24:02 -07:00
bunder2015 d02fa81c78 ZTS: acl group path cleanup
Removing hardcoded paths in acl group tests

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #7777
2018-08-13 08:22:41 -07:00
bunder2015 604016054e ZTS: inuse_004 path cleanup
Removing hardcoded path in inuse_004

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #7775
2018-08-13 08:16:55 -07:00
bunder2015 5fba582887 ZTS: projectquota_002 path cleanup
Removing hardcoded path in projectquota_002

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #7774
2018-08-13 08:15:53 -07:00
Olaf Faaland 5200237711 MMP should not suspend pool in ztest
When running ztest, never suspend the pool due to failed or delayed
MMP writes.

There are many sources of long delays within ztest, such as device
opens, closes, etc. which in combination, may delay MMP writes too
long and cause MMP to suspend the pool.

Some of these delays also affect real pools, and should be fixed.
That is being worked on separately.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7776
2018-08-13 08:12:25 -07:00
Brian Behlendorf 94b197a0a5 ZTS: Test case reliability
* Both cli_root/zpool_import/import_cache_device_replaced, and
  redundancy/redundancy_004_neg have been observed to fail for
  spurious reasons ~1% of the time.  Add them to the exception
  list and reference the open Github issue.

* Speed up replacement/replacement_001_pos to prevent it from
  exceeding the 10 minute per test limit and getting KILLED.
  File vdev creation switched to truncate -s, redundant raidz1
  testing pass dropped, fixed some minor formatting issues.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7766
2018-08-12 09:38:53 -07:00
LOLi c8c308362c Allow inherited properties in zfs_check_settable()
This change modifies how 'checksum' and 'dedup' properties are verified
in zfs_check_settable() handling the case where they are explicitly
inherited in the dataset hierarchy when receiving a recursive send
stream.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7755 
Closes #7576 
Closes #7757
2018-08-03 14:56:25 -07:00
Don Brady fc1ecd16d7 zfs_ioc_unload_key can drop extra spa ref
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #7759
2018-08-03 14:50:51 -07:00
Brian Behlendorf 6da0998f59 ZTS: Fix zfs_create_007_pos
It's possible for an unrelated process, like blkid, to have the
volume open when 'zfs destroy' is run.  Switch the cleanup function
to the destroy_dataset() helper which handles this case by retrying
the destroy when the dataset is busy.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7763
2018-08-03 10:21:50 -07:00
Matthew Ahrens 62840030a7 Reduce taskq and context-switch cost of zio pipe
When doing a read from disk, ZFS creates 3 ZIO's: a zio_null(), the
logical zio_read(), and then a physical zio. Currently, each of these
results in a separate taskq_dispatch(zio_execute).

On high-read-iops workloads, this causes a significant performance
impact. By processing all 3 ZIO's in a single taskq entry, we reduce the
overhead on taskq locking and context switching.  We accomplish this by
allowing zio_done() to return a "next zio to execute" to zio_execute().
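
As a rough illustration only, the "return the next zio" idea can be
sketched as a loop that keeps executing in the calling context instead
of dispatching again.  The types and function names below are
hypothetical stand-ins and are far simpler than the real zio pipeline.

```c
#include <stddef.h>

/* Hypothetical stand-in for a zio; not the real zio_t. */
typedef struct sketch_zio {
	struct sketch_zio *parent;	/* next zio to run once this one is done */
	int executed;			/* set when the zio has been processed */
} sketch_zio_t;

/* Finish one zio and hand back the next zio to execute, or NULL. */
static sketch_zio_t *
sketch_zio_done(sketch_zio_t *zio)
{
	zio->executed = 1;
	return (zio->parent);
}

/*
 * Run a zio and everything it hands back in the calling context.
 * Returns the number of zios processed by this single dispatch.
 */
static int
sketch_zio_execute(sketch_zio_t *zio)
{
	int processed = 0;

	while (zio != NULL) {
		zio = sketch_zio_done(zio);
		processed++;
	}
	return (processed);
}
```

With a chain of physical -> logical -> null zios, one call to
sketch_zio_execute() on the physical zio processes all three.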

This results in a ~12% performance increase for random reads, from
96,000 iops to 108,000 iops (with recordsize=8k, on SSD's).

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-59292
Closes #7736
2018-08-02 15:51:45 -07:00
John Gallagher 499b5497cb Add missing checks to zpl_xattr_* functions
Linux specific zpl_* entry points, such as xattrs, must include
the same unmounted and sa handle checks as the common zfs_ entry
points. The additional ZPL_* wrappers are identical to their
ZFS_ counterparts except the errno is negated since they are
expected to be used at the zpl_ layer.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #5866 
Closes #7761
2018-08-02 14:03:56 -07:00
Nathan Lewis 010d12474c Add support for selecting encryption backend
- Add two new module parameters to icp (icp_aes_impl, icp_gcm_impl)
  that control the crypto implementation.  At the moment there is a
  choice between generic and aesni (on platforms that support it).
- This enables support for AES-NI and PCLMULQDQ-NI on AMD Family
  15h (bulldozer) and newer CPUs (zen).
- Modify aes_key_t to track what implementation it was generated
  with as key schedules generated with various implementations
  are not necessarily interchangeable.

Reviewed by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Nathaniel R. Lewis <linux.robotdude@gmail.com>
Closes #7102 
Closes #7103
2018-08-02 11:59:24 -07:00
George Wilson 3d503a76e8 Fix OpenZFS 9337 mismerge
This change reintroduces logic required by OpenZFS 9577. When
OpenZFS 9337 ("zfs get all is slow due to uncached metadata") was
merged, it removed logic required by OpenZFS 9577 ("remove
zfs_dbuf_evict_key") and inadvertently reintroduced the bug that
9577 was designed to fix.

This change re-enables the "evicting" flag to dbuf_rele_and_unlock
and dnode_rele_and_unlock and updates all callers to provide the
correct parameter.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Closes #7758
2018-08-02 10:21:48 -07:00
Rohan Puri fd7265c646 Fix deadlock between zfs umount & snapentry_expire
zfs umount -> zfsctl_destroy() takes the zfs_snapshot_lock as a
writer and calls zfsctl_snapshot_unmount_cancel(), which waits
for snapentry_expire() if present (when snap is automounted).
This snapentry_expire() itself then waits for zfs_snapshot_lock
as a reader, resulting in a deadlock.

The fix is to only hold the zfs_snapshot_lock over the tree
lookup and removal.  After a successful lookup the lock can
be dropped and zfs_snapentry_t will remain valid until the
reference taken by the lookup is released.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rohan Puri <rohan.puri15@gmail.com>
Closes #7751
Closes #7752
2018-08-01 15:00:02 -07:00
Paul Dagnelie 492f64e941 OpenZFS 9112 - Improve allocation performance on high-end systems
Overview
========

We parallelize the allocation process by creating the concept of
"allocators". There are a certain number of allocators per metaslab
group, defined by the value of a tunable at pool open time.  Each
allocator for a given metaslab group has up to 2 active metaslabs; one
"primary", and one "secondary". The primary and secondary weight mean
the same thing they did in the pre-allocator world; primary metaslabs
are used for most allocations, secondary metaslabs are used for ditto
blocks being allocated in the same metaslab group.  There is also the
CLAIM weight, which has been separated out from the other weights, but
that is less important to understanding the patch.  The active metaslabs
for each allocator are moved from their normal place in the metaslab
tree for the group to the back of the tree. This way, they will not be
selected for use by other allocators searching for new metaslabs unless
all the passive metaslabs are unsuitable for allocations.  If that does
happen, the allocators will "steal" from each other to ensure that IOs
don't fail until there is truly no space left to perform allocations.

In addition, the alloc queue for each metaslab group has been broken
into a separate queue for each allocator. We don't want to dramatically
increase the number of inflight IOs on low-end systems, because it can
significantly increase txg times. On the other hand, we want to ensure
that there are enough IOs for each allocator to allow for good
coalescing before sending the IOs to the disk.  As a result, we take a
compromise path; each allocator's alloc queue max depth starts at a
certain value for every txg. Every time an IO completes, we increase the
max depth. This should hopefully provide a good balance between the two
failure modes, while not dramatically increasing complexity.
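
A minimal sketch of the per-allocator depth policy described above,
assuming the cap grows by one per completed IO (the step size and all
names here are illustrative assumptions, not taken from the patch):

```c
#include <stdint.h>

/* Hypothetical per-allocator queue state; field names are illustrative. */
typedef struct sketch_alloc_queue {
	uint64_t max_depth;	/* current cap on in-flight allocating IOs */
	uint64_t inflight;	/* IOs currently admitted */
} sketch_alloc_queue_t;

/* At the start of every txg, reset the cap to its initial value. */
static void
sketch_txg_start(sketch_alloc_queue_t *q, uint64_t initial_depth)
{
	q->max_depth = initial_depth;
}

/* Admit a new IO only while the queue is below its current cap. */
static int
sketch_try_admit(sketch_alloc_queue_t *q)
{
	if (q->inflight >= q->max_depth)
		return (0);
	q->inflight++;
	return (1);
}

/* Each completed IO frees a slot and raises the cap (by one, assumed). */
static void
sketch_io_done(sketch_alloc_queue_t *q)
{
	q->inflight--;
	q->max_depth++;
}
```

A slow device completes few IOs per txg, so its cap stays near the
initial value, while a fast device quickly earns a deeper queue.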

We also parallelize the spa_alloc_tree and spa_alloc_lock, which cause
very similar contention when selecting IOs to allocate. This
parallelization uses the same allocator scheme as metaslab selection.

Performance Results
===================

Performance improvements from this change can vary significantly based
on the number of CPUs in the system, whether or not the system has a
NUMA architecture, the speed of the drives, the values for the various
tunables, and the workload being performed. For an fio async sequential
write workload on a 24 core NUMA system with 256 GB of RAM and eight 128 GB
SSDs, there is a roughly 25% performance improvement.

Future Work
===========

Analysis of the performance of the system with this patch applied shows
that a significant new bottleneck is the vdev disk queues, which also
need to be parallelized.  Prototyping of this change has occurred, and
there was a performance improvement, but more work needs to be done
before its stability has been verified and it is ready to be upstreamed.

Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Alexander Motin <mav@FreeBSD.org>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>

Porting Notes:
* Fix reservation test failures by increasing tolerance.

OpenZFS-issue: https://illumos.org/issues/9112
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3f3cc3c3
Closes #7682
2018-07-31 10:52:33 -07:00
Brian Behlendorf 3905caceaf Add missing zfs-dracut RPM dependencies
The zfs-dracut package requires the hostid, basename, head, awk,
and grep utilities be installed.  The first three are provided by
coreutils but additional dependencies are required for awk and grep.

Reviewed-by: Manuel Amador (Rudd-O) <rudd-o@rudd-o.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7729 
Closes #7747
2018-07-31 10:17:44 -07:00
Antonio Russo 9b9d1adc38 Use zfs-import.target in contrib/dracut
The new zfs-import.target should be used in place of the
zfs-import-*.service units.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Manuel Amador (Rudd-O) <rudd-o@rudd-o.com>
Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com>
Closes #6964
2018-07-31 10:15:41 -07:00
Don Brady dae3e9ea21 OpenZFS 9465 - ARC check for 'anon_size > arc_c/2' can stall the system
In the case of one pool being built on another pool, we want
to make sure we don't end up throttling the lower (backing)
pool when the upper pool is the majority contributor to dirty
data. To ensure we make forward progress during throttling, we
also check the current pool's net dirty data and only throttle
if it exceeds zfs_arc_pool_dirty_percent of the anonymous dirty
data in the cache.
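
The per-pool part of the check can be sketched as below.  This is only
an illustration: the helper name and parameters are hypothetical, and
the other conditions of the real throttle are left out.

```c
#include <stdint.h>

/*
 * Hypothetical helper: returns 1 when this pool's anonymous dirty data
 * exceeds pool_dirty_percent of all anonymous dirty data in the cache.
 */
static int
sketch_pool_should_throttle(uint64_t pool_dirty_anon, uint64_t anon_size,
    uint64_t pool_dirty_percent)
{
	return (pool_dirty_anon > anon_size * pool_dirty_percent / 100);
}
```

A backing pool that holds only a small share of the anonymous dirty
data is therefore not throttled on behalf of the pool layered above it.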

Authored by: Don Brady <don.brady@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting Notes:
* The new global variables zfs_arc_dirty_limit_percent,
  zfs_arc_anon_limit_percent, and zfs_arc_pool_dirty_percent
  were intentionally not added as tunable module parameters.

OpenZFS-issue: https://illumos.org/issues/9465
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d6a4c3ef
Closes #7749
2018-07-30 11:30:41 -07:00
Serapheim Dimitropoulos 6b64382b17 OpenZFS 9580 - Add a hash-table on top of nvlist to speed-up operations
= Motivation

While dealing with another performance issue (see 126118f) we noticed
that we spend a lot of time in various places in the kernel when
constructing long nvlists. The problem is that when an nvlist is created
with the NV_UNIQUE_NAME set (which is the case most of the time), we do
a linear search through the whole list to ensure uniqueness for every
entry we add.

An example of the above scenario can be seen in the following
flamegraph, where more than half of the time in zfsdev_ioctl() is spent
on constructing nvlists.  Flamegraph:
https://sdimitro.github.io/img/flame/sdimitro_snap_unmount3.svg

Adding a table to speed up lookups will help situations where we just
construct an nvlist (like the scenario above), in addition to regular
lookups and removals.

= What this patch does

In this diff we've implemented a hash-table on top of the nvlist code
that converts most nvlist operations from O(# of entries) to
O(1)* (the star is for amortized time as the hash-table grows and
shrinks depending on the # of entries - plain lookup is strictly O(1)).
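
To illustrate the general technique (not the actual nvlist code), here
is a simplified sketch of a name index layered over an insertion-ordered
list.  All names are hypothetical, the table has a fixed size instead of
resizing, and a duplicate name is updated in place.

```c
#include <stdlib.h>
#include <string.h>

#define	SKETCH_BUCKETS	64	/* fixed size; a real table would resize */

/* Hypothetical name/value pair that sits on both the list and a bucket. */
typedef struct sketch_pair {
	const char *name;		/* not copied; caller keeps it alive */
	int value;
	struct sketch_pair *list_next;	/* insertion-ordered list */
	struct sketch_pair *hash_next;	/* bucket chain */
} sketch_pair_t;

typedef struct sketch_list {
	sketch_pair_t *head, *tail;
	sketch_pair_t *buckets[SKETCH_BUCKETS];
} sketch_list_t;

/* Simple string hash (djb2); any reasonable hash would do here. */
static unsigned
sketch_hash(const char *s)
{
	unsigned h = 5381;

	while (*s != '\0')
		h = h * 33 + (unsigned char)*s++;
	return (h % SKETCH_BUCKETS);
}

/* Look a name up through its bucket instead of walking the whole list. */
static sketch_pair_t *
sketch_lookup(sketch_list_t *l, const char *name)
{
	sketch_pair_t *p = l->buckets[sketch_hash(name)];

	while (p != NULL && strcmp(p->name, name) != 0)
		p = p->hash_next;
	return (p);
}

/*
 * Add a pair while keeping names unique: an existing entry is updated
 * in place, otherwise a new one is appended.  Returns 0 on success.
 */
static int
sketch_add(sketch_list_t *l, const char *name, int value)
{
	sketch_pair_t *p = sketch_lookup(l, name);
	unsigned b = sketch_hash(name);

	if (p != NULL) {
		p->value = value;
		return (0);
	}
	if ((p = calloc(1, sizeof (*p))) == NULL)
		return (-1);
	p->name = name;
	p->value = value;
	p->hash_next = l->buckets[b];
	l->buckets[b] = p;
	if (l->tail != NULL)
		l->tail->list_next = p;
	else
		l->head = p;
	l->tail = p;
	return (0);
}
```

The uniqueness check in sketch_add() touches one bucket rather than
every entry, which is where the savings for long lists come from.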

= Performance Analysis

To analyze the performance improvement I just used the setup from the
snapshot deletion issue mentioned above in the Motivation section.
Basically I created 10K filesystems with one snapshot each and then I
just used the API of libZFS_Core to pass down an nvlist of all the
snapshots to have them deleted. The reason I used my own driver program
was to have clean performance results of what actually happens in the
kernel. The flamegraphs and wall clock times mentioned below were
gathered from the start to the end of the driver program's run. Between
trials the testpool used was completely destroyed, the system was
rebooted and the testpool was completely recreated. The reason for this
dance was to get consistent results.

== Results (before patch):

=== Sampling Flamegraphs

[Trial 1] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A.svg
[Trial 2] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A2.svg
[Trial 3] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A3.svg

=== Wall clock times (in seconds)

```
[Trial 4]
real        5.3
user        0.4
sys         2.3

[Trial 5]
real        8.2
user        0.4
sys         2.4

[Trial 6]
real        6.0
user        0.5
sys         2.3
```

== Results (after patch):

=== Sampling Flamegraphs

[Trial 1] https://sdimitro.github.io/img/flame/DLPX-53417/trial-Ae.svg
[Trial 2] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A2e.svg
[Trial 3] https://sdimitro.github.io/img/flame/DLPX-53417/trial-A3e.svg

=== Wall clock times (in seconds)

```
[Trial 4]
real        4.9
user        0.0
sys         0.9

[Trial 5]
real        3.8
user        0.0
sys         0.9

[Trial 6]
real        3.6
user        0.0
sys         0.9
```

== Analysis

The results between the trials are consistent so in this section I will
only talk about the flamegraph results from trial-1 and the wall-clock
results from trial-4.

From trial-1 we can see that zfsdev_ioctl() goes from 2,331 to 996
sample counts.  Specifically, the samples from fnvlist_add_nvlist() and
spa_history_log_nvl() are almost gone (~500 & ~800 to 5 & 5 samples),
leaving zfs_ioc_destroy_snaps() to dominate most samples from
zfsdev_ioctl().

From trial-4 we see that the user time dropped to 0 seconds. I believe
the consistent 0.4 seconds before my patch was applied was due to my
driver program constructing the long nvlist of snapshots so it can pass
it to the kernel. As for the system time, the effect there is more clear
(2.3 down to 0.9 seconds).

Porting Notes:
* DATA_TYPE_DONTCARE case added to switch in fm_nvprintr() and
  zpool_do_events_nvprint().

Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9580
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b5eca7b1
Closes #7748
2018-07-30 11:30:03 -07:00
Matthew Ahrens 1897bc0d48 OpenZFS 9439 - ZFS double-free due to failure to dirty indirect block
Follow up commit for OpenZFS 9438.  See the OpenZFS-issue link below
for a complete analysis.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9439
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/779220d
External-issue: DLPX-46861
Closes #7746
2018-07-30 09:28:09 -07:00
Paul Dagnelie 21d48b5eac OpenZFS 9438 - Holes can lose birth time info if a block has a mix of birth times
As reported by https://github.com/zfsonlinux/zfs/issues/4996, there is
yet another hole birth issue. In this one, if a block is entirely holes,
but the birth times are not all the same, we lose that information by
creating one hole with the current txg as its birth time.

The ZoL PR's fix approach is incorrect. Ultimately, the problem here is
that when you truncate and write a file in the same transaction group,
the dbuf for the indirect block will be zeroed out to deal with the
truncation, and then written for the write. During this process, we will
lose hole birth time information for any holes in the range. In the case
where a dnode is being freed, we need to determine whether the block
should be converted to a higher-level hole in the zio pipeline, and if
so do it when the dnode is being synced out.

Porting Notes:
* The DMU_OBJECT_END change in zfs_znode.c was already applied.
* Added test cases from #5675 provided by @rincebrain for hole_birth
  issues.  These test cases should be pushed upstream to OpenZFS.
* Updated mk_files which is used by several rsend tests so the
  files created are a little more interesting and may contain holes.

Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9438
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/738e2a3c
External-issue: DLPX-46861
Closes #7746
2018-07-30 09:27:49 -07:00
Brian Behlendorf b719768e35 ZTS: Fix reservation_017_pos
It's possible for an unrelated process, like blkid, to have the
volume open when 'zfs destroy' is run.  Switch the cleanup function
to the destroy_dataset() helper which handles this case by retrying
the destroy when the dataset is busy.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7750
2018-07-30 09:23:45 -07:00
Brian Behlendorf 11d0525cbb Add rwsem_tryupgrade for 4.9.20-rt16 kernel
The RT rwsem implementation was changed to allow multiple readers
as of the 4.9.20-rt16 patch set.  This results in a build failure
because the existing implementation was forced to directly access
the rwsem structure which has changed.

While this could be accommodated by adding additional compatibility
code, this patch resolves the build issue by simply assuming the
rwsem can never be upgraded.  This functionality is a performance
optimization and all callers must already handle this case.

Converting the last remaining use of __SPIN_LOCK_UNLOCKED to
spin_lock_init() was additionally required to get a clean build.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7589
2018-07-30 09:22:30 -07:00
George Diamantopoulos fb7307b892 Fix initramfs missing systemd binaries
Systemd binaries necessary for mounting an encrypted root dataset
weren't copied to initramfs generated by dracut. This patch fixes
this and copies these binaries unconditionally, that is
regardless of whether native ZFS encryption is used for the
root dataset.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Diamantopoulos <georgediam@gmail.com>
Closes #7607
Closes #7719
2018-07-27 09:29:43 -07:00
Toomas Soome 5fadb7fb0c OpenZFS 8906 - uts: illumos rootfs should support salted cksum
Porting notes:
* As of grub-2.02 these checksums are not supported.  However, as
  pointed out in #6501 there are alternatives such as EFISTUB which
  work and have no such restriction.  A warning was added to the
  checksum property section of the zfs.8 man page.

Authored by: Toomas Soome <tsoome@me.com>
Reviewed by: C Fraire <cfraire@me.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/8906
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7dec52f
Closes #6501
Closes #7714
2018-07-27 08:35:28 -07:00
Matthew Ahrens 3a549dc7a1 OpenZFS 9442 - decrease indirect block size of spacemaps
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Albert Lee <trisk@forkgnu.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: George Melikov <mail@gmelikov.ru>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Updates to indirect blocks of spacemaps can contribute significantly to
write inflation.  Therefore we want to reduce the indirect block size of
spacemaps from 128K to 16K.
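
The effect can be illustrated with some simple arithmetic, using the
128-byte block pointer size.  The helper names are hypothetical and the
figures are logical sizes before compression.

```c
#include <stdint.h>

#define	SKETCH_BLKPTR_SIZE	128	/* bytes per block pointer */

/* Number of block pointers that fit in one indirect block. */
static uint64_t
sketch_ptrs_per_indirect(uint64_t indirect_bytes)
{
	return (indirect_bytes / SKETCH_BLKPTR_SIZE);
}

/*
 * Logical bytes rewritten when `dirty` block pointers are updated and,
 * in the worst case, each one lives in a different indirect block.
 */
static uint64_t
sketch_worst_case_rewrite(uint64_t indirect_bytes, uint64_t dirty)
{
	return (indirect_bytes * dirty);
}
```

A 128K indirect block holds 1024 pointers and a 16K one holds 128, so
in this worst case the smaller block size rewrites one eighth as much
indirect data for the same set of updates.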

Porting notes:
* Refactored to allow the dmu_object_alloc(), dmu_object_alloc_ibs()
  and dmu_object_alloc_dnsize() functions to use a common shared
  dmu_object_alloc_impl() function.

OpenZFS-issue: https://www.illumos.org/issues/9442
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0c2e6408b
Closes #7712
2018-07-25 14:11:35 -07:00
Brian Behlendorf e106a7bacb ZTS: Add reservation_008_pos exception
The reservation_008_pos test case has been observed to fail in
a non-dangerous way in approximately 5% of automated test runs.
Add the test case to the list of possible expected failures
until the test case can be made perfectly reliable.

Reviewed by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #7741 
Closes #7742
2018-07-25 10:15:39 -07:00
Feng Sun 750e1f88d3 Introduce kstat dmu_tx_dirty_frees_delay
It is helpful when tuning zfs_per_txg_dirty_frees_percent from commit
539d33c7 (OpenZFS 6569 - large file delete can starve out write ops).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Feng Sun <loyou85@gmail.com>
Closes #7718
2018-07-25 09:52:27 -07:00
sara hartse 473c976a0c OpenZFS 9457 - libzfs_import.c:add_config() has a memory leak
A memory leak occurs on lines 209 and 213 because the config is not
freed in the error case.  The interface to add_config() seems less than
ideal - it would be better if it copied any data necessary from the
config and the caller freed it.

Porting notes:
* This issue had already been resolved on Linux by adding the missing
  calls to nvlist_free().  But we'll adopt the upstream fix to keep
  the behavior of the code consistent.

Authored by: Sara Hartse <sara.hartse@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9457
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/be86bb8a
Closes #7713
2018-07-24 17:12:06 -07:00
Matthew Ahrens 802b1a7b3b OpenZFS 9338 - moved dnode has incorrect dn_next_type
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

While investigating a different problem, I noticed that moved dnodes
(those processed by dnode_move_impl() via kmem_move()) have an incorrect
dn_next_type. This could cause the on-disk dn_type to be changed to an
invalid value. The fix is to copy the dn_next_type in dnode_move_impl().

Porting notes:
* For the moment this potential issue cannot occur on Linux since
  the SPL does not provide the kmem_move() functionality.

OpenZFS-issue: https://illumos.org/issues/9338
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0717e6f13
Closes #7715
2018-07-24 17:10:42 -07:00
Tom Caputi b7ddeaef3d Refactor arc_hdr_realloc_crypt()
The arc_hdr_realloc_crypt() function is responsible for converting
a "full" arc header to an extended "crypt" header and vice versa.
This code was originally written with a bcopy() so that any new
members added to arc headers would automatically be included
without requiring a code change. However, in practice this (along
with small differences in kmem_cache implementations between
various platforms) has caused a number of hard-to-find problems in
ports to other operating systems. This patch solves this problem
by making all member copies explicit and adding ASSERTs for fields
that cannot be set during the transfer. It also manually resets the
old header after the reallocation is finished so it can be properly
reallocated and reused.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7711
2018-07-24 12:20:04 -07:00
Steven Noonan 863522b1f9 dsl_scan_scrub_cb: don't double-account non-embedded blocks
We were doing count_block() twice inside this function, once
unconditionally at the beginning (intended to catch the embedded block
case) and once near the end after processing the block.

The double-accounting caused the "zpool scrub" progress statistics in
"zpool status" to climb from 0% to 200% instead of 0% to 100%, and
showed double the I/O rate it was actually seeing.
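
As an illustration of the accounting invariant (one count per block),
and not necessarily how this patch restructures the function, a
simplified sketch with hypothetical names:

```c
#include <stdint.h>

/* Hypothetical scan statistics; only the byte counter is modelled. */
typedef struct sketch_scan_stats {
	uint64_t examined;	/* bytes counted as scanned so far */
} sketch_scan_stats_t;

static void
sketch_count_block(sketch_scan_stats_t *st, uint64_t size)
{
	st->examined += size;
}

/* Count each block exactly once, embedded or not. */
static void
sketch_scrub_cb(sketch_scan_stats_t *st, uint64_t size, int embedded)
{
	if (embedded) {
		/* Embedded blocks need no I/O; account for them and stop. */
		sketch_count_block(st, size);
		return;
	}
	/* ... the scrub read would be issued here ... */
	sketch_count_block(st, size);
}

/* Progress as a percentage of the total bytes to scan. */
static uint64_t
sketch_percent_done(const sketch_scan_stats_t *st, uint64_t total)
{
	return (st->examined * 100 / total);
}
```

With every block counted once, the reported progress tops out at 100%;
an extra unconditional count at the top of the callback would double it.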

This was apparently a regression introduced in commit 00c405b4b5,
which was an incorrect port of this OpenZFS commit:

    https://github.com/openzfs/openzfs/commit/d8a447a7

Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Steven Noonan <steven@uplinklabs.net>
Closes #7720 
Closes #7738
2018-07-24 09:33:56 -07:00
Matthew Ahrens fda0f16217 PR's should provide motivation & context first
It's often necessary to understand why a change is made, before
understanding the exact changes that are made.  Context provides
background, which by definition is necessary to understand prior to the
substance of the Pull Request.

Change the PR template to request "Motivation and Context" first, before
"Description".

Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7737
2018-07-23 22:08:18 -07:00
Brian Behlendorf d441e85dd7 Add support for autoexpand property
While the autoexpand property may seem like a small feature it
depends on a significant amount of system infrastructure.  Enough
of that infrastructure is now in place that with a few modifications
for Linux it can be supported.

Auto-expand works as follows: when a block device is modified
(re-sized, closed after being open r/w, etc) a change uevent is
generated for udev.  The ZED, which is monitoring udev events,
passes the change event along to zfs_deliver_dle() if the disk
or partition contains a zfs_member as identified by blkid.

From here the device is matched against all imported pool vdevs
using the vdev_guid which was read from the label by blkid.  If
a match is found the ZED reopens the pool vdev.  This re-opening
is important because it allows the vdev to be briefly closed so
the disk partition table can be re-read.  Otherwise, it wouldn't
be possible to report the maximum possible expansion size.

Finally, if the autoexpand property is set to on, a vdev expansion
will be attempted.  After performing some sanity checks on the disk
to verify that it is safe to expand, the primary partition (-part1)
will be expanded and the partition table updated.  The partition
is then re-opened (again) to detect the updated size which allows
the new capacity to be used.

In order to make all of the above possible the following changes
were required:

* Updated the zpool_expand_001_pos and zpool_expand_003_pos tests.
  These tests now create a pool which is layered on a loopback,
  scsi_debug, and file vdev.  This allows for testing of a non-
  partitioned block device (loopback), a partitioned block device
  (scsi_debug), and a file which does not receive udev change
  events.  This provides better test coverage, and by removing
  the layering on ZFS volumes the issues surrounding layering
  one pool on another are avoided.

* zpool_find_vdev_by_physpath() updated to accept a vdev guid.
  This allows for matching by guid rather than path which is a
  more reliable way for the ZED to reference a vdev.

* Fixed zfs_zevent_wait() signal handling which could result
  in the ZED spinning when a signal was not handled.

* Removed vdev_disk_rrpart() functionality which can be abandoned
  in favor of the kernel-provided blkdev_reread_part() function.

* Added a rwlock which is held as a writer while a disk is being
  reopened.  This is important to prevent errors from occurring
  for any configuration related IOs which bypass the SCL_ZIO lock.
  The zpool_reopen_007_pos.ksh test case was added to verify IO
  errors are never observed when reopening.  This is not expected
  to impact IO performance.

Additional fixes, which aren't critical, were discovered and
resolved in the course of developing this functionality.

* Added PHYS_PATH="/dev/zvol/dataset" to the vdev configuration for
  ZFS volumes.  This is as good as a unique physical path.  While the
  volumes are no longer used in the test cases for other reasons,
  this improvement was still included.

Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #120
Closes #2437
Closes #5771
Closes #7366
Closes #7582
Closes #7629
2018-07-23 15:40:15 -07:00
Matthew Ahrens 2e5dc449c1 OpenZFS 9337 - zfs get all is slow due to uncached metadata
This project's goal is to make read-heavy channel programs and zfs(1m)
administrative commands faster by caching all the metadata that they will
need in the dbuf layer. This will prevent the data from being evicted, so
that any future call to e.g. zfs get all won't have to go to disk (very
much). There are two parts:

* The dbuf_metadata_cache. We identify what to put into the cache based
  on the object type of each dbuf.
* Caching objset properties
  os_{version,normalization,utf8only,casesensitivity} in the objset_t.
  The reason these needed to be cached is that although they are queried
  frequently, they aren't stored in a dbuf type which we can easily
  recognize and cache in the dbuf layer; instead, we have to explicitly
  store them. There's already existing infrastructure for maintaining
  cached properties in the objset setup code, so I simply used that.

Performance Testing:

 - Disabled kmem_flags
 - Tuned dbuf_cache_max_bytes very low (128K)
 - Tuned zfs_arc_max very low (64M)

Created test pool with 400 filesystems, and 100 snapshots per filesystem.
Later on in testing, added 600 more filesystems (with no snapshots) to make
sure scaling didn't look different between snapshots and filesystems.

Results:

    | Test                   | Time (trunk / diff) | I/Os (trunk / diff) |
    +------------------------+---------------------+---------------------+
    | zpool import           |     0:05 / 0:06     |    12.9k / 12.9k    |
    | zfs get all (uncached) |     1:36 / 0:53     |    16.7k / 5.7k     |
    | zfs get all (cached)   |     1:36 / 0:51     |    16.0k / 6.0k     |

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>

OpenZFS-issue: https://illumos.org/issues/9337
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7dec52f
Closes #7668
2018-07-12 10:49:27 -07:00
Don Brady e4e94ca315 OpenZFS 9426 - metaslab size can exceed offset addressable by spacemap
Authored by: Don Brady <don.brady@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@joyent.com>

OpenZFS-issue: https://www.illumos.org/issues/9426
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f1c88afb1
Closes #7700
2018-07-11 15:55:48 -07:00
Andriy Gapon e902ddb0f8 OpenZFS 9479 - fix wrong format specifier for vdev_id
Authored by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>

OpenZFS-issue: https://www.illumos.org/issues/9479
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/20aa447c
Closes #7699
2018-07-11 15:53:02 -07:00
Brian Behlendorf ac09630d8b Fix zpl_mount() deadlock
Commit 93b43af10 inadvertently introduced the following scenario which
can result in a deadlock.  This issue was most easily reproduced by
LXD containers using a ZFS storage backend but should be reproducible
under any workload which is frequently mounting and unmounting.

-- THREAD A --
spa_sync()
  spa_sync_upgrades()
    rrw_enter(&dp->dp_config_rwlock, RW_WRITER, FTAG); <- Waiting on B

-- THREAD B --
mount_fs()
  zpl_mount()
    zpl_mount_impl()
      dmu_objset_hold()
        dmu_objset_hold_flags()
          dsl_pool_hold()
            dsl_pool_config_enter()
              rrw_enter(&dp->dp_config_rwlock, RW_READER, tag);
    sget()
      sget_userns()
        grab_super()
          down_write(&s->s_umount); <- Waiting on C

-- THREAD C --
cleanup_mnt()
  deactivate_super()
    down_write(&s->s_umount);
    deactivate_locked_super()
      zpl_kill_sb()
        kill_anon_super()
          generic_shutdown_super()
            sync_filesystem()
              zpl_sync_fs()
                zfs_sync()
                  zil_commit()
                    txg_wait_synced() <- Waiting on A

Reviewed by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7598 
Closes #7659 
Closes #7691 
Closes #7693
2018-07-11 15:49:10 -07:00
Brian Behlendorf 33a19e0fd9 Fix kernel unaligned access on sparc64
Update the SA_COPY_DATA macro to check if architecture supports
efficient unaligned memory accesses at compile time.  Otherwise
fallback to using the sa_copy_data() function.

The kernel provided CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS is
used to determine availability in kernel space.  In user space
the x86_64, x86, powerpc, and sometimes arm architectures will
define the HAVE_EFFICIENT_UNALIGNED_ACCESS macro.
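
The compile-time selection described above can be sketched roughly as
follows.  This is only an illustration: load_u64() is a hypothetical helper,
not the real SA_COPY_DATA macro, and the fast path relies on the platform
tolerating a misaligned load.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not the real SA_COPY_DATA macro): read a 64-bit
 * value from an address that may not be 8-byte aligned.
 */
static inline uint64_t
load_u64(const void *src)
{
#if defined(HAVE_EFFICIENT_UNALIGNED_ACCESS)
	/* Fast path: the platform is known to tolerate a direct load. */
	return (*(const uint64_t *)src);
#else
	/* Portable path: memcpy() never performs a misaligned access. */
	uint64_t value;

	memcpy(&value, src, sizeof (value));
	return (value);
#endif
}
```

When the macro is not defined, as on sparc64, every read goes through the
memcpy() path, which is the fallback behavior the commit describes.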

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7642 
Closes #7684
2018-07-11 13:10:40 -07:00
Matthew Ahrens 2dca37d8dc OpenZFS 9424 - ztest failure: "unprotected error in call to Lua API (Invalid value type 'function' for key 'error')"
Ztest failed with the following crash.

    ::status

    debugging core file of ztest (64-bit) from clone-dc-slave-280-bc7947b1.dcenter
    file: /usr/bin/amd64/ztest
    initial argv: /usr/bin/amd64/ztest
    threading model: raw lwps
    status: process terminated by SIGABRT (Abort), pid=2150 uid=1025 code=-1
    panic message: failure for thread 0xfffffd7fff112a40, thread-id 1: unprotected error in call to Lua API (Invalid
    value type 'function' for key 'error')

    ::stack

    libc.so.1`_lwp_kill+0xa()
    libc.so.1`_assfail+0x182(fffffd7fffdfe8d0, 0, 0)
    libc.so.1`assfail+0x19(fffffd7fffdfe8d0, 0, 0)
    libzpool.so.1`vpanic+0x3d(fffffd7ffaa58c20, fffffd7fffdfeb00)
    0xfffffd7ffaa28146()
    0xfffffd7ffaa0a109()
    libzpool.so.1`luaD_throw+0x86(3011a48, 2)
    0xfffffd7ffa9350d3()
    0xfffffd7ffa93e3f1()
    libzpool.so.1`zcp_lua_to_nvlist+0x33(3011a48, 1, 2686470, fffffd7ffaa2e2c3)
    libzpool.so.1`zcp_convert_return_values+0xa4(3011a48, 2686470, fffffd7ffaa2e2c3, fffffd7fffdfedd0)
    libzpool.so.1`zcp_pool_error+0x59(fffffd7fffdfedd0, 1e0f450)
    libzpool.so.1`zcp_eval+0x6f8(1e0f450, fffffd7ffaa483f8, 1, 0, 6400000, 1d33b30)
    libzpool.so.1`dsl_destroy_snapshots_nvl+0x12c(2786b60, 0, 484750)
    libzpool.so.1`dsl_destroy_snapshot+0x4f(fffffd7fffdfef70, 0)
    ztest_dsl_dataset_cleanup+0xea(fffffd7fffdff4c0, 1)
    ztest_dataset_destroy+0x53(1)
    ztest_run+0x59f(fffffd7fff0e0498)
    main+0x7ff(1, fffffd7fffdffa88)
    _start+0x6c()

The problem is that zcp_convert_return_values() assumes that there's
exactly one value on the stack, but that isn't always true. It ends up
putting the wrong thing on the stack which is then consumed by
zcp_convert_return_values(), which either adds the wrong message to the
nvlist, or blows up.

The fix is to make sure that callers of zcp_convert_return_values()
clear the stack before pushing their error message, and
zcp_convert_return_values() should VERIFY that the stack is the expected
size.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>

OpenZFS-issue: https://www.illumos.org/issues/9424
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/eb7e57429
Closes #7696
2018-07-10 21:29:23 -07:00
Brian Behlendorf e2cc448b60 Reduce zdb output when pool contains checkpoint
When running zdb without additional arguments against a pool containing
a checkpoint the entire checkpoint spacemap should not be dumped.  Make
this behavior conditional upon passing the -mmmm option as described in
the zdb(8) man page.

     -mmmm   Display every spacemap record.

Reviewed-by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7702
2018-07-10 21:23:17 -07:00
Matthew Ahrens 00c405b4b5 OpenZFS 9454 - ::zfs_blkstats should count embedded blocks
When we do a scrub or resilver, ZFS counts the different types of blocks,
which can be printed by the ::zfs_blkstats mdb dcmd. However, it fails to
count embedded blocks.

Porting notes:
* Commit d4a72f23 moved count_blocks under a BP_IS_EMBEDDED conditional
  as part of the sequential resilver functionality.  Since phys_birth
  would be zero that case should never happen as described above.  This
  is confirmed by the code coverage analysis.  Remove the conditional
  to realign that aspect of this function with OpenZFS.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>

OpenZFS-issue: https://www.illumos.org/issues/9454
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d8a447a7
Closes #7697
2018-07-10 10:41:38 -07:00
Prakash Surya ab11916583 OpenZFS 9456 - ztest failure in zil_commit_waiter_timeout
Problem
=======

Illumos bug 8373 was integrated, which now presents a code path where
"dmu_tx_assign" can fail.  When "dmu_tx_assign" fails, it will not issue
the lwb that was passed in to "zil_lwb_write_issue".  As a result, when
"zil_lwb_write_issue" returns, the lwb will still be in the "opened"
state, just as it was when "zil_lwb_write_issue" was originally called.

Solution
========

As a result of this new call path, the failed assertion needs to be
modified to be aware of this new possibility. Thus, we can only assert
that the lwb is no longer in the "opened" state if the returned lwb is
non-null, since we cannot differentiate between the case of
"dmu_tx_assign" failing or "zio_alloc_zil" failing within the call to
"zil_lwb_write_issue".

Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Matt Ahrens <mahrens@delphix.com>

OpenZFS-issue: https://www.illumos.org/issues/9456
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a8b09f4e
Closes #7695
2018-07-10 10:25:14 -07:00
Serapheim Dimitropoulos a7ed98d8b5 OpenZFS 9330 - stack overflow when creating a deeply nested dataset
Datasets that are deeply nested (~100 levels) are impractical. We just
put a limit of 50 levels to newly created datasets. Existing datasets
should work without a problem.
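
A minimal sketch of such a nesting check, assuming the depth is simply the
number of '/'-separated components in the dataset name.  The helper names
and the exact comparison are hypothetical; the message only states the limit
of 50 levels.

```c
#include <assert.h>
#include <stddef.h>

#define	MAX_DATASET_NESTING	50	/* limit quoted in the commit message */

/* Hypothetical helper: count the levels in a name such as "pool/a/b". */
static int
dataset_depth(const char *name)
{
	int depth = 1;

	assert(name != NULL);
	for (const char *p = name; *p != '\0'; p++) {
		if (*p == '/')
			depth++;
	}
	return (depth);
}

/* Returns 1 if a new dataset with this name would be accepted, else 0. */
static int
dataset_nesting_ok(const char *name)
{
	return (dataset_depth(name) <= MAX_DATASET_NESTING);
}
```

Applying the check only at creation time is what lets existing, deeper
datasets keep working.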

The problem can be seen by attempting to create a dataset using the -p
option with many levels:

    panic[cpu0]/thread=ffffff01cd282c20: BAD TRAP: type=8 (#df Double fault) rp=ffffffff

    fffffffffbc3aa60 unix:die+100 ()
    fffffffffbc3ab70 unix:trap+157d ()
    ffffff00083d7020 unix:_patch_xrstorq_rbx+196 ()
    ffffff00083d7050 zfs:dbuf_rele+2e ()
    ...
    ffffff00083d7080 zfs:dsl_dir_close+32 ()
    ffffff00083d70b0 zfs:dsl_dir_evict+30 ()
    ffffff00083d70d0 zfs:dbuf_evict_user+4a ()
    ffffff00083d7100 zfs:dbuf_rele_and_unlock+87 ()
    ffffff00083d7130 zfs:dbuf_rele+2e ()
    ... The block above repeats once per directory in the ...
    ... create -p command, working towards the root ...
    ffffff00083db9f0 zfs:dsl_dataset_drop_ref+19 ()
    ffffff00083dba20 zfs:dsl_dataset_rele+42 ()
    ffffff00083dba70 zfs:dmu_objset_prefetch+e4 ()
    ffffff00083dbaa0 zfs:findfunc+23 ()
    ffffff00083dbb80 zfs:dmu_objset_find_spa+38c ()
    ffffff00083dbbc0 zfs:dmu_objset_find+40 ()
    ffffff00083dbc20 zfs:zfs_ioc_snapshot_list_next+4b ()
    ffffff00083dbcc0 zfs:zfsdev_ioctl+347 ()
    ffffff00083dbd00 genunix:cdev_ioctl+45 ()
    ffffff00083dbd40 specfs:spec_ioctl+5a ()
    ffffff00083dbdc0 genunix:fop_ioctl+7b ()
    ffffff00083dbec0 genunix:ioctl+18e ()
    ffffff00083dbf10 unix:brand_sys_sysenter+1c9 ()

Porting notes:
* Added zfs_max_dataset_nesting module option with documentation.
* Updated zfs_rename_014_neg.ksh for Linux.
* Increase the zfs.sh stack warning to 15K.  Enough time has passed
  that 16K can be reasonably assumed to be the default value.  It
  was increased in the 3.15 kernel released in June of 2014.

Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>

OpenZFS-issue: https://www.illumos.org/issues/9330
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/757a75a
Closes #7681
2018-07-09 13:02:50 -07:00
Brian Behlendorf 66df02497c ZTS: clean_mirror and scrub_mirror cleanup
Remove the dependency on partitionable devices for the clean_mirror
and scrub_mirror test cases.  This allows for the setup and cleanup
of the test cases to be simplified by removing the need for complex
partitioning.

This change also resolves an issue where the clean_mirror devices
were not being properly damaged since the device name was not a
full path.  As a result, loopX files were being left in the
top level test_results directory.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7434 
Closes #7690
2018-07-09 12:46:14 -07:00
Troels Nørgaard 94370f5955 Default ashift for Amazon EC2 NVMe devices
Add a default 4 KiB ashift for Amazon EC2 NVMe devices on instances with
NVMe ephemeral devices, such as the types c5d, f1, i3 and m5d.
As per the official documentation [1] a 4096 byte blocksize should be
used to match the underlying hardware.
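
For reference, the ashift is the base-2 logarithm of the block size, so a
4096 byte block size corresponds to ashift=12.  A small sketch of that
arithmetic (ashift_for_block_size() is a hypothetical helper, not part of
this change):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: ashift is log2 of the block size in bytes. */
static unsigned
ashift_for_block_size(uint64_t bytes)
{
	unsigned shift = 0;

	/* Only power-of-two sizes map to an exact ashift. */
	assert(bytes != 0 && (bytes & (bytes - 1)) == 0);
	while ((1ULL << shift) < bytes)
		shift++;
	return (shift);
}
```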

The string was identified via:

$ sudo sginfo -M /dev/nvme0n1
INQUIRY response (cmd: 0x12)
----------------------------
Device Type                        0
Vendor:                    NVMe
Product:                   Amazon EC2 NVMe
Revision level:

$ lsblk -io KNAME,TYPE,SIZE,MODEL
KNAME   TYPE    SIZE MODEL
nvme0n1 disk  442.4G Amazon EC2 NVMe Instance Storage

[1] https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/
    storage-optimized-instances.html
    Retrieved 2018-07-03

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Troels Nørgaard <tnn@tradeshift.com>
Closes #7676
2018-07-06 16:15:19 -07:00
Paul Dagnelie 3f4e6d6f36 Cause autogen.sh to fail if autoreconf fails
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7683
2018-07-06 09:27:37 -07:00
Serapheim Dimitropoulos 4d044c4c1d OpenZFS 9238 - ZFS Spacemap Encoding V2
Motivation
==========

The current space map encoding has the following disadvantages:
[1] Assuming a 512-byte sector size, each entry can represent at most 16MB for a segment.
    This makes the encoding very inefficient for large regions of space.
[2] As vdev-wide space maps have started to be used by new features (e.g.
    device removal, zpool checkpoint) we've started imposing limits in the
    vdevs that can be used with them based on the maximum addressable offset
    (currently 64PB for a top-level vdev).
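
Both limits follow from the field widths of the single-word entry.  The
sketch below assumes a 15-bit run field and a 47-bit offset field, each
counted in 512-byte sectors; those widths are an assumption here, since the
text above only gives the resulting limits.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed field widths of the single-word entry; not stated in the text. */
#define	OLD_RUN_BITS	15
#define	OLD_OFFSET_BITS	47
#define	SECTOR_SHIFT	9	/* 512-byte sectors */

/* Largest segment one entry can describe: 2^15 sectors = 16MB. */
static uint64_t
old_max_run_bytes(void)
{
	return (1ULL << (OLD_RUN_BITS + SECTOR_SHIFT));
}

/* Largest addressable offset: 2^47 sectors = 64PB. */
static uint64_t
old_max_offset_bytes(void)
{
	return (1ULL << (OLD_OFFSET_BITS + SECTOR_SHIFT));
}
```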

New encoding
============

The layout can be found at space_map.h and it remains backwards compatible with
the old one. The introduced two-word entry format, besides extending the limits
imposed by the single-entry layout, also includes a vdev field and some extra
padding after its prefix.

The extra padding after the prefix is reserved for future usage (e.g.
new prefixes for future encodings or new fields for flags). The new vdev field
not only makes the space maps more self-descriptive, but also opens the doors
for pool-wide space maps (expected to be used in the log spacemap project).

One final important note is that the number of bits used for vdevs is reduced
to 24 bits for blkptrs. That was decided as we don't know of any setups that
use more than 16M vdevs for the time being and we wanted to fit the vdev field
in the space map. In addition that gives us some extra bits in dva_t.

Other references:
=================

The new encoding is also discussed towards the end of the Log Space Map
presentation from 2017's OpenZFS summit.
Link: https://www.youtube.com/watch?v=jj2IxRkl5bQ

Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-commit: https://github.com/openzfs/openzfs/commit/90a56e6d
OpenZFS-issue: https://www.illumos.org/issues/9238
Closes #7665
2018-07-05 12:02:34 -07:00
Ahmed Gahnem 4e82b4be78 OpenZFS 9184 - Add ZFS performance test for fixed blocksize random read/write IO
This change introduces a new performance test which does random reads
and writes, but instead of using `bssplit` to determine the block size,
it uses a fixed blocksize. Additionally, some new IO sizes are added to
other tests and timestamp data is recorded with the performance data.

Authored by: Ahmed Gahnem <ahmedg@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com>
Ported-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: John Wren Kennedy <john.kennedy@delphix.com>
Requires-builders: perf

OpenZFS-issue: https://www.illumos.org/issues/9184
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/659
External-issue: DLPX-46724
Closes #7660
2018-07-02 13:46:06 -07:00
Tom Caputi 370bbf66ae Fix coverity defects: CID 176037
CID 176037: Uninitialized scalar variable

This patch fixes an uninitialized variable defect caught by
coverity and introduced in 69830602

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7667
2018-07-02 13:37:48 -07:00
Brian Behlendorf e03a41a604 ZTS: Improve enospc tests
The enospc_002_pos test case would frequently fail due to a command
succeeding when it was expected to fail due to lack of space.
In order to make this far less likely, files are created across
multiple transaction groups in order to consume as many unused
blocks as possible.

The dependency that the tests run on a partitioned block device
has been removed.  It's simpler to use sparse files.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7663
2018-06-29 09:40:32 -07:00
Tom Caputi da2feb42fb Fix 'zfs recv' of non large_dnode send streams
Currently, there is a bug where older send streams without the
DMU_BACKUP_FEATURE_LARGE_DNODE flag are not handled correctly.
The code in receive_object() fails to handle cases where
drro->drr_dn_slots is set to 0, which is always the case when the
sending code does not support this feature flag. This patch fixes
the issue by ensuring that a value of 0 is treated as
DNODE_MIN_SLOTS.
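
The fallback amounts to something like the sketch below.  The helper name is
hypothetical, and the value of DNODE_MIN_SLOTS shown here (one slot) is an
assumption made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define	DNODE_MIN_SLOTS	1	/* assumed value: a single dnode slot */

/* Hypothetical helper mirroring the fix described above. */
static uint8_t
effective_dn_slots(uint8_t drr_dn_slots)
{
	/* Older streams leave the field zeroed; fall back to the minimum. */
	return (drr_dn_slots != 0 ? drr_dn_slots : DNODE_MIN_SLOTS);
}
```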

Tested-by:  DHE <git@dehacked.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7617 
Closes #7662
2018-06-28 14:55:11 -07:00
Chunwei Chen edf60b8645 Enforce PROP_ONETIME on zpool properties
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #7661
2018-06-28 14:49:17 -07:00
Tom Caputi 69830602de Raw receive fix and encrypted objset security fix
This patch fixes two problems with the encryption code. First, the
current code does not correctly prohibit the DMU from updating
dn_maxblkid during object truncation within a raw receive. This
usually only causes issues when the truncating DRR_FREE record is
aggregated with DRR_FREE records later in the receive, so it is
relatively hard to hit.

Second, this patch fixes a security issue where reading blocks
within an encrypted object did not guarantee that the dnode block
itself had ever been verified against its MAC. Usually the
verification happened anyway when the bonus buffer was read, but
some use cases (notably zvols) might never perform the check.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7632
2018-06-28 09:20:34 -07:00
Tim Chase 3be1eb29da Fix formatting in zpool-features(5)
The formatting of the features beginning with large_blocks was broken
when the zpool_checkpoint feature was added.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7658
2018-06-27 09:34:25 -07:00
Eitan Adler fb8a10d5be OpenZFS 9521 - Add checkpoint field
Add checkpoint field in the default list of the zpool-list man page

Authored by: Eitan Adler <lists@eitanadler.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: kpande <github@tripleback.net>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9521
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c5a860f7b
Closes #7658
2018-06-27 09:33:37 -07:00
Low-power aebfb84851 Fix missing option '-e' in zpool online usage
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: WHR <msl0000023508@gmail.com>
Closes #7655
2018-06-26 10:17:55 -07:00
Serapheim Dimitropoulos d2734cce68 OpenZFS 9166 - zfs storage pool checkpoint
Details about the motivation of this feature and its usage can
be found in this blogpost:

    https://sdimitro.github.io/post/zpool-checkpoint/

A lightning talk of this feature can be found here:
https://www.youtube.com/watch?v=fPQA8K40jAM

Implementation details can be found in the big block comment of
spa_checkpoint.c

Side-changes that are relevant to this commit but not explained
elsewhere:

* renames members of "struct metaslab" trees to be shorter without
  losing meaning

* space_map_{alloc,truncate}() accept a block size as a
  parameter. The reason is that in the current state all space
  maps that we allocate through the DMU use a global tunable
  (space_map_blksz) which defaults to 4KB. This is ok for metaslab
  space maps in terms of bandwidth since they are scattered all
  over the disk. But for other space maps this default is probably
  not what we want. Examples are device removal's vdev_obsolete_sm
  or vdev_checkpoint_sm from this review. Both of these have a
  1:1 relationship with each vdev and could benefit from a bigger
  block size.

Porting notes:

* The part of dsl_scan_sync() which handles async destroys has
  been moved into the new dsl_process_async_destroys() function.

* Remove "VERIFY(!(flags & FWRITE))" in "kernel.c" so zhack can write
  to block device backed pools.

* ZTS:
  * Fix get_txg() in zpool_sync_001_pos due to "checkpoint_txg".

  * Don't use large dd block sizes on /dev/urandom under Linux in
    checkpoint_capacity.

  * Adopt Delphix-OS's setting of 4 (spa_asize_inflation =
    SPA_DVAS_PER_BP + 1) for the checkpoint_capacity test to speed
    its attempts to fill the pool

  * Create the base and nested pools with sync=disabled to speed up
    the "setup" phase.

  * Clear labels in test pool between checkpoint tests to avoid
    duplicate pool issues.

  * The import_rewind_device_replaced test has been marked as "known
    to fail" for the reasons listed in its DISCLAIMER.

  * New module parameters:

      zfs_spa_discard_memory_limit,
      zfs_remove_max_bytes_pause (not documented - debugging only)
      vdev_max_ms_count (formerly metaslabs_per_vdev)
      vdev_min_ms_count

Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/9166
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7159fdb8
Closes #7570
2018-06-26 10:07:42 -07:00
Tim Chase 88eaf610d9 Use "eval" in history_002_pos for log_must
Otherwise the output is consumed by the output redirection.

Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #7570
2018-06-26 09:58:05 -07:00
ajs124 a8577bdb32 Fix duplicate "fB" typo
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: bunder2015 <omfgbunder@gmail.com>
Signed-off-by: ajs124 <git@ajs124.de>
Closes #7649
2018-06-25 09:50:01 -07:00
Serapheim Dimitropoulos 7637ef8d23 OpenZFS 9591 - ms_shift can be incorrectly changed
ms_shift can be incorrectly changed in MOS config for
indirect vdevs that have been historically expanded

According to spa_config_update() we expect new vdevs to have
vdev_ms_array equal to 0 and then we go ahead and set their metaslab
size. The problem is that indirect vdevs also have vdev_ms_array == 0
because their metaslabs are destroyed once their removal is done.

As a result, a vdev that was expanded and then removed may have its
ms_shift changed if another vdev was added after its removal.
Fortunately this behavior does not cause any type of crash or bad
behavior in the kernel but it can confuse zdb and anyone doing any kind
of analysis of the history of the pools.

Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-commit: https://github.com/openzfs/openzfs/pull/651
OpenZFS-issue: https://illumos.org/issues/9591a
External-issue: DLPX-58879
Closes #7644
2018-06-21 09:35:26 -07:00
Matthew Ahrens af43029484 Remove suffix from zio taskq names
For zio taskq's which have multiple instances (e.g. z_rd_int_0,
z_rd_int_1, etc), each one has a unique name (the _0, _1, _2 suffix).
This makes performance analysis more difficult, because by default,
`perf` includes the thread name (which is the same as the taskq name) in
the stack trace.  This means that we get 8 different stacks, all of
which are doing the same thing, but are executed from different taskq's.

We should remove the suffix of the taskq name, so that all the
read-interrupt threads are named z_rd_int.

Note that we already support multiple taskq's with the same name.  This
happens when there are multiple pools.  In this case the taskq has a
different tq_instance, which shows up in /proc/spl/taskq-all.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7646
2018-06-20 14:07:50 -07:00
Brian Behlendorf e4a3297a04 ZTS: Adopt OpenZFS test analysis script
Adopt and extend the OpenZFS ZTS results analysis script for use
with ZFS on Linux.  This allows for automatic analysis of tests
which may be skipped for a variety of reasons or which are not
entirely reliable.

In addition to the list of 'known' failures, which have been updated
for ZFS on Linux, there is a new 'maybe' section.  This mapping
includes tests which might be correctly skipped depending on the
test environment.  This may be because of a missing dependency or
lack of required kernel support.  This list also includes tests
which normally pass but might on occasion fail for a harmless
reason.

The script was also extended to include a reason for why a given test
might be skipped or may fail.  The reason will be included after
the test in the "results other than PASS that are expected" section.
For failures it is preferable to set the reason to the GitHub issue
number and for skipped tests several generic reasons are available.
You may also specify a custom reason if needed.

All tests were added back in to the linux.run file even if they are
expected to fail.  There is value in running tests which may not
pass; the expected results for these tests have been encoded in
the new analysis script.

All tests which were disabled because they ran more slowly on a
32-bit system have been re-enabled.  Developers working on 32-bit
systems should assess what is reasonable for their environment.

The unnecessary dependency on physical block devices was removed for
the checksum, grow_pool, and grow_replicas test groups so they are
no longer skipped.  Updated the filetest_001_pos test case to run
properly now that it is enabled and moved the grow tests in to a
single directory.

Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7638
2018-06-20 14:03:13 -07:00
Brian Behlendorf 1c38ac61e1 Linux 4.14 compat: blk_queue_stackable()
The blk_queue_stackable() function was replaced in the 4.14 kernel
by queue_is_rq_based(), commit torvalds/linux@5fdee212.  This change
resulted in the default elevator being used which can negatively
impact performance.

Rather than adding additional compatibility code to detect the
new interface unconditionally attempt to set the elevator.  Since
we expect this to fail for block devices without an elevator the
error message has been moved in to zfs_dbgmsg().

Finally, it was observed that elevator_change() was removed
from the 4.12 kernel, commit torvalds/linux@c033269.  Update the
comment to clearly specify which kernels are expected to export the
elevator_change() symbol.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7645
2018-06-19 21:52:45 -07:00
Brian Behlendorf 6413c95fbd Linux 4.18 compat: inode timespec -> timespec64
Commit torvalds/linux@95582b0 changes the inode i_atime, i_mtime,
and i_ctime members from timespec's to timespec64's to make them
2038 safe.  As part of this change the current_time() function was
also updated to return the timespec64 type.

Resolve this issue by introducing a new inode_timespec_t type which
is defined to match the timespec type used by the inode.  It should
be used when working with inode timestamps to ensure matching types.

The timestruc_t type under Illumos was used in a similar fashion but
was specified to always be a timespec_t.  Rather than incorrectly
define this type all timespec_t types have been replaced by the new
inode_timespec_t type.

Finally, the kernel and user space 'sys/time.h' headers were aligned
with each other.  They define, as appropriate for the context, several
constants as macros and include static inline implementations of
gethrestime(), gethrestime_sec(), and gethrtime().

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7643
2018-06-19 21:51:18 -07:00
kpande aeb39df726 Fix typo in comment, handeling->handling
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: bunder2015 <omfgbunder@gmail.com>
Signed-off-by: Kash Pande <kash@tripleback.net>
Closes #7641
2018-06-18 21:51:06 -07:00
Tom Caputi cd32e5db8b Add ASSERT to debug encryption key mapping issues
This patch simply adds an ASSERT that confirms that the last
decrypting reference on a dataset waits until the dataset is
no longer dirty. This should help to debug issues where the
ZIO layer cannot find encryption keys after a dataset has been
disowned.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7637
2018-06-18 14:10:54 -07:00
Richard Yao 517d247192 copy-builtin: SPL must be in Kbuild first
The recent SPL merge caused a regression in kernels with ZFS integrated
into the sources where our modules would be initialized in alphabetical
order, despite icp requiring the spl module be loaded first. This caused
kernels with ZFS builtin to fail to boot.

We resolve this by adding a special case for the spl that lists it
first. It is somewhat ugly, but it works.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Thode <prometheanfire@gentoo.org>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #7595 
Closes #7606
2018-06-15 15:16:29 -07:00
John Gallagher 917f475fba Add tunables for channel programs
This patch adds tunables for modifying the maximum memory limit and
maximum instruction limit that can be specified when running a channel
program.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
External-issue: LX-1085
Closes #7618
2018-06-15 15:10:42 -07:00
Brian Behlendorf 7b98f0d91f Linux compat 4.18: check_disk_size_change()
Added support for the bops->check_events() interface which was
added in the 2.6.38 kernel to replace bops->media_changed().
Fully implementing this functionality allows the volume resize
code to rely on revalidate_disk(), which is the preferred
mechanism, and removes the need to use check_disk_size_change().

In order for bops->check_events() to lookup the zvol_state_t
stored in the disk->private_data the zvol_state_lock needs to
be held.  Since the check events interface may poll, the mutex
has been converted to a rwlock for better concurrency.  The
rwlock need only be taken as a writer in the zvol_free() path
when disk->private_data is set to NULL.

The configure checks for the block_device_operations structure
were consolidated in a single kernel-block-device-operations.m4
file.

The ZFS_AC_KERNEL_BDEV_BLOCK_DEVICE_OPERATIONS configure checks
and associated dead code were removed.  This interface was added
to the 2.6.28 kernel which predates the oldest supported 2.6.32
kernel and will therefore always be available.

Updated maximum Linux version in META file.  The 4.17 kernel
was released on 2018-06-03 and ZoL is compatible with the
finalized kernel.

Reviewed-by: Boris Protopopov <boris.protopopov@actifio.com>
Reviewed-by: Sara Hartse <sara.hartse@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7611
2018-06-15 15:05:21 -07:00
Allan Jude 29445fe3a0 Reserve DMU_BACKUP_FEATURE for ZSTD
Reserve bit 25 for the ZSTD compression feature from FreeBSD.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Allan Jude <allanjude@freebsd.org>
Closes #7626
2018-06-14 09:47:26 -07:00
Brian Behlendorf 7e0594a3da Remove libefi __linux__ wrappers
The ZoL version of libefi has been modified for Linux in several
places outside the existing __linux__ wrappers.  Remove them to
make the code easier to read and so as not to mislead anyone that
these are the sole modifications for Linux.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7625
2018-06-14 09:43:32 -07:00
Brian Behlendorf c91cf36fc2 Fix ztest_vdev_add_remove() test case
Commit 2ffd89fc allowed two new errors to be reported by zil_reset()
in order to provide a descriptive error message regarding why a log
device could not be removed.  However, the new return values were
not handled in the ztest_vdev_add_remove() test case resulting in
ztest failures during automated testing.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7630
2018-06-14 09:41:27 -07:00
Matthew Ahrens 1fac63e56f OpenZFS 9577 - remove zfs_dbuf_evict_key tsd
The zfs_dbuf_evict_key TSD (thread-specific data) is not necessary -
we can instead pass a flag down in a few places to prevent recursive
dbuf eviction. Making this change has 3 benefits:

1. The code semantics are easier to understand.
2. On Linux, performance is improved, because creating/removing
   TSD values (by setting to NULL vs non-NULL) is expensive, and
   we do it very often.
3. According to Nexenta, the current semantics can cause a
   deadlock when concurrently calling dmu_objset_evict_dbufs()
   (which is rare today, but they are working on a "parallel
   unmount" change that triggers this more easily).

Porting Notes:
* Minor conflict with OpenZFS 9337 which has not yet been ported.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9577
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/645
External-issue: DLPX-58547
Closes #7602
2018-06-13 11:05:06 -07:00
Brian Behlendorf 232dd8b956 Fix efi_get_info() zvol detection
Partition detection for zvol devices was not working correctly
resulting in inconsistent partitioning behavior when layering pools
on top of zvols.  This isn't a supported configuration but we'd
still like it to work properly.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7624
2018-06-13 10:20:58 -07:00
John Gallagher ab24877bd3 ZTS: deletes home directories in /export/home
In the cleanup for the privilege tests, an empty variable, empty because
the corresponding setup is skipped on Linux, results in /export/home
being deleted. This patch adds an assertion that the variable is not
empty, and causes the cleanup to be skipped on Linux as well.

Reviewed by: John Wren Kennedy <jwk404@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
External-issue: LX-1099
Closes #7615
2018-06-12 10:42:26 -07:00
bunder2015 5277571260 ZTS: cleanup user_run
user_run leaves two files in /tmp, moving them to $TEST_BASE_DIR and
adding them to the default cleanup routine.

Reviewed by: John Wren Kennedy <jwk404@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #7614
2018-06-12 10:37:12 -07:00
Tom Caputi c634808ebb Add pyzfs build directories to gitignore
The recent addition of pyzfs does not include the generated 'build'
and 'pyzfs.egg-info' directories in the pyzfs .gitignore or the
'make clean' target. This patch simply corrects this problem.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7612
2018-06-11 18:42:12 -07:00
Paul Zuchowski 2ffd89fcb9 Wrong error message when removing log device
In the case where the pool is loaded without the crypto
keys necessary to playback the intent log, and log device
removal is attempted, a generic busy message is received.
Change the message to inform the user that the datasets
must be mounted.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #7518
2018-06-07 18:07:29 -07:00
Brian Behlendorf 174bcd581d Fix preemptible warning in aggsum_add()
In the new aggsum counters the CPU_SEQID macro should be surrounded by
kpreempt_disable() and kpreempt_enable() calls to prevent a Linux
kernel BUG warning.  The aggsum_add() function uses the cpuid to
minimize lock contention when selecting a bucket, after selection
the bucket is protected by a mutex and it is safe to reschedule the
process to a different processor at any time.

Reviewed-by: Matthew Thode <prometheanfire@gentoo.org>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7609
Closes #7610
2018-06-07 15:55:11 -07:00
Antonio Russo 39042f9736 Tunable directory for zfs runtime scripts
zpool and zed place scripts in subdirectories of libexecdir. Some
distributions locate architecture independent scripts in other locations
(e.g. Debian). To avoid these paths getting out of sync, centralize the
definitions.

Build zfs-test's default.cfg by Makefile.  Use the new directory
logic when building tests/zfs-tests/include/default.cfg.in.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com>
Closes #7597
2018-06-07 09:59:59 -07:00
Nathaniel Clark fba33c3819 Don't panic on bad SA_MAGIC in sa_build_index
If sa_build_index() encounters a corrupt buffer, don't panic.
Add info to zfs ring buffer and return EIO.  This allows for a cleaner
error recovery path.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Nathaniel Clark <nathaniel.l.clark@intel.com>
Issue #6500 
Closes #7487
2018-06-07 09:51:56 -07:00
Antonio Russo 7106b23640 Minor documentation, logging, and testing typos
This patch collects some minor inconsistencies and typos in the
documentation, logging and testing infrastructure.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com>
Closes #7608
2018-06-07 09:38:39 -07:00
Tom Caputi b405837a6c Update the correct abd in l2arc_read_done()
This patch fixes an issue where l2arc_read_done() would always
write data to b_pabd, even if raw encrypted data was requested.
This only occurred in cases where the L2ARC device had a different
ashift than the main pool.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7586 
Closes #7593
2018-06-06 10:17:50 -07:00
Tom Caputi e7504d7a18 Raw receive functions must not decrypt data
This patch fixes a small bug found where receive_spill() sometimes
attempted to decrypt spill blocks when doing a raw receive. In
addition, this patch fixes another small issue in arc_buf_fill()'s
error handling where a decryption failure (which could be caused by
the first bug) would attempt to set the arc header's IO_ERROR flag
without holding the header's lock.

Reviewed-by: Matthew Thode <prometheanfire@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7564 
Closes #7584 
Closes #7592
2018-06-06 10:16:41 -07:00
Alek P 6969afcefd Always continue recursive destroy after error
Currently, during a recursive zfs destroy the first error that is
encountered will stop the destruction of the datasets. Errors may
happen for a variety of reasons including competing deletions
and busy datasets.
This patch switches recursive destroy to always do a best-effort
recursive dataset destroy.
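
The best-effort behavior can be sketched as follows. This is only an
illustration: destroy_all(), fake_destroy() and the dataset names are
hypothetical, and returning the first error seen is an assumption of
the sketch rather than something this patch states.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback: returns 0 on success or an errno-style code. */
typedef int (*destroy_fn_t)(const char *name);

/* Stand-in for a real destroy: names starting with "busy" fail. */
static int
fake_destroy(const char *name)
{
	return (strncmp(name, "busy", 4) == 0 ? EBUSY : 0);
}

/*
 * Try every dataset even if an earlier one failed.  The first error is
 * remembered and returned once the loop completes (assumed policy).
 */
static int
destroy_all(const char *const *names, size_t n, destroy_fn_t destroy,
    size_t *ndestroyed)
{
	int first_err = 0;

	assert(ndestroyed != NULL);
	*ndestroyed = 0;
	for (size_t i = 0; i < n; i++) {
		int err = destroy(names[i]);

		if (err == 0)
			(*ndestroyed)++;
		else if (first_err == 0)
			first_err = err;	/* keep going */
	}
	return (first_err);
}
```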

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #7574
2018-06-06 10:14:52 -07:00
bunder2015 62841115bc ZTS: history path cleanup
History tests were hard coded to use /tmp and didn't clean up
properly after testing.

Reviewed by: John Wren Kennedy <jwk404@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Issue #7507 
Closes #7600
2018-06-06 10:00:22 -07:00
Paul Dagnelie 37fb3e4318 OpenZFS 8484 - Implement aggregate sum and use for arc counters
In pursuit of improving performance on multi-core systems, we should
implement fanned-out counters and use them to improve the performance of
some of the arc statistics. These stats are updated extremely frequently,
and can consume a significant amount of CPU time.
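
A simplified user-space sketch of a fanned-out counter follows. It is
not the aggsum implementation from this change: the fanout_* names and
the fixed bucket count are hypothetical, and every read folds all
buckets into the total, which is a simplification of this sketch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

#define	FANOUT_BUCKETS	4	/* illustrative; typically one per CPU */

typedef struct fanout_sum {
	pthread_mutex_t fs_lock;			/* protects fs_total */
	int64_t fs_total;
	pthread_mutex_t fs_bucket_lock[FANOUT_BUCKETS];
	int64_t fs_bucket_delta[FANOUT_BUCKETS];	/* not yet in fs_total */
} fanout_sum_t;

static void
fanout_init(fanout_sum_t *fs)
{
	int err = pthread_mutex_init(&fs->fs_lock, NULL);

	assert(err == 0);
	fs->fs_total = 0;
	for (int i = 0; i < FANOUT_BUCKETS; i++) {
		err = pthread_mutex_init(&fs->fs_bucket_lock[i], NULL);
		assert(err == 0);
		fs->fs_bucket_delta[i] = 0;
	}
}

/* Hot path: touch only the bucket selected by the caller's hint. */
static void
fanout_add(fanout_sum_t *fs, unsigned int hint, int64_t delta)
{
	unsigned int b = hint % FANOUT_BUCKETS;

	(void) pthread_mutex_lock(&fs->fs_bucket_lock[b]);
	fs->fs_bucket_delta[b] += delta;
	(void) pthread_mutex_unlock(&fs->fs_bucket_lock[b]);
}

/* Slow path: fold every bucket into the total and return it. */
static int64_t
fanout_value(fanout_sum_t *fs)
{
	int64_t total;

	(void) pthread_mutex_lock(&fs->fs_lock);
	for (int i = 0; i < FANOUT_BUCKETS; i++) {
		(void) pthread_mutex_lock(&fs->fs_bucket_lock[i]);
		fs->fs_total += fs->fs_bucket_delta[i];
		fs->fs_bucket_delta[i] = 0;
		(void) pthread_mutex_unlock(&fs->fs_bucket_lock[i]);
	}
	total = fs->fs_total;
	(void) pthread_mutex_unlock(&fs->fs_lock);
	return (total);
}
```

Writers that pass different hints contend on different mutexes, so
frequent updates stay cheap while the less frequent read pays the cost
of visiting every bucket.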

Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Paul Dagnelie <pcd@delphix.com>

OpenZFS-issue: https://www.illumos.org/issues/8484
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7028a8b92b7
Issue #3752
Closes #7462
2018-06-06 09:35:59 -07:00
Tony Hutter f0ed6c7448 Add pool state /proc entry, "SUSPENDED" pools
1. Add a proc entry to display the pool's state:

$ cat /proc/spl/kstat/zfs/tank/state
ONLINE

This is done without using the spa config locks, so it will
never hang.

2. Fix 'zpool status' and 'zpool list -o health' output to print
"SUSPENDED" instead of "ONLINE" for suspended pools.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7331 
Closes #7563
2018-06-06 09:33:54 -07:00
Brian Behlendorf 2d9142c9d4 Remove rwlock wrappers
The only remaining consumer of the rwlock compatibility wrappers
is ztest.  Remove the wrappers and convert the few remaining
calls to the underlying pthread functions.

    rwlock_init()    -> pthread_rwlock_init()
    rwlock_destroy() -> pthread_rwlock_destroy()
    rw_rdlock()      -> pthread_rwlock_rdlock()
    rw_wrlock()      -> pthread_rwlock_wrlock()
    rw_unlock()      -> pthread_rwlock_unlock()

Note pthread_rwlock_init() defaults to PTHREAD_PROCESS_PRIVATE
which is equivalent to the USYNC_THREAD behavior.  There is no
functional change.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7591
2018-06-04 16:52:10 -07:00
Serapheim Dimitropoulos e48afbc4eb OpenZFS 9464 - txg_kick() fails to see that we are quiescing
txg_kick() fails to see that we are quiescing, forcing transactions to
their next stages without letting them accumulate changes.

Creating a fragmented pool in a DCenter VM and continuously writing to it with
multiple instances of randwritecomp, we get the following output from txg.d:

    0ms   311MB in  4114ms (95% p1)  75MB/s  544MB (76%)  336us   153ms     0ms
    0ms     8MB in    51ms ( 0% p1) 163MB/s  474MB (66%)  129us    34ms     0ms
    0ms   366MB in  4454ms (93% p1)  82MB/s  572MB (79%)  498us    20ms     0ms
    0ms   406MB in  5212ms (95% p1)  77MB/s  591MB (82%)  661us    37ms     0ms
    0ms   340MB in  5110ms (94% p1)  66MB/s  622MB (86%) 1048us    41ms     1ms
    0ms     3MB in    61ms ( 0% p1)  51MB/s  419MB (58%)   33us     0ms     0ms
    0ms   361MB in  3555ms (88% p1) 101MB/s  542MB (75%)  335us    40ms     0ms
    0ms   356MB in  4592ms (92% p1)  77MB/s  561MB (78%)  430us    89ms     1ms
    0ms    11MB in   129ms (13% p1)  90MB/s  507MB (70%)  222us    15ms     0ms
    0ms   281MB in  2520ms (89% p1) 111MB/s  542MB (75%)  334us    42ms     0ms
    0ms   383MB in  3666ms (91% p1) 104MB/s  557MB (77%)  411us   133ms     0ms
    0ms   404MB in  5757ms (94% p1)  70MB/s  635MB (88%) 1274us   123ms     2ms
    4ms   367MB in  4172ms (89% p1)  88MB/s  556MB (77%)  401us    51ms     0ms
    0ms    42MB in   470ms (44% p1)  90MB/s  557MB (77%)  412us    43ms     0ms
    0ms   261MB in  2273ms (88% p1) 114MB/s  556MB (77%)  407us    27ms     0ms
    0ms   394MB in  3646ms (85% p1) 108MB/s  552MB (77%)  393us   304ms     0ms
    0ms   275MB in  2416ms (89% p1) 113MB/s  510MB (71%)  200us    53ms     0ms
    0ms     9MB in    53ms ( 0% p1) 169MB/s  483MB (67%)  140us   100ms     1ms

The TXGs that are getting synced and don't have lots of changes are pushed by
txg_kick() which basically forces the current open txg to get to the quiesced
state:

    if (tx->tx_syncing_txg == 0 &&
        tx->tx_quiesce_txg_waiting <= tx->tx_open_txg &&
        tx->tx_sync_txg_waiting <= tx->tx_synced_txg &&
        tx->tx_quiesced_txg <= tx->tx_synced_txg) {
            tx->tx_quiesce_txg_waiting = tx->tx_open_txg + 1;
            cv_broadcast(&tx->tx_quiesce_more_cv);
    }

The problem is that the above code doesn't check if we are currently quiescing
anything (only if a quiesce or a sync has been requested, etc.) so the
following scenario can happen:

1] We have an open txg A that had enough dirty data (more than
   zfs_dirty_data_sync) and it was pushed to the quiesced state, and opened
   a new txg B. No txg is currently being synced.
2] Immediately after the opening of B, txg_kick() was run by some other write
   (and because of A's dirty data) and saw that we are not currently syncing
   any txg and no one has requested quiescing so it requests one by bumping
   tx_quiesce_txg_waiting and broadcasts the quiesce thread.
3] The quiesce thread just passed txg A to be synced and sees that a quiescing
   request has been sent to it so it immediately grabs B without letting it
   gather enough data, putting it in a quiesced state and opening a new txg C.

In this scenario, txg B is an example of how the entries of interest show up in
the txg.d output.

Ideally we would like txg_kick() to get triggered only when we are sure that
we are not syncing AND not quiescing any txg. This way we can kick an open TXG
to the quiescing state when we are sure that there is nothing going on and we
would benefit from the different states running concurrently.

Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9464
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1cd7635b
Closes #7587
2018-06-04 14:56:06 -07:00
John Wren Kennedy ab44e511e2 OpenZFS 9245 - zfs-test failures: slog_013_pos and slog_014_pos
Test 13 would fail because of attempts to zpool destroy -f a pool that
was still busy. Changed those calls to destroy_pool which does a retry
loop, and the problem is no longer reproducible. Also removed some
non-functional code in the test, which had originally been disabled
by placing it after the call to log_pass.

Test 14 would fail because sometimes the check for a degraded pool would
complete before the pool had changed state. Changed the logic to check
in a loop with a timeout and the problem is no longer reproducible.

Authored by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting Notes:
* Re-enabled slog_013_pos.ksh

OpenZFS-issue: https://illumos.org/issues/9245
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8f323b5
Closes #7585
2018-06-04 14:55:02 -07:00
Pavel Zakharov 8a393be353 OpenZFS 9235 - rename zpool_rewind_policy_t to zpool_load_policy_t
We want to be able to pass various settings during import/open of a
pool, which are not only related to rewind. Instead of adding a new
policy and duplicating a bunch of code, we should just rename
rewind_policy to a more generic term like load_policy.

For instance, we'd like to set spa->spa_import_flags from the nvlist,
rather than from a flags parameter passed to spa_import as in some cases we
want those flags not only for the import case, but also for the open
case. One such flag could be ZFS_IMPORT_MISSING_LOG (as used in zdb)
which would allow zfs to open a pool when logs are missing.

Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9235
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d2b1e44
Closes #7532
2018-06-04 14:54:20 -07:00
bunder2015 85912983a4 Fix typos in zpool man page
Fixed some highlighting in the zpool man page

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #7596
2018-06-04 09:06:16 -07:00
Matthew Ahrens 1a5b96b8ee OpenZFS 9329 - panic in zap_leaf_lookup() due to concurrent zapification
For the null pointer issue shown below, the solution is to initialize the
contents of the object before changing its type, so that concurrent accessors
will see it as non-zapified until it is ready for access via the ZAP.

    BAD TRAP: type=e (#pf Page fault) rp=ffffff00ff520440 addr=20 occurred
    in module "zfs" due to a NULL pointer dereference

    ffffff00ff520320 unix:die+df ()
    ffffff00ff520430 unix:trap+dc0 ()
    ffffff00ff520440 unix:cmntrap+e6 ()
    ffffff00ff520590 zfs:zap_leaf_lookup+46 ()
    ffffff00ff520640 zfs:fzap_lookup+a9 ()
    ffffff00ff5206e0 zfs:zap_lookup_norm+111 ()
    ffffff00ff520730 zfs:zap_contains+42 ()
    ffffff00ff520760 zfs:dsl_dataset_has_resume_receive_state+47 ()
    ffffff00ff520900 zfs:get_receive_resume_stats+3e ()
    ffffff00ff520a90 zfs:dsl_dataset_stats+262 ()
    ffffff00ff520ac0 zfs:dmu_objset_stats+2b ()
    ffffff00ff520b10 zfs:zfs_ioc_objset_stats_impl+64 ()
    ffffff00ff520b60 zfs:zfs_ioc_objset_stats+33 ()
    ffffff00ff520bd0 zfs:zfs_ioc_dataset_list_next+140 ()
    ffffff00ff520c80 zfs:zfsdev_ioctl+4d7 ()
    ffffff00ff520cc0 genunix:cdev_ioctl+39 ()
    ffffff00ff520d10 specfs:spec_ioctl+60 ()
    ffffff00ff520da0 genunix:fop_ioctl+55 ()
    ffffff00ff520ec0 genunix:ioctl+9b ()
    ffffff00ff520f10 unix:brand_sys_sysenter+1c9 ()

Porting Notes:
* DMU_OT_BYTESWAP conditional in zap_lockdir_impl() kept.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9329
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/e8e0f97
Closes #7578
2018-05-31 10:53:49 -07:00
Matthew Ahrens d2a12f9e2a OpenZFS 9328 - zap code can take advantage of c99
The ZAP code was written before we allowed c99 in the Solaris kernel. We
should change it to take advantage of being able to declare variables where
they are first used. This reduces variable scope and means less scrolling
to find the type of variables.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9328
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/76ead05
Closes #7578
2018-05-31 10:53:11 -07:00
Sara Hartse 74d42600d8 zpool reopen should detect expanded devices
Update bdev_capacity to have wholedisk vdevs query the
size of the underlying block device (correcting for the size
of the EFI partition and partition alignment) and therefore detect
expanded space.

Correct vdev_get_stats_ex so that the expandsize is aligned
to metaslab size and new space is only reported if it is large
enough for a new metaslab.

Reviewed by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Wren Kennedy <jwk404@gmail.com>
Signed-off-by: sara hartse <sara.hartse@delphix.com>
External-issue: LX-165
Closes #7546 
Issue #7582
2018-05-31 10:36:37 -07:00
Matthew Ahrens d1f06ec5bc make install only works once
`make install` shouldn't fail if a directory it created still exists.
In this case we can blow away the spl src directory before recreating
it.  This also gracefully handles the migration from pre-spl-merge to
post-spl-merge.

Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7580
2018-05-31 09:19:59 -07:00
Antonio Russo 928046b744 Explicitly state supported Linux versions
Add META tags Linux-Maximum and Linux-Minimum.

One pain point for package maintainers is ensuring the compatibility of
the packaged version of ZFS with the Linux kernel. By providing an
authoritative compatibility guide in the source tree, maintainers can
automate compatibility checks.

Additionally, increase META string extraction specificity.
configure.ac finds Name and Version by a very simple `grep`, which might
conceivably find other fields. Require the string be at the beginning of
a line, and be followed by a colon to avoid such confusions.
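
The stricter match can be illustrated in C as follows. configure.ac
itself uses grep, so this is only a sketch of the rule, and
meta_line_has_key() is a hypothetical helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Return true only if the line begins with the key and the key is
 * immediately followed by a colon.
 */
static bool
meta_line_has_key(const char *line, const char *key)
{
	size_t len = strlen(key);

	assert(len > 0);
	return (strncmp(line, key, len) == 0 && line[len] == ':');
}
```

Under this rule a line such as "Name: zfs" matches the key "Name",
while a line that merely contains "Name" later on, or that starts with
a longer key such as "Names", does not.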

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com>
Closes #7571
2018-05-30 20:11:19 -07:00
John Wren Kennedy 93491c4bb9 OpenZFS 9082 - Add ZFS performance test targeting ZIL latency
This adds a new test to measure ZIL performance.

- Adds the ability to induce IO delays with zinject
- Adds a new variable (PERF_NTHREADS_PER_FS) to allow fio threads to
  be distributed to individual file systems as opposed to all IO going
  to one, as happens elsewhere.
- Refactoring of do_fio_run

Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: John Wren Kennedy <jwk404@gmail.com>

OpenZFS-issue: https://www.illumos.org/issues/9082
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/634
External-issue: DLPX-48625
Closes #7491
2018-05-30 11:59:04 -07:00
Tony Hutter c26cf0966d Fix zio->io_priority failed (7 < 6) assert
This fixes an assert in vdev_queue_change_io_priority():

  VERIFY3(zio->io_priority < ZIO_PRIORITY_NUM_QUEUEABLE) failed (7 < 6)
  PANIC at vdev_queue.c:832:vdev_queue_change_io_priority()

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7566 
Closes #7542
2018-05-29 18:13:48 -07:00
Steffen Müthing 3c28c63642 Install basename utility into dracut initramfs
vdev_id requires the program `basename` when handling short aliases
defined in `vdev_id.conf` (those defined without a leading path), but
`basename` is not always available in the dracut environment. This
causes the pool device names to change when using `by-vdev/` devices
or (in extreme cases) can make the pool import fail in dracut.

This commit fixes the problem by explicitly installing `basename`.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Steffen Müthing <steffen.muething@iwr.uni-heidelberg.de>
Closes #7562
2018-05-29 17:32:05 -07:00
Brian Behlendorf 93ce2b4ca5 Update build system and packaging
Minimal changes required to integrate the SPL sources in to the
ZFS repository build infrastructure and packaging.

Build system and packaging:
  * Renamed SPL_* autoconf m4 macros to ZFS_*.
  * Removed redundant SPL_* autoconf m4 macros.
  * Updated the RPM spec files to remove SPL package dependency.
  * The zfs package obsoletes the spl package, and the zfs-kmod
    package obsoletes the spl-kmod package.
  * The zfs-kmod-devel* packages were updated to add compatibility
    symlinks under /usr/src/spl-x.y.z until all dependent packages
    can be updated.  They will be removed in a future release.
  * Updated copy-builtin script for in-kernel builds.
  * Updated DKMS package to include the spl.ko.
  * Updated stale AUTHORS file to include all contributors.
  * Updated stale COPYRIGHT and included the SPL as an exception.
  * Renamed README.markdown to README.md
  * Renamed OPENSOLARIS.LICENSE to LICENSE.
  * Renamed DISCLAIMER to NOTICE.

Required code changes:
  * Removed redundant HAVE_SPL macro.
  * Removed _BOOT from nvpairs since it doesn't apply for Linux.
  * Initial header cleanup (removal of empty headers, refactoring).
  * Remove SPL repository clone/build from zimport.sh.
  * Use of DEFINE_RATELIMIT_STATE and DEFINE_SPINLOCK removed due
    to build issues when forcing C99 compilation.
  * Replaced legacy ACCESS_ONCE with READ_ONCE.
  * Include needed headers for `current` and `EXPORT_SYMBOL`.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
TEST_ZIMPORT_SKIP="yes"
Closes #7556
2018-05-29 16:00:33 -07:00
Brian Behlendorf 1272941f49 Merge branch 'zfsonlinux/merge-spl'
Merge a minimal version of the zfsonlinux/spl repository in to the
zfsonlinux/zfs repository.  Care was taken to prevent file conflicts
when merging and to preserve the spl repository history.  The spl
kernel module remains under the GPLv2 license as documented by the
additional THIRDPARTYLICENSE.gplv2 file.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2018-05-29 14:57:55 -07:00
Brian Behlendorf a91258913f Prepare SPL repo to merge with ZFS repo
This commit removes everything from the repository except the core
SPL implementation for Linux.  Those files which remain have been
moved to non-conflicting locations to facilitate the merge.
The README.md and associated files have been updated accordingly.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2018-05-29 14:51:39 -07:00
Antonio Russo 3e5300e0ed Support Debian DKMS builds
scripts/dkms.mkconf calls configure with
`--with-linux=${kernel_source_dir}`, but Debian puts its kernel source at
`/lib/modules/<version>/source`. This patch adds the same logic to the
DKMS file produced by `scripts/dkms.mkconf` that Debian has shipped in
its official ZFS packaging: at DKMS build time, it checks if the system
is a Debian system, and adjusts the path accordingly.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com>
Closes #7358 
Closes #7540 
Closes #7554
2018-05-26 10:56:24 -07:00
Jorgen Lundman 561ba8d1b1 OpenZFS 9523 - Large alloc in zdb can cause trouble
16MB alloc in zdb_embedded_block() can cause cores in certain
situations (clang, gcc55).

Authored by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting Notes:
* Replaces an equivalent fix previously made for Linux.

OpenZFS-issue: https://illumos.org/issues/9523
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/2c1964a
Closes #7561
2018-05-25 17:24:57 -07:00
Matthew Ahrens 0dc2f70c5c OpenZFS 9486 - reduce memory used by device removal on fragmented pools
Device removal allocates a new location for each allocated segment on
the disk that's being removed.  Each allocation results in one entry in
the mapping table, which maps from old location + length to new
location.  When a fragmented disk is removed, this can result in a large
number of mapping entries, and thus a large amount of memory consumed by
the mapping table.  In the worst real-world cases, we've seen around 1GB
of RAM per 1TB of storage removed.

We can improve on this situation by allocating larger segments, which
span across both allocated and free regions of the device being removed.
By including free regions in the allocation (and thus mapping), we
reduce the number of mapping entries.  For example, if we have a 4K
allocation followed by 1K free and then 4K allocated, we would allocate
4+1+4 = 9KB, and then move the entire region (including allocated and
free parts).  In this case we used one mapping where previously we would
have used two, but often the ratio is much higher (up to 20:1 in
real-world use).  We then need to mark the regions that were free on the
removing device as free in the new locations, and also obsolete in the
mapping entry.
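
The effect on the number of mapping entries can be sketched as below.
The seg_t type, count_mappings() and the max_gap parameter are
hypothetical names for this illustration and do not correspond to the
removal code or its tunables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical allocated segment on the device being removed. */
typedef struct seg {
	uint64_t s_start;	/* offset in bytes */
	uint64_t s_size;	/* length in bytes */
} seg_t;

/*
 * Count the mapping entries needed for segments sorted by offset when
 * neighbours separated by a free gap of at most max_gap bytes are
 * copied as one span.  With max_gap == 0 only adjacent segments merge.
 */
static size_t
count_mappings(const seg_t *segs, size_t nsegs, uint64_t max_gap)
{
	size_t entries = 0;
	uint64_t span_end = 0;

	for (size_t i = 0; i < nsegs; i++) {
		/* Segments must be sorted and must not overlap. */
		assert(i == 0 || segs[i].s_start >= span_end);
		if (i == 0 || segs[i].s_start - span_end > max_gap)
			entries++;	/* start a new span */
		span_end = segs[i].s_start + segs[i].s_size;
	}
	return (entries);
}
```

For the 4K allocated, 1K free, 4K allocated layout described above,
a max_gap of 0 yields two entries and a max_gap of 1K yields one.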

This method preserves the fragmentation of the removing device, rather
than consolidating its allocated space into a small number of chunks
where possible.  But it results in drastic reduction of memory used by
the mapping table - around 20x in the most-fragmented cases.

In the most fragmented real-world cases, this reduces memory used by the
mapping from ~1GB to ~50MB of RAM per 1TB of storage removed.  Less
fragmented cases will typically also see around 50-100MB of RAM per 1TB
of storage.

Porting notes:

* Add the following as module parameters:
    * zfs_condense_indirect_vdevs_enable
    * zfs_condense_max_obsolete_bytes

* Document the following module parameters:
   * zfs_condense_indirect_vdevs_enable
   * zfs_condense_max_obsolete_bytes
   * zfs_condense_min_mapping_bytes

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/9486
OpenZFS-commit: https://github.com/ahrens/illumos/commit/07152e142e44c
External-issue: DLPX-57962
Closes #7536
2018-05-24 10:18:07 -07:00
Tony Nguyen ba863d0be4 Profiling for perf tests
Stack profiling is quite useful and the Linux ZFS test suite does not
currently collect that data.

Linux perf is a common tool for this purpose though the perf record
data file can be quite large. With this change, Linux ZFS perf tests
capture perf record data if perf is installed on the system and
PERF_DO_PROFILING environment variable is set.

Reviewed by: John Wren Kennedy <jwk404@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tony Nguyen <tony.nguyen@delphix.com>
External-issue: LX-971
Closes #7549
2018-05-22 10:51:46 -07:00
Matthew Ahrens a430cef9cd Create "bin" directory so that zloop.sh works
Before running zloop.sh, we need to run `scripts/zfs-tests.sh -c` to
create and populate the `bin` directory with symlinks to our utilities.
Rather than making developers remember to do this, `make` should do it
for them.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7525 
Closes #7547
2018-05-21 10:36:59 -07:00
George Melikov 43eb39d6cc Small cleanup of PR and issue templates
- Add links to PULL_REQUEST_TEMPLATE.md
- Clean `System information` table

It's easier to find needed documentation about
PR process with links.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #7539
2018-05-15 09:02:57 -07:00
George Melikov f46209dd6b Update tests/README.md and fix markdown
- there are more options now
- command examples are more readable in code style

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #7538
2018-05-15 09:01:28 -07:00
Brian Behlendorf ab6a2b5cd7 ZTS: Improve zpool_scrub_004_pos reliability
It's possible for the `zpool attach` portion of this test case
to complete before the `zpool scrub` can be issued.  Update the
test case to force the resilvering phase to take longer.

Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5444 
Closes #7541
2018-05-15 08:58:46 -07:00
Brian Behlendorf 8c64fe0442 ZTS: Update O_TMPFILE support check
In CentOS 7.5 the kernel provided a compatibility wrapper to support
O_TMPFILE.  This results in the test setup script correctly detecting
kernel support.  But the ZFS module was built without O_TMPFILE
support due to the non-standard CentOS kernel interface.

Handle this case by updating the setup check to fail either when
the kernel or the ZFS module fail to provide support.  The reason
will be clearly logged in the test results.

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7528
2018-05-14 20:36:30 -07:00
Pavel Zakharov 38a19edd34 OpenZFS 9189 - Add debug to vdev_label_read_config when txg check fails
These changes were added to help debug issue #9187.

Essentially, in the original bug, vdev_validate() seems to fail in
vdev_label_read_config() and prints "failed reading config". This could
happen because either:
1. The labels are actually corrupt and zio_wait() fails for all of them
2. The labels were discarded because they didn't pass the txg check.

Beyond 9187, having debug info when case 2 happens could be useful in
other scenarios, such as zpool import.

Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Approved by: Matt Ahrens <mahrens@delphix.com>
Ported-by:  Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9189
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f6af1b7
Closes #7533
2018-05-14 14:32:49 -04:00
Pavel Zakharov db7d07e14b OpenZFS 9191 - dump vdev tree to zfs_dbgmsg when spa load fails due to missing log devices
Add vdev_print_tree() in spa_check_for_missing_logs() when some log
devices are missing to ease debugging.

Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9191
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c5c02e5
Closes #7531
2018-05-14 14:30:52 -04:00
Pavel Zakharov 189bd0b670 OpenZFS 9190 - Fix cleanup routine in import_cachefile_device_replaced.ksh
Must clear slow-disk zinject injections in test cleanup routine.
Otherwise, when this test fails, it causes most subsequent tests
to fail.

Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9190
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/762c6b4
Closes #7530
2018-05-14 14:30:28 -04:00
Pavel Zakharov a11c7aaec9 OpenZFS 9187 - racing condition between vdev label and spa_last_synced_txg in vdev_validate
ztest failed with uncorrectable IO error despite having the fix for
7163.  Both sides of the mirror have CANT_OPEN_BAD_LABEL, which also
distinguishes it from that issue.

Definitely seems like a racing condition between the vdev_validate
and spa_sync:
1. Thread A (spa_sync): vdev label is updated to latest txg
2. Thread B (vdev_validate): vdev label's txg is compared to
   spa_last_synced_txg and is ahead.
3. Thread A (spa_sync): spa_last_synced_txg is updated to latest txg.

Solution: do not check txg in vdev_validate unless config lock is held.
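
The interleaving above can be replayed deterministically. This is a toy
model only: the `Pool` class and `validate` function are hypothetical
stand-ins, and the sketch does not model why holding the config lock makes
the comparison safe. It only shows the check being skipped without it.

```python
class Pool:
    def __init__(self, txg):
        self.label_txg = txg        # txg recorded in the vdev label
        self.last_synced_txg = txg  # stands in for spa_last_synced_txg


def validate(pool, config_lock_held):
    """Return True if the label is accepted."""
    if not config_lock_held:
        return True  # skip the txg comparison entirely
    return pool.label_txg <= pool.last_synced_txg


pool = Pool(100)
pool.label_txg = 101                               # 1. Thread A updates the label
unguarded = pool.label_txg <= pool.last_synced_txg  # 2. Thread B compares: label is ahead
guarded = validate(pool, config_lock_held=False)    # 2'. with the fix, no comparison
pool.last_synced_txg = 101                          # 3. Thread A catches up
print(unguarded, guarded)
```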

Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9187
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/805fda72
Closes #7529
2018-05-14 14:28:09 -04:00
Brian Behlendorf b669ab83bb Ignore *.o.ur-safe build artifacts
Generated when building on Ubuntu 18.04.  Also ignore the new
dynamically generated zfs-mount-generator.8 man page, and the
module/.cache.mk file.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7534
2018-05-13 18:59:02 -07:00
Olaf Faaland bc5f51c5de module param callbacks check for initialized spa
Callbacks provided for module parameters are executed both
after the module is loaded, when a user alters it via sysfs, e.g.
	echo bar > /sys/module/zfs/parameters/foo

as well as when the module is loaded with an argument, e.g.
	modprobe zfs foo=bar

In the latter case, the init functions likely have not run yet,
including spa_init() which initializes the namespace lock so it is safe
to use.

Instead of immediately taking the namespace lock and attempting to
iterate over initialized spa structures, check whether spa_mode_global
is nonzero.  This is set by spa_init() after it has initialized the
namespace lock.
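
The guard described above follows a common pattern, sketched here with
hypothetical names (`Module`, `mode_global`, `set_param`); it is not the
kernel code, only an illustration of checking an "initialized" flag before
touching a lock that may not exist yet.

```python
import threading


class Module:
    def __init__(self):
        self.namespace_lock = None  # unusable until init() has run
        self.mode_global = 0        # stands in for spa_mode_global
        self.pools = []
        self.param = None

    def init(self):
        self.namespace_lock = threading.Lock()
        self.mode_global = 1        # set only after the lock is ready

    def set_param(self, value):
        """Parameter callback; returns the number of pools updated."""
        self.param = value
        if self.mode_global == 0:
            return 0                # load-time argument: no pools to update yet
        with self.namespace_lock:
            for pool in self.pools:
                pool["param"] = value
            return len(self.pools)


m = Module()
print(m.set_param("bar"))  # before init(): lock is never touched
m.init()
m.pools.append({})
print(m.set_param("baz"))  # after init(): per-pool state is updated
```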

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7496 
Closes #7521
2018-05-11 12:46:07 -07:00
Antonio Russo 68fded8146 Add canonical mount options zfs-mount-generator
lib/libzfs/libzfs_mount.c:zfs_add_options provides the canonical
mount options used by a `zfs mount` command. Because we cannot call
`zfs mount` directly from a systemd.mount unit, we mirror that logic
in zfs-mount-generator.

The zed script is updated to cache these properties as well.

Include a mini-tutorial in the manual page, properly substitute
configuration paths in zfs-mount-generator.8.in, and standardize the
Makefile.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com>
Closes #7453
2018-05-11 12:44:14 -07:00
bunder2015 29badadd4e Fix shebangs on import tests
Incorrect shebangs were used when porting.

Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #7523  
Closes #7524
2018-05-11 12:37:44 -07:00
Tim Chase d1043e2f6d Unify behavior of deadman parameters
The zfs_deadman_failmode, zfs_deadman_ziotime_ms and
zfs_deadman_synctime_ms parameters are stored per-pool.  However,
only zfs_deadman_failmode updates the per-pool state when it's
changed.  This patch adds the same behavior to the other two
for consistency.

Also, in all three cases, only update the per-pool parameters
if spa_init() has actually been called in order to avoid panicking
when trying to take a lock on the spa_namespace_lock mutex.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #7499
2018-05-08 21:45:47 -07:00
bunder2015 670d74b9ce ZTS: enospc_002 path cleanup
Removing hard-coded path used in enospc_002

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #7515
2018-05-08 21:42:58 -07:00
Tim Chase f3d28f0a59 Streamline the zpool_import tests
Don't create an ext4 file system atop $DEV_DISKDIR/$DISK2.
There's likely to not be sufficient space for it to succeed.
Instead, simply create the vdev files in the directory where it
would have been mounted.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7459
2018-05-08 21:40:13 -07:00
Tim Chase a0ad7ca54e Clear vdev_faulted
Clear vdev_faulted if ZPOOL_CONFIG_AUX_STATE is not set to "external"

ZoL supports "zpool offline -f" (force fault), which can be combined
with "-t" (temporary fault; don't persist across export/import) and
causes a MOS configuration to be set with ZPOOL_CONFIG_FAULTED=1
and without ZPOOL_CONFIG_AUX_STATE set at all.  In this case, the
previously-offlined vdev should be imported in an on-line state.
Clearing the "vdev_faulted" flag causes the import to treat the
device as on-line.  Typically, resilver will catch it up based on
its DTL.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #7459
2018-05-08 21:39:50 -07:00
Pavel Zakharov 6cb8e5306d OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery
Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:

	7638 Refactor spa_load_impl into several functions
	8961 SPA load/import should tell us why it failed
	7277 zdb should be able to print zfs_dbgmsg's

To iterate on top of that, there are a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.

The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.

The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.

When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
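
A minimal sketch of the read-side check described above, assuming a DVA can
be modeled as a (vdev_id, offset) pair. The function name and the exception
used to stand in for a panic are hypothetical.

```python
def readable_dvas(dvas, known_vdevs, config_trusted):
    """Return the DVAs that are safe to read from.

    dvas: iterable of (vdev_id, offset) pairs from a block pointer.
    known_vdevs: set of vdev ids present in the current vdev tree.
    """
    dvas = list(dvas)
    valid = [d for d in dvas if d[0] in known_vdevs]
    if len(valid) != len(dvas) and config_trusted:
        # With a trusted config an unknown vdev is treated as fatal.
        raise RuntimeError("block pointer references an unknown vdev")
    # With an untrusted config, simply do not issue reads to invalid DVAs.
    return valid


print(readable_dvas([(0, 4096), (7, 8192)], known_vdevs={0, 1},
                    config_trusted=False))
```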

This new two-step pool load process now allows rewinding pools across
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.

With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.

Porting notes (ZTS):
* Fix 'make dist' target in zpool_import

* The maximum path length allowed by tar is 99 characters.  Several
  of the new test cases exceeded this limit resulting in them not
  being included in the tarball.  Shorten the names slightly.

* Set/get tunables using accessor functions.

* Get last synced txg via the "zfs_txg_history" mechanism.

* Clear zinject handlers in cleanup for import_cache_device_replaced
  and import_rewind_device_replaced in order that the zpool can be
  exported if there is an error.

* Increase FILESIZE to 8G in zfs-test.sh to allow for a larger
  ext4 file system to be created on ZFS_DISK2.  Also, there's
  no need to partition ZFS_DISK2 at all.  The partitioning had
  already been disabled for multipath devices.  Among other things,
  the partitioning steals some space from the ext4 file system,
  makes it difficult to accurately calculate the parameters to
  parted and can make some of the tests fail.

* Increase FS_SIZE and FILE_SIZE in the zpool_import test
  configuration now that FILESIZE is larger.

* Write more data in order that device evacuation takes longer in
  a couple tests.

* Use mkdir -p to avoid errors when the directory already exists.

* Remove use of sudo in import_rewind_config_changed.
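
The tar path-length limit mentioned in the porting notes can be checked
ahead of time. The helper below is a hypothetical sketch that takes the
99-character figure from the note above at face value.

```python
TAR_NAME_LIMIT = 99  # limit quoted in the porting note above


def too_long(paths, limit=TAR_NAME_LIMIT):
    """Return the paths whose length exceeds the limit."""
    return [p for p in paths if len(p) > limit]


# A 99-character path passes, a 100-character path is flagged.
print([len(p) for p in too_long(["a" * 99, "b" * 100])])
```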

Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/9075
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123
Closes #7459
2018-05-08 21:35:27 -07:00
Pavel Zakharov afd2f7b711 OpenZFS 8962 - zdb should work on non-idle pools
Currently `zdb` consistently fails to examine non-idle pools as it
fails during the `spa_load()` process. The main problem seems to be
that `spa_load_verify()` fails as can be seen below:

    $ sudo zdb -d -G dcenter
    zdb: can't open 'dcenter': I/O error

    ZFS_DBGMSG(zdb):
    spa_open_common: opening dcenter
    spa_load(dcenter): LOADING
    disk vdev '/dev/dsk/c4t11d0s0': best uberblock found for spa dcenter. txg 40824950
    spa_load(dcenter): using uberblock with txg=40824950
    spa_load(dcenter): UNLOADING
    spa_load(dcenter): RELOADING
    spa_load(dcenter): LOADING
    disk vdev '/dev/dsk/c3t10d0s0': best uberblock found for spa dcenter. txg 40824952
    spa_load(dcenter): using uberblock with txg=40824952
    spa_load(dcenter): FAILED: spa_load_verify failed [error=5]
    spa_load(dcenter): UNLOADING

This change makes `spa_load_verify()` a dry run when run from
`zdb`. This is done by creating a global flag in zfs and then setting
it in `zdb`.

Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/8962
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/180ad792
Closes #7459
2018-05-08 21:32:57 -07:00
Pavel Zakharov 4a0ee12af8 OpenZFS 8961 - SPA load/import should tell us why it failed
Problem
=======

When we fail to open or import a storage pool, we typically don't
get any additional diagnostic information, just "no pool found" or
"can not import".

While there may be no additional user-consumable information, we should
at least make this situation easier to debug/diagnose for developers
and support.  For example, we could start by using `zfs_dbgmsg()`
to log each thing that we try when importing, and which things
failed. E.g. "tried uberblock of txg X from label Y of device Z". Also,
we could log each of the stages that we go through in `spa_load_impl()`.

Solution
========

Following the cleanup to `spa_load_impl()`, debug messages have been
added to every point of failure in that function. Additionally,
debug messages have been added to strategic places, such as
`vdev_disk_open()`.

Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/8961
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/418079e0
Closes #7459
2018-05-08 21:30:10 -07:00
Paul Dagnelie ca0845d59e OpenZFS 9256 - zfs send space estimation off by > 10% on some datasets
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

Porting Notes:
* Added tuning to man page.
* Test case changes dropped, default behavior unchanged.

OpenZFS-issue: https://www.illumos.org/issues/9256
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/32356b3c56
Closes #7470
2018-05-08 08:59:24 -07:00
LOLi 4ceb8dd6fd Fix 'zpool create -t <tempname>'
Creating a pool with a temporary name fails when we also specify custom
dataset properties: this is because we mistakenly call
zfs_set_prop_nvlist() on the "real" pool name which, as expected,
cannot be found because the SPA is present in the namespace with the
temporary name.

Fix this by specifying the correct pool name when setting the dataset
properties.

Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7502 
Closes #7509
2018-05-07 21:11:58 -07:00
Brian Behlendorf c02c1becce ZTS: Re-enable MMP tests
Commit 7fab6361 inadvertently disabled the MMP test cases by creating
and not removing an /etc/hostid file in the new zpool_split_props test
case.  When the file exists the ZTS skips the entire MMP test group
rather than modify what may be a system which is already configured.
Update the test case to remove the file.

Additionally, because the MMP tests were disabled a regression slipped
in as part of commit 9eb7b46ed0.  Fix it.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7514
2018-05-07 21:08:33 -07:00
Matthew Ahrens 1149b62d20 Update README: run autogen first
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #704
2018-05-07 10:12:25 -07:00
bunder2015 a82a4a15be ZTS: remove dead cleanup code from snapshot tests
Caught during path cleanups, the files referenced do not appear to be
created or used anywhere.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #7508
2018-05-06 21:02:10 -07:00
Tomohiro Kusumi e1245d83e9 Prevent make distclean removing 0 sized file
__init__.py used by Python packages typically has nothing in it
including contrib/pyzfs/libzfs_core/test/__init__.py, however this
causes `make distclean` to delete the file.

This is the only file with size 0, and it seems reasonable to have
a comment to avoid being deleted, rather than trying to modify
distclean behavior.

 # find . -size 0
 ./contrib/pyzfs/libzfs_core/test/__init__.py
 # ./autogen.sh ; ./configure ; make -j8
 # make distclean
 # ls contrib/pyzfs/libzfs_core/test/__init__.py
 ls: cannot access 'contrib/pyzfs/libzfs_core/test/__init__.py':
     No such file or directory
 # git diff
 diff --git a/contrib/pyzfs/libzfs_core/test/__init__.py
     b/contrib/pyzfs/libzfs_core/test/__init__.py
 deleted file mode 100644

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #7505
2018-05-06 20:46:13 -07:00
LOLi fb0da71fd9 ZTS: Fix zfs_diff_timestamp
When using mawk instead of gawk zfs_diff_timestamp fails consistently:
this is due to a subtle difference in how mawk handles substr().

From awk(1):
---
Finally, here is how mawk handles exceptional cases not discussed in
the AWK book or the Posix draft.  It is unsafe to assume consistency
across awks and safe to skip to the next section.
substr(s,  i,  n) returns the characters of s in the intersection of
the closed interval [1, length(s)] and the half-open interval [i, i+n).
When this intersection is empty, the empty string is returned; so
substr("ABC", 1, 0) = "" and substr("ABC", -4, 6) = "A".
---

To support running zfs_diff_timestamp with both gawk and mawk change
the second parameter passed to substr().
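
The mawk rule quoted above can be restated as a small function. This sketch
only models the documented interval intersection for integer arguments; it
makes no claim about how gawk treats the same calls.

```python
def mawk_substr(s: str, i: int, n: int) -> str:
    """Characters of s in [1, len(s)] intersected with [i, i+n), 1-based."""
    start = max(1, i)             # inclusive, 1-based
    end = min(len(s) + 1, i + n)  # exclusive, 1-based
    if start >= end:
        return ""                 # empty intersection
    return s[start - 1:end - 1]


# The two examples from the man page excerpt.
print(repr(mawk_substr("ABC", 1, 0)))
print(repr(mawk_substr("ABC", -4, 6)))
```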

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7503 
Closes #7510
2018-05-06 20:42:29 -07:00
Paul Dagnelie 64c1dcefe3 OpenZFS 9421, 9422 - zdb show possibly leaked objects
9421 zdb should detect and print out the number of "leaked" objects
9422 zfs diff and zdb should explicitly mark objects that are on
     the deleted queue

It is possible for zfs to "leak" objects in such a way that they are not
freed, but are also not accessible via the POSIX interface. As the only
way to know that this has happened is to see one of them directly in a
zdb run, or by noting unaccounted space usage, zdb should be enhanced to
count these objects and return failure if some are detected.

We have access to the delete queue through the zfs_get_deleteq function;
we should call it in dump_znode to determine if the object is on the
delete queue. This is not the most efficient possible method, but it is
the simplest to implement, and should suffice for the common case where
there are few objects on the delete queue.

Also zfs diff and zdb currently traverse every single dnode in a dataset
and try to figure out the path of the object by following its parent.
When an object is placed on the delete queue, for all practical purposes
it's already discarded, its parent might not exist anymore, and another
object might now have the object number that belonged to the parent.
While all of the above makes sense, when trying to figure out the path
of an object that is on the delete queue, we can run into issues where
either it is impossible to determine the path because the parent is
gone, or another dnode has taken its place and thus we are returned a
wrong path.

We should therefore avoid trying to determine the path of an object on
the delete queue and mark the object itself as being on the delete queue
to avoid confusion. To achieve this, we currently have two ideas:

1. When putting an object on the delete queue, change its parent object
   number to a known constant that means NULL.

2. When displaying objects, first check if it is present on the delete
   queue.
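
The second idea can be sketched as below. The data model (dictionaries for
parent links and names, a set for the delete queue) is purely hypothetical
and is not how zdb or zfs diff store this information.

```python
def object_path(obj, parents, names, delete_queue):
    """Resolve an object's path, unless it is on the delete queue.

    parents: object number -> parent object number (None for the root).
    names: object number -> name within its parent.
    """
    if obj in delete_queue:
        # Parent links may be stale or reused, so do not follow them.
        return "(on delete queue)"
    parts = []
    while parents.get(obj) is not None:
        parts.append(names[obj])
        obj = parents[obj]
    return "/" + "/".join(reversed(parts))


parents = {1: None, 2: 1, 3: 2}
names = {2: "dir", 3: "file"}
print(object_path(3, parents, names, delete_queue=set()))
print(object_path(3, parents, names, delete_queue={3}))
```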

Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9421
OpenZFS-issue: https://illumos.org/issues/9422
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45ae0dd9ca
Closes #7500
2018-05-04 10:50:24 -07:00
Matthew Ahrens 5e097c67f1 OpenZFS 9443 - panic when scrub a v10 pool
While expanding stored pools, we ran into a panic using an old pool.

Steps to reproduce:

    $ sudo zpool create -o version=2 test c2t1d0
    $ sudo cp /etc/passwd /test/foo
    $ sudo zpool attach test c2t1d0 c2t2d0

We'll get this panic:

    ffffff000fc0e5e0 unix:real_mode_stop_cpu_stage2_end+b27c ()
    ffffff000fc0e6f0 unix:trap+dc8 ()
    ffffff000fc0e700 unix:cmntrap+e6 ()
    ffffff000fc0e860 zfs:dsl_scan_visitds+1ff ()
    ffffff000fc0ea20 zfs:dsl_scan_visit+fe ()
    ffffff000fc0ea80 zfs:dsl_scan_sync+1b3 ()
    ffffff000fc0eb60 zfs:spa_sync+435 ()
    ffffff000fc0ec20 zfs:txg_sync_thread+23f ()
    ffffff000fc0ec30 unix:thread_start+8 ()

The problem is a bad trap accessing a NULL pointer. We're looking for
the dp_origin_snap of a dsl_pool_t, but version 2 didn't have that. The
system will go into a reboot loop at this point, and the dump won't be
accessible except by removing the cache file from within the recovery
environment.

This impacts any sort of scrub or resilver on version <11 pools, e.g.:

    $ zpool create -o version=10 test c2t1d0
    $ zpool scrub test

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9443
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/010eed29
Closes #7501
2018-05-04 10:47:10 -07:00
DeHackEd 609b242542 Fix inverted check for --enable-pyzfs
The --{en,dis}able-pyzfs check is backwards. Fix that.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: DHE <git@dehacked.net>
Closes #7493
2018-05-03 11:10:26 -07:00
Tony Hutter 1a62a305be Fedora 28: Add BuildRequires: libtirpc-devel
Add "BuildRequires: libtirpc-devel" to fix mock builds on Fedora 28.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7494 
Closes #7495
2018-05-03 10:47:46 -07:00
Tom Caputi be9a5c355c Add support for decryption faults in zinject
This patch adds the ability for zinject to trigger decryption
and authentication faults in the ZIO and ARC layers. This
functionality is exposed via the new "decrypt" error type, which
may be provided for "data" object types.

This patch also refactors some of the core encryption / decryption
functions so that they have consistent prototypes, handle errors
consistently, and do not have unused arguments.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7474
2018-05-02 15:36:20 -07:00
Brian Behlendorf 84a80d5f2d Fix undefined RPM macros
Always invoke the SPL_AC_DEBUG* macros when running configure
so RPM_DEFINE_COMMON is correctly expanded.  A similar change
was already applied to ZFS.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #703
2018-05-02 15:34:20 -07:00
Brian Behlendorf 9464b9591e RHEL 7.5 compat: FMODE_KABI_ITERATE
As of RHEL 7.5 the mainline fops.iterate() method was added to
the file_operations structure and is correctly detected by the
configure script.

Normally this is what we want, but in order to maintain KABI
compatibility the RHEL change additionally does the following:

* Requires that callers intending to use this extended interface
  set the FMODE_KABI_ITERATE flag on the file structure when
  opening the directory.
* Adds the fops.iterate() method to the end of the structure,
  without removing fops.readdir().

This change updates the configure check to ignore the RHEL 7.5+
variant of fops.iterate() when detected.  Instead fall back to
the fops.readdir() interface which will be available.

Finally, add the 'zpl_' prefix to the directory context wrappers
to avoid colliding with the kernel provided symbols when both
the fops.iterate() and fops.readdir() are provided by the kernel.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7460 
Closes #7463
2018-05-02 15:01:24 -07:00
Brian Behlendorf bc8a6a60e9 Fix inst_num overflow in qat_crypt.c
This patch fixes the same issue which was previously addressed in
6051.  The variable "inst_num" was of the incorrect type and
"atomic_inc_32_nv()" could cause an overflow damaging its neighbor.

Cast the return value of atomic_inc_32_nv() to Cpa32U.
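
The kind of damage described above can be illustrated outside the kernel.
The sketch assumes, purely for illustration, a 16-bit counter stored next
to another 16-bit field in little-endian layout; the commit message does
not state the actual widths or layout in qat_crypt.c.

```python
import struct


def increment_as_u32(buf: bytearray, offset: int = 0) -> None:
    """Increment the 32-bit little-endian value at offset, in place."""
    (value,) = struct.unpack_from("<I", buf, offset)
    struct.pack_into("<I", buf, offset, (value + 1) & 0xFFFFFFFF)


# Two adjacent 16-bit fields: a counter at its maximum, then a neighbor.
fields = bytearray(struct.pack("<HH", 0xFFFF, 0x1234))
increment_as_u32(fields)  # a 32-bit increment applied to the 16-bit counter
counter, neighbor = struct.unpack("<HH", fields)
print(hex(counter), hex(neighbor))  # the carry has spilled into the neighbor
```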

Fix a few types for num_inst for clarity.

Reviewed-by: Weigang Li <weigang.li@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7468
2018-05-01 20:44:24 -07:00
Tom Caputi 2c24b5b148 Fix issues found with zfs diff
Two deadlocks / ASSERT failures were introduced in a2c2ed1b which
would occur whenever arc_buf_fill() failed to decrypt a block of
data. This occurred because the call to arc_buf_destroy() which
was responsible for cleaning up the newly created buffer would
attempt to take out the hdr lock that it was already holding. This
was resolved by calling the underlying functions directly without
retaking the lock.

In addition, the dmu_diff() code did not properly ensure that keys
were loaded and mapped before beginning dataset traversal. It turns
out that this code does not need to look at any encrypted values,
so the code was altered to perform raw IO only.
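
The first fix can be sketched in Python, assuming a non-reentrant lock.
The Header class and its method names are hypothetical stand-ins, not
the ARC functions named above; the point is only that the error path
calls the lock-free helper instead of the public entry point that takes
the lock again.

```python
import threading


class Header:
    """Hypothetical stand-in for a header guarded by a non-reentrant lock."""

    def __init__(self):
        self.lock = threading.Lock()
        self.bufs = []

    def _destroy_buf_locked(self, buf):
        # Caller must already hold self.lock.
        self.bufs.remove(buf)

    def destroy_buf(self, buf):
        # Public entry point: takes the lock itself.
        with self.lock:
            self._destroy_buf_locked(buf)

    def fill_buf(self, buf, decrypt):
        with self.lock:
            self.bufs.append(buf)
            try:
                return decrypt(buf)
            except ValueError:
                # Calling self.destroy_buf(buf) here would block forever,
                # because self.lock is already held and is not reentrant.
                self._destroy_buf_locked(buf)
                raise


def failing_decrypt(buf):
    raise ValueError("authentication failed")


hdr = Header()
try:
    hdr.fill_buf(b"ciphertext", failing_decrypt)
    outcome = "filled"
except ValueError:
    outcome = "cleaned up"
```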

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7354 
Closes #7456
2018-05-01 11:24:20 -07:00
Tomohiro Kusumi d6133fc500 Silence compile-time warning on unused variable
ASSERT3U() could be a no-op, which then leaves the pointer *spa unused.

metaslab.c: In function 'metaslab_condense':
metaslab.c:2075:9: warning: unused variable 'spa' [-Wunused-variable]
  spa_t *spa = msp->ms_group->mg_vd->vdev_spa;

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #7489
2018-05-01 11:15:54 -07:00
loli10K 85ce3f4fd1 Adopt pyzfs from ClusterHQ
This commit introduces several changes:

 * Update LICENSE and project information

 * Give a good PEP8 talk to existing Python source code

 * Add RPM/DEB packaging for pyzfs

 * Fix some outstanding issues with the existing pyzfs code caused by
   changes in the ABI since the last time the code was updated

 * Integrate pyzfs Python unittest with the ZFS Test Suite

 * Add missing libzfs_core functions: lzc_change_key,
   lzc_channel_program, lzc_channel_program_nosync, lzc_load_key,
   lzc_receive_one, lzc_receive_resumable, lzc_receive_with_cmdprops,
   lzc_receive_with_header, lzc_reopen, lzc_send_resume, lzc_sync,
   lzc_unload_key, lzc_remap

Note: this commit slightly changes the zfs_ioc_unload_key() ABI. This allows
us to differentiate the case where we tried to unload a key on a
non-existing dataset (ENOENT) from the situation where a dataset has
no key loaded: this is consistent with the "change" case where trying
to zfs_ioc_change_key() from a dataset with no key results in EACCES.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7230
2018-05-01 10:33:35 -07:00
Andriy Gapon 6abf922574 Import pyzfs source code from ClusterHQ
libzfs_core is intended to be a stable interface for programmatic
administration of ZFS.

This wrapper provides one-to-one wrappers for libzfs_core API functions,
but the signatures and types are more natural to Python.
nvlists are wrapped as dictionaries or lists depending on their usage.
Some parameters have default values depending on typical use for
increased convenience.
Enumerations and bit flags become strings and lists of strings in
Python.
Errors are reported as exceptions rather than integer errno-style
error codes.  The wrapper takes care to provide one-to-many mapping
of the error codes to the exceptions by interpreting a context
in which the error code is produced.
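
The one-to-many mapping described above can be sketched as follows.
The exception classes, the operation names and the translate() helper
are hypothetical and are not the names used by pyzfs; only the idea of
choosing an exception from the error code plus its context is taken
from the description.

```python
import errno
import os


class ZFSError(Exception):
    """Hypothetical base class for this sketch."""


class FilesystemExists(ZFSError):
    pass


class SnapshotExists(ZFSError):
    pass


class ParentNotFound(ZFSError):
    pass


class DatasetNotFound(ZFSError):
    pass


# The same errno maps to different exceptions depending on the operation.
_BY_CONTEXT = {
    ("create", errno.EEXIST): FilesystemExists,
    ("snapshot", errno.EEXIST): SnapshotExists,
    ("create", errno.ENOENT): ParentNotFound,
    ("snapshot", errno.ENOENT): DatasetNotFound,
}


def translate(operation, code, name):
    """Return an exception for a non-zero error code, or None on success."""
    if code == 0:
        return None
    exc_type = _BY_CONTEXT.get((operation, code), ZFSError)
    return exc_type("%s failed for %r: %s" % (operation, name, os.strerror(code)))


create_error = translate("create", errno.EEXIST, "pool/fs")
snapshot_error = translate("snapshot", errno.EEXIST, "pool/fs@snap")
```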

Unit tests and automated tests for the libzfs_core API are provided
with this package.

Please note that the API tests perform lots of ZFS dataset level
operations and ZFS tries hard to ensure that any modifications
do reach stable storage. That means that the operations are done
synchronously and that, for example, disk caches are flushed.
Thus, the tests can be very slow on real hardware.
It is recommended to place the default temporary directory or
a temporary directory specified by, for instance, TMP environment
variable on a memory backed filesystem.

Original-patch-by: Andriy Gapon <avg@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7230
2018-05-01 10:31:11 -07:00
LOLi 3cbe89b12a Fix zfs incremental send remove '-o' properties
When receiving an incremental send stream with intermediary snapshots
zfs_receive_one() does not correctly identify the top-level dataset:
consequently we restore said snapshots as if they were children
datasets in the hierarchy, forcing inheritance of any property received
with 'zfs send -o' and effectively removing any locally set value.

The test case did not correctly verify this situation because it uses
adjacent snapshots, basically testing 'zfs send -i' instead of
'zfs send -I': this commit adds an additional intermediary snapshot to
the test script.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7478
2018-04-30 20:58:29 -07:00
Alexander Motin 20507534d4 OpenZFS 9434 - Speculative prefetch is blocked by device removal code
Device removal code does not set spa_indirect_vdevs_loaded for pools
that never experienced device removal.  At least one visible consequence
of this is a completely blocked speculative prefetcher.  This patch sets
the variable in such situations.

Authored by: Alexander Motin <mav@FreeBSD.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9434
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/16127b627b
Closes #7480
2018-04-30 13:05:55 -07:00
George Melikov eb201f50ac Add back iostat -y or -w descriptions
The iostat -y and -w descriptions were dropped in cda0317e;
restore them.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #7479 
Closes #7483
2018-04-30 13:42:58 -05:00
Antonio Russo c83ccb3e72 Add test with two kinds of file creation orders
Data loss was identified in #7401 when many small files were copied.
This adds a reproducer for this bug and other similar ones: randomly
generate N files. Then, listing M of them by `ls -U` order, produce
those same files in a directory of the same name.

This triggers the bug consistently, provided N and M are large enough.
Here, N=2^16 and M=2^13.
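
A scaled-down Python sketch of the reproducer's shape, not the test
added by this commit. It assumes that os.scandir() yields names in
on-disk directory order, comparable to `ls -U`; the function name and
the small sizes are illustrative only.

```python
import os
import random
import tempfile


def reproduce(n_files, m_files, seed=0):
    """Create n_files, re-create the first m_files (in directory order)
    elsewhere, and return the names that went missing."""
    rng = random.Random(seed)
    with tempfile.TemporaryDirectory() as root:
        src = os.path.join(root, "src")
        dst = os.path.join(root, "dst")
        os.mkdir(src)
        os.mkdir(dst)

        names = set()
        while len(names) < n_files:
            names.add("%016x" % rng.getrandbits(64))
        for name in names:
            with open(os.path.join(src, name), "w"):
                pass

        # Directory order, i.e. the order a readdir-based copy would use.
        with os.scandir(src) as entries:
            ordered = [entry.name for entry in entries][:m_files]
        for name in ordered:
            with open(os.path.join(dst, name), "w"):
                pass

        present = set(os.listdir(dst))
        return [name for name in ordered if name not in present]


missing = reproduce(n_files=256, m_files=32)
```

On a healthy filesystem the returned list is empty; any name in it
would be a file that was created but cannot be found afterwards.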

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com>
Closes #7411
2018-04-30 12:45:47 -05:00
Matthew Ahrens 964c2d69a9 OpenZFS 9236 - nuke spa_dbgmsg
We should use zfs_dbgmsg instead of spa_dbgmsg. Or at least,
metaslab_condense() should call zfs_dbgmsg because it's important and
rare enough to always log. It's possible that the message in
zio_dva_allocate() would be too high-frequency for zfs_dbgmsg.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

Patch Notes:
* Removed ZFS_DEBUG_SPA from zfs-module-parameters.5

OpenZFS-issue: https://www.illumos.org/issues/9236
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/cfaba7f668
Closes #7467
2018-04-30 10:19:48 -07:00
Mark Wright 089500e792 Fix CONFIG_GCC_PLUGIN_RANDSTRUCT build
Fix build errors with gcc 7.3.0 on Gentoo with kernel 4.16.3
built with CONFIG_GCC_PLUGIN_RANDSTRUCT=y such as:

module/zfs/vdev_indirect.c:296:2: error:
positional initialization of field in ‘struct’ declared with
‘designated_init’ attribute [-Werror=designated-init]
  vdev_indirect_map_free,
  ^~~~~~~~~~~~~~~~~~~~~~

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Mark Wright <gienah@gentoo.org>
Closes #7464
2018-04-20 09:53:25 -07:00
LOLi b4555c777a Fix 'zfs remap <poolname@snapname>'
Only filesystems and volumes are valid 'zfs remap' parameters: when
passed a snapshot name zfs_remap_indirects() does not handle the
EINVAL returned from libzfs_core, which results in failing an assertion
and consequently crashing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7454
2018-04-19 09:45:17 -07:00
Chunwei Chen 599b864813 Fix ENOSPC in "Handle zap_add() failures in ..."
Commit cc63068 caused an ENOSPC error when copying a large number of files
between two directories. The reason is that the patch limits zap leaf
expansion to 2 retries and returns ENOSPC when they fail.

The intent for limiting retries is to prevent pointlessly growing table
to max size when adding a block full of entries with same name in
different case in mixed mode. However, it turns out we cannot use any
limit on the retry. When we copy files from one directory in readdir
order, we are copying in hash order, one leaf block at a time, which
means that if the leaf block in the source directory has expanded 6 times,
and you copy those entries in that block, by the time you need to expand
the leaf in destination directory, you need to expand it 6 times in one
go. So any limit on the retry will result in error where it shouldn't.

Note that while we do use different salt for different directories, it
seems that the salt/hash function doesn't provide enough randomization
to the hash distance to prevent this from happening.
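
The effect can be reproduced with a toy extendible-hash model in
Python. This is a sketch under stated assumptions, not the ZAP
implementation: hashes are 16 bits wide, a leaf holds 4 entries, and
ToyZap, its fields and max_retries are hypothetical names. Entries that
arrive in hash order share a long prefix, so a single add may need many
consecutive splits.

```python
import errno

HASH_BITS = 16      # assumed hash width for the toy model
LEAF_CAPACITY = 4   # assumed entries per leaf


class ToyZap:
    def __init__(self, max_retries=None):
        self.max_retries = max_retries
        # Maps (prefix_len, prefix) -> hashes stored in that leaf.
        self.leaves = {(0, 0): []}

    def _find_leaf(self, h):
        for prefix_len in range(HASH_BITS + 1):
            key = (prefix_len, h >> (HASH_BITS - prefix_len))
            if key in self.leaves:
                return key
        raise KeyError(h)

    def add(self, h):
        """Insert hash h and return how many leaf splits it needed."""
        splits = 0
        while True:
            key = self._find_leaf(h)
            leaf = self.leaves[key]
            if len(leaf) < LEAF_CAPACITY:
                leaf.append(h)
                return splits
            if self.max_retries is not None and splits >= self.max_retries:
                raise OSError(errno.ENOSPC, "leaf expansion retry limit hit")
            # Split the full leaf on the next hash bit and retry.
            prefix_len, prefix = key
            shift = HASH_BITS - (prefix_len + 1)
            del self.leaves[key]
            for child in (prefix << 1, (prefix << 1) | 1):
                self.leaves[(prefix_len + 1, child)] = [
                    e for e in leaf if (e >> shift) == child
                ]
            splits += 1


def copy_in_hash_order(hashes, max_retries=None):
    zap = ToyZap(max_retries)
    return [zap.add(h) for h in sorted(hashes)]


# Five entries that differ only in their low bits, as if read from one
# deeply split source leaf.
source_leaf = [0, 1, 2, 3, 4]
unlimited = copy_in_hash_order(source_leaf)

try:
    copy_in_hash_order(source_leaf, max_retries=2)
    limited = "ok"
except OSError as err:
    limited = errno.errorcode[err.errno]
```

Without a limit the fifth insert succeeds after a long run of splits;
with a limit of 2 the same insert fails even though space is available.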

Since cc63068 has already been reverted, this patch adds it back and
removes the retry limit.

Also, as it turns out, failing on zap_add() has a serious side effect for
mzap_upgrade(). When upgrading from micro zap to fat zap, it will
call zap_add() to transfer entries one at a time. If it hits any error
halfway through, the remaining entries will be lost, causing those files
to become orphaned. This patch adds a VERIFY to catch it.

Reviewed-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Albert Lee <trisk@forkgnu.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #7401 
Closes #7421
2018-04-18 14:19:50 -07:00
Tom Caputi b0ee5946aa Fix issues with raw sends of spill blocks
This patch fixes 2 issues in how spill blocks are processed during
raw sends. The first problem is that compressed spill blocks were
using the logical length rather than the physical length to
determine how much data to dump into the send stream. The second
issue is a typo that caused the spill record's object number to be
used where the objset's ID number was required. Both issues have
been corrected, and the payload_size is now printed in zstreamdump
for future debugging.
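
The first issue comes down to which length describes the bytes that
are actually placed in the stream. A minimal Python illustration, using
zlib as a stand-in compressor and a hypothetical payload_size() helper:

```python
import zlib

logical_data = b"xattr=value;" * 256         # what a reader sees
physical_data = zlib.compress(logical_data)  # what is stored on disk


def payload_size(logical, physical, raw):
    """Bytes to emit for a block: a raw send ships the on-disk bytes."""
    return len(physical) if raw else len(logical)


raw_size = payload_size(logical_data, physical_data, raw=True)
plain_size = payload_size(logical_data, physical_data, raw=False)
```

Using the logical length for a raw send would read past the end of the
compressed buffer, since the on-disk form is the shorter of the two.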

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7378 
Closes #7432
2018-04-17 11:19:03 -07:00
Tom Caputi e14a32b1c8 Fix object reclaim when using large dnodes
Currently, when the receive_object() code wants to reclaim an
object, it always assumes that the dnode is the legacy 512 bytes,
even when the incoming bonus buffer exceeds this length. This
causes a buffer overflow if --enable-debug is not provided and
triggers an ASSERT if it is. This patch resolves this issue and
adds an ASSERT to ensure this can't happen again.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7097
Closes #7433
2018-04-17 11:13:57 -07:00
Matthew Ahrens 0c03d21ac9 assertion in arc_release() during encrypted receive
In the existing code, when doing a raw (encrypted) zfs receive, 
we call arc_convert_to_raw() from open context. This creates a 
race condition between arc_release()/arc_change_state() and 
writing out the block from syncing context (arc_write_ready/done()).

This change makes it so that when we are doing a raw (encrypted) 
zfs receive, we save the crypt parameters (salt, iv, mac) of dnode 
blocks in the dbuf_dirty_record_t, and call arc_convert_to_raw() 
from syncing context when writing out the block of dnodes.

Additionally, we can eliminate dr_raw and associated setters, and 
instead know that dnode blocks are always raw when doing a zfs 
receive (see the new field os_raw_receive).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7424 
Closes #7429
2018-04-17 11:06:54 -07:00
bunder2015 b40d45bc6c ZTS: fix reservation_013_pos integer overflow
When using large disks the integers for calculating sizes can
overflow past 2**31.  Changing to long integers with typeset
should correct this.
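
The wraparound is easy to see by forcing a size calculation through
signed 32-bit arithmetic. The to_int32() helper below is a hypothetical
Python emulation, not part of the test suite:

```python
def to_int32(value):
    """Wrap an integer the way signed 32-bit arithmetic would."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


three_gib = 3 * 1024 ** 3
wrapped = to_int32(three_gib)   # negative: the value exceeded 2**31 - 1
exact = three_gib               # a wider integer keeps the value
```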

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #4444 
Closes #7451
2018-04-17 10:52:53 -07:00
Matthew Ahrens 7f96cc23ac OpenZFS 9192 - explicitly pass good_writes to vdev_uberblock/label_sync
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

Currently vdev_label_sync and vdev_uberblock_sync take a zio_t and assume
that its io_private is a pointer to the good_writes count. They should
instead accept this argument explicitly.

OpenZFS-issue: https://www.illumos.org/issues/9192
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3f4c0b602d
Closes #7446
2018-04-17 10:45:47 -07:00
Matt Ahrens d830d4795a OpenZFS 9280 - Assertion failure while running removal_with_ganging test with 4K devices
Authored by: Matt Ahrens <Matt.Ahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9280
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/243952c
Closes #7445
2018-04-17 10:44:50 -07:00
Tony Hutter 73d08ace52 Exclude python scripts from RPM shebang check
The newest Fedora packaging rules print warnings for scripts using the
/usr/bin/python shebang:

  *** WARNING: mangling shebang in /usr/src/spl-0.7.0/cmd/splslab/splslab.py
  from #!/usr/bin/python to #!/usr/bin/python2. This will become an ERROR,
  fix it manually!

Fedora wants all cross compatible scripts to pick python3.  Since we
don't want our users to have to pick a specific version of python, we
exclude our scripts from the RPM build check.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes: #699 
Closes: #700
2018-04-16 15:40:14 -07:00
megari d68ac65eb6 Revert "OpenZFS 9036 - zfs: duplicate 'const' declaration specifier"
This reverts commit cbb8933215.

The original change in OpenZFS 9036 did remove duplicate 'const'
specifiers, but the ZoL port had already done what *should* have been
done in OpenZFS 9036, which is to make the pointers themselves const.
The port of the change to ZoL ended up doing an unnecessary removal
of the constness of the pointers. Undo that.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Ari Sundholm <ari@tuxera.com>
Closes #7444
2018-04-16 12:44:40 -07:00
Pavel Zakharov 9eb7b46ed0 OpenZFS 7638 - Refactor spa_load_impl into several functions
Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7638
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1fd3785ff6
Closes #7437
2018-04-16 12:24:23 -07:00
bunder2015 3eba666332 ZTS: zpool_create_002 clean up leftover filedisk
zpool_create_002_pos did not clean up filedisk files left over from
running the test.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #7435
Closes #7439
2018-04-15 15:17:44 -07:00
Tim Chase 5284f43a1e Avoid Linux hung task message in ZTHR
Use an interruptible wait to avoid the Linux hung task message in
ZTHR and to prevent inflating the load average.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #7440 
Closes #7441
2018-04-15 15:12:28 -07:00
Toomas Soome 5e567da987 OpenZFS 9213 - zfs: sytem typo
Authored by: Toomas Soome <tsoome@me.com>
Reviewed by: C Fraire <cfraire@me.com>
Reviewed by: Andy Fiddaman <omnios@citrus-it.co.uk>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting Notes:
* The additional instances of this typo addressed in the OpenZFS
  patch were already resolved.

OpenZFS-issue: https://illumos.org/issues/9213
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/edc8ef7d92
Closes #7436
2018-04-15 10:59:13 -07:00
Toomas Soome cbb8933215 OpenZFS 9036 - zfs: duplicate 'const' declaration specifier
Authored by: Toomas Soome <tsoome@me.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9036
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f02c28e434
Closes #6900
2018-04-14 12:40:52 -07:00
Prakash Surya eecdd8e884 OpenZFS 9084 - spa_*_ashift must ignore spare devices
It's possible for the following assertion to be tripped when
running ztest:

    assertion failed for thread 0xf09fca40, thread-id 549:
    spa->spa_max_ashift == spa->spa_min_ashift (0xc == 0x9),
    file ../../../uts/common/fs/zfs/vdev_removal.c, line 965

    > $c
    libc.so.1`_lwp_kill+7(ebdde6c0, ebdde6c0, a9, fee7865e)
    libc.so.1`_assfail+0x214(ebddea28, fed7ac3c, 3c5, fef62000)
    libc.so.1`assfail3+0xde(fed7b130, c, 0, fed812cb, 9, 0)
    libzpool.so.1`spa_vdev_copy_impl+0x26b(89a4b40, ebddef74,
        ebddef68, 8992dc0, ebe10a00, fef073c0)
    libzpool.so.1`spa_vdev_remove_thread+0x6cd(87450c0, 0, 0, fee8f43a)
    libc.so.1`_thrp_setup+0x8c(f09fca40)
    libc.so.1`_lwp_start(f09fca40, 0, 0, 0, 0, 0)

    > ::spa -v
    ADDR         STATE NAME
    08723000    ACTIVE ztest

        ADDR     STATE     AUX          DESCRIPTION
        087466c0 HEALTHY   -            root
        087450c0 HEALTHY   -              /rpool/tmp/ztest.0a
        08745640 HEALTHY   -              indirect
        08745bc0 HEALTHY   -              /rpool/tmp/ztest.2a
        08746140 HEALTHY   -              /rpool/tmp/ztest.3a
        -        -         -            spares
        08744b40 HEALTHY   -              /rpool/tmp/ztest.spares.0

Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://www.illumos.org/issues/9084
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/18acba7
Closes #6900
2018-04-14 12:40:52 -07:00
Serapheim Dimitropoulos 4bf8108ede OpenZFS 9080 - recursive enter of vdev_indirect_rwlock from vdev_indirect_remap()
Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://illumos.org/issues/9080
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/bdfded42e6
Closes #6900
2018-04-14 12:40:47 -07:00
Serapheim Dimitropoulos 9d5b524597 OpenZFS 9079 - race condition in starting and ending condensing thread for indirect vdevs
The timeline of the race condition is the following:

[1] Thread A is about to finish condensing the first vdev in
    spa_condense_indirect_thread(), so it calls the
    spa_condense_indirect_complete_sync() sync task which sets
    the spa_condensing_indirect field to NULL. Waiting for the
    sync task to finish, thread A sleeps until the txg is done.
    When this happens, thread A will acquire spa_async_lock and
    set spa_condense_thread to NULL.

[2] While thread A waits for the txg to finish, thread B which is
    running spa_sync() checks whether it should condense the
    second vdev in vdev_indirect_should_condense() by checking the
    spa_condensing_indirect field which was set to NULL by
    spa_condense_indirect_thread() from thread A. So it goes on
    and tries to spawn a new condensing thread in
    spa_condense_indirect_start_sync() and the aforementioned
    assertion fails because thread A has not set spa_condense_thread
    to NULL (which is basically the last thing it does before returning).

The main issue here is that we rely on both spa_condensing_indirect
and spa_condense_thread to signify whether a condensing thread is
running. Ideally we would only use one throughout the codebase. In
addition, for managing spa_condense_thread we currently use
spa_async_lock which basically ties condensing to scrubbing when
it comes to pausing and resuming those actions during spa export.

This commit introduces the ZTHR infrastructure, which is basically
threads created during spa_load()/spa_create() that exist until we
export or destroy the pool. ZTHRs sleep the majority of the time,
until they are notified to wake up and do some predefined type of work.

In the context of the current bug, a zthr does the condensing of
indirect mappings, replacing the older code that used bare kthreads.
When a pool is created, the condensing zthr is spawned but sleeps
right away, until it is awakened by a signal from spa_sync(). If an
existing pool is loaded, the condensing zthr checks whether there is
anything to condense before going to sleep, in case we were condensing
mappings in the pool before it got exported.
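
The lifecycle described above can be sketched with a condition
variable in Python. The Zthr class and its check/work callbacks are
hypothetical names for illustration and do not mirror the actual ZTHR
interface: the worker is created once, checks for work, sleeps when
there is none, and exits only when it is destroyed.

```python
import threading


class Zthr:
    """Hypothetical sketch of a long-lived worker that sleeps until woken."""

    def __init__(self, check, work):
        self._check = check
        self._work = work
        self._cv = threading.Condition()
        self._cancelled = False
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def _loop(self):
        while True:
            with self._cv:
                # Look for work first, then sleep until notified.
                while not self._cancelled and not self._check():
                    self._cv.wait()
                if self._cancelled:
                    return
            self._work()

    def wakeup(self):
        with self._cv:
            self._cv.notify()

    def destroy(self):
        with self._cv:
            self._cancelled = True
            self._cv.notify()
        self._thread.join()


pending = []
condensed = []
finished = threading.Event()


def should_condense():
    return bool(pending)


def condense():
    condensed.append(pending.pop(0))
    finished.set()


zthr = Zthr(should_condense, condense)   # created at "pool load"
pending.append("indirect-mapping-1")     # the sync path finds work
zthr.wakeup()
completed = finished.wait(timeout=5)
zthr.destroy()                           # at "pool export"
```

Because the worker re-evaluates a single predicate under the lock
before sleeping, there is only one source of truth for whether work is
in progress, which is the property the commit is after.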

The benefits of this solution are the following:
- The current bug is fixed
- spa_condensing_indirect is the sole indicator of whether we are
  currently condensing or not
- condensing is more decoupled from the spa_async_thread related
  functionality.

As a final note, this commit also paves the way for upstreaming
other features that use the ZTHR code like zpool checkpoint and
fast clone deletion.

Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/9079
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3dc606ee
Closes #6900
2018-04-14 12:23:53 -07:00
Brian Behlendorf 4589f3ae4c Optimize possible split block search space
Remove duplicate segment copies to minimize the possible search
space for reconstruction.  Once reduced an accurate assessment can
be made regarding the difficulty in reconstructing the block.

Also, ztest will now run zdb with
zfs_reconstruct_indirect_combinations_max set to 1000000 in an attempt
to avoid checksum errors.
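
A rough Python sketch of why dropping duplicate copies shrinks the
search space. The data and helper names are hypothetical; the number of
candidate reconstructions is the product of the number of distinct
copies per segment, which is presumably the quantity a limit such as
zfs_reconstruct_indirect_combinations_max is compared against.

```python
from math import prod


def unique_copies(copies):
    """Drop byte-identical copies while keeping first-seen order."""
    return list(dict.fromkeys(copies))


def search_space(segments):
    """Number of ways to pick one copy from each segment."""
    return prod(len(copies) for copies in segments)


# Three segments of a split block, each read from a 3-way mirror.
segments = [
    [b"seg0", b"seg0", b"seg0"],
    [b"seg1", b"seg1", b"sXg1"],   # one child is silently damaged
    [b"seg2", b"seg2", b"seg2"],
]
before = search_space(segments)
after = search_space([unique_copies(c) for c in segments])
```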

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6900
2018-04-14 12:22:43 -07:00
Matthew Ahrens 9e052db462 OpenZFS 9290 - device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk
failure and silent damage (e.g. some data on disk is not right, but ZFS
hasn't detected the whole device as being broken). However, the current
device removal implementation bypasses some of the mirror's redundancy.
Note that in no case is incorrect data returned, but we might get a
checksum error when we should have been able to find the right data.

There are two underlying problems:

1. When we remove a mirror device, we only read one side of the mirror.
Since we can't verify the checksum, this side may be silently bad, but
the good data is on the other side of the mirror (which we didn't read).
This can cause the removal to "bake in" the busted data – all copies of
the data in the new location are the same, busted version, while we left
the good version behind.

The fix for this is to read and copy both sides of the mirror. If the
old and new vdevs are mirrors, we will read both sides of the old
mirror, and write each copy to the corresponding side of the new mirror.
(If the old and new vdevs have a different number of children, we will
do this as best as possible.) Even though we aren't verifying checksums,
this ensures that as long as there's a good copy of the data, we'll have
a good copy after the removal, even if there's silent damage to one side
of the mirror. If we're removing a mirror that has some silent damage,
we'll have exactly the same damage in the new location (assuming that
the new location is also a mirror).

2. When we read from an indirect vdev that points to a mirror vdev, we
only consider one copy of the data. This can lead to reduced effective
redundancy, because we might read a bad copy of the data from one side
of the mirror, and not retry the other, good side of the mirror.

Note that the problem is not with the removal process, but rather after
the removal has completed (having copied correct data to both sides of
the mirror), if one side of the new mirror is silently damaged, we
encounter the problem when reading the relocated data via the indirect
vdev. Also note that the problem doesn't occur when ZFS knows that one
side of the mirror is bad, e.g. when a disk entirely fails or is
offlined.

The impact is that reads (from indirect vdevs that point to mirrors) may
return a checksum error even though the good data exists on one side of
the mirror, and scrub doesn't repair all data on the mirror (if some of
it is pointed to via an indirect vdev).

The fix for this is complicated by "split blocks" - one logical block
may be split into two (or more) pieces with each piece moved to a
different new location. In this case we need to read all versions of
each split (one from each side of the mirror), and figure out which
combination of versions results in the correct checksum, and then repair
the incorrect versions.

This ensures that we supply the same redundancy whether you use device
removal or not. For example, if a mirror has small silent errors on all
of its children, we can still reconstruct the correct data, as long as
those errors are at sufficiently-separated offsets (specifically,
separated by the largest block size - default of 128KB, but up to 16MB).
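
The split-block search can be sketched in Python as a brute-force walk
over one version per split. This is an illustration under assumptions:
SHA-256 stands in for whatever checksum the block pointer carries, and
reconstruct() and its return shape are hypothetical.

```python
import hashlib
from itertools import product


def checksum(data):
    return hashlib.sha256(data).digest()


def reconstruct(splits, expected):
    """Try one version per split until the joined block matches expected.

    Returns (good_versions, repairs), where repairs lists
    (split_index, copy_index) for every copy that differs from the good
    version, or None if no combination matches.
    """
    for combo in product(*splits):
        if checksum(b"".join(combo)) == expected:
            repairs = [
                (s, c)
                for s, copies in enumerate(splits)
                for c, copy in enumerate(copies)
                if copy != combo[s]
            ]
            return combo, repairs
    return None


good = (b"AAAA", b"BBBB")
expected = checksum(b"".join(good))
splits = [
    [b"AAAA", b"AxAA"],   # split 0: second mirror child is damaged
    [b"ByBB", b"BBBB"],   # split 1: first mirror child is damaged
]
result = reconstruct(splits, expected)
```

Here each mirror child has a small error, but at different offsets, so
a combination drawn from both children still yields the correct block.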

Porting notes:

* A new indirect vdev check was moved from dsl_scan_needs_resilver_cb()
  to dsl_scan_needs_resilver(), which was added to ZoL as part of the
  sequential scrub work.

* Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t
  parameter.  The extra parameter is unique to ZoL.

* When posting indirect checksum errors the ABD can be passed directly,
  zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS.

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://illumos.org/issues/9290
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591
Closes #6900
2018-04-14 12:21:39 -07:00
Matthew Ahrens a1d477c24c OpenZFS 7614, 9064 - zfs device evacuation/removal
OpenZFS 7614 - zfs device evacuation/removal
OpenZFS 9064 - remove_mirror should wait for device removal to complete

This project allows top-level vdevs to be removed from the storage pool
with "zpool remove", reducing the total amount of storage in the pool.
This operation copies all allocated regions of the device to be removed
onto other devices, recording the mapping from old to new location.
After the removal is complete, read and free operations to the removed
(now "indirect") vdev must be remapped and performed at the new location
on disk.  The indirect mapping table is kept in memory whenever the pool
is loaded, so there is minimal performance overhead when doing operations
on the indirect vdev.
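
The remapping step can be sketched in Python as a sorted table searched
with bisect. The IndirectMapping class and the
(src_offset, length, dst_vdev, dst_offset) entry layout are assumptions
made for this sketch, not the on-disk or in-memory format.

```python
from bisect import bisect_right


class IndirectMapping:
    """Hypothetical table mapping removed-vdev ranges to new locations."""

    def __init__(self, entries):
        # Each entry: (src_offset, length, dst_vdev, dst_offset).
        self.entries = sorted(entries)
        self.starts = [entry[0] for entry in self.entries]

    def remap(self, offset, size):
        """Translate [offset, offset + size) into (vdev, offset, size) pieces."""
        pieces = []
        i = bisect_right(self.starts, offset) - 1
        while size > 0:
            if i < 0 or i >= len(self.entries):
                raise KeyError("offset %d is not mapped" % offset)
            src, length, dst_vdev, dst_offset = self.entries[i]
            inner = offset - src
            if inner < 0 or inner >= length:
                raise KeyError("offset %d is not mapped" % offset)
            n = min(size, length - inner)
            pieces.append((dst_vdev, dst_offset + inner, n))
            offset += n
            size -= n
            i += 1
        return pieces


mapping = IndirectMapping([
    (0, 100, 1, 5000),    # first 100 units moved to vdev 1
    (100, 50, 2, 300),    # next 50 units moved to vdev 2
])
whole = mapping.remap(10, 20)
split = mapping.remap(90, 20)   # straddles two mapping entries
```

A request that straddles two entries comes back as two pieces, which is
how a single logical block can end up at more than one new location.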

The size of the in-memory mapping table will be reduced when its entries
become "obsolete" because they are no longer used by any block pointers
in the pool.  An entry becomes obsolete when all the blocks that use
it are freed.  An entry can also become obsolete when all the snapshots
that reference it are deleted, and the block pointers that reference it
have been "remapped" in all filesystems/zvols (and clones).  Whenever an
indirect block is written, all the block pointers in it will be "remapped"
to their new (concrete) locations if possible.  This process can be
accelerated by using the "zfs remap" command to proactively rewrite all
indirect blocks that reference indirect (removed) vdevs.

Note that when a device is removed, we do not verify the checksum of
the data that is copied.  This makes the process much faster, but if it
were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be
possible to copy the wrong data, when we have the correct data on e.g.
the other side of the mirror.

At the moment, only mirrors and simple top-level vdevs can be removed
and no removal is allowed if any of the top-level vdevs are raidz.

Porting Notes:

* Avoid zero-sized kmem_alloc() in vdev_compact_children().

    The device evacuation code adds a dependency that
    vdev_compact_children() be able to properly empty the vdev_child
    array by setting it to NULL and zeroing vdev_children.  Under Linux,
    kmem_alloc() and related functions return a sentinel pointer rather
    than NULL for zero-sized allocations.

* Remove comment regarding "mpt" driver where zfs_remove_max_segment
  is initialized to SPA_MAXBLOCKSIZE.

  Change zfs_condense_indirect_commit_entry_delay_ticks to
  zfs_condense_indirect_commit_entry_delay_ms for consistency with
  most other tunables in which delays are specified in ms.

* ZTS changes:

    Use set_tunable rather than mdb
    Use zpool sync as appropriate
    Use sync_pool instead of sync
    Kill jobs during test_removal_with_operation to allow unmount/export
    Don't add non-disk names such as "mirror" or "raidz" to $DISKS
    Use $TEST_BASE_DIR instead of /tmp
    Increase HZ from 100 to 1000 which is more common on Linux

    removal_multiple_indirection.ksh
        Reduce iterations in order to not time out on the code
        coverage builders.

    removal_resume_export:
        Functionally, the test case is correct but there exists a race
        where the kernel thread hasn't been fully started yet and is
        not visible.  Wait for up to 1 second for the removal thread
        to be started before giving up on it.  Also, increase the
        amount of data copied in order that the removal not finish
        before the export has a chance to fail.

* MMP compatibility: the concept of concrete versus non-concrete devices
  has slightly changed the semantics of vdev_writeable().  Update
  mmp_random_leaf_impl() accordingly.

* Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool
  feature which is not supported by OpenZFS.

* Added support for new vdev removal tracepoints.

* Test cases removal_with_zdb and removal_condense_export have been
  intentionally disabled.  When run manually they pass as intended,
  but when running in the automated test environment they produce
  unreliable results on the latest Fedora release.

  They may work better once the upstream pool import refactoring is
  merged into ZoL at which point they will be re-enabled.
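
The zero-sized allocation note in the porting notes above can be
illustrated with a small userspace sketch.  It is not the code from the
patch: struct child_array and compact_children() are hypothetical
stand-ins, and malloc()/free() take the place of the kernel allocator.
The point is only that an emptied array is represented by NULL and a
count of zero, so no zero-byte allocation is ever requested.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a vdev's child array; not the real vdev_t. */
struct child_array {
	void	**children;	/* NULL when there are no children */
	size_t	count;
};

/*
 * Drop NULL slots from the array.  When nothing is left, free the array
 * and store NULL instead of asking the allocator for zero bytes, since a
 * zero-sized allocation may hand back a non-NULL sentinel.
 */
void
compact_children(struct child_array *a)
{
	size_t i, j, newc = 0;
	void **newp;

	/* Count the slots that are still in use. */
	for (i = 0; i < a->count; i++) {
		if (a->children[i] != NULL)
			newc++;
	}

	if (newc == 0) {
		free(a->children);
		a->children = NULL;
		a->count = 0;
		return;
	}

	newp = malloc(newc * sizeof (void *));
	if (newp == NULL)
		return;		/* allocation failed: leave the array as is */

	/* Copy the surviving entries in their original order. */
	for (i = 0, j = 0; i < a->count; i++) {
		if (a->children[i] != NULL)
			newp[j++] = a->children[i];
	}

	free(a->children);
	a->children = newp;
	a->count = newc;
}
```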

Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Reece <alex@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Tim Chase <tim@chase2k.com>

OpenZFS-issue: https://www.illumos.org/issues/7614
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb
Closes #6900
2018-04-14 12:16:17 -07:00
Tim Chase 4b0f5b2d7b Wait for resilver after online
This test performs a rapid offline/online cycle of each of several
mirror vdevs.  It can run so quickly that there isn't sufficient pool
redundancy to perform an offline.  The solution is to wait until the
pool is resilvered following the online operation.

Also, add a pool sync before the offline operation to help reduce
spurious errors.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #6900
2018-04-13 18:04:10 -07:00
Tim Chase 5c596ba7a4 Eliminate trailing spaces in DISKS
The zfs-tests.sh driver script could add spaces to the end of $DISKS
which defeats shell-based parsing with constructs such as ${DISKS##* }.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #6900
2018-04-13 18:03:58 -07:00
Seth Forshee 93b43af10d Allow mounting datasets more than once
Currently mounting an already mounted zfs dataset results in an
error, whereas it is typically allowed with other filesystems.
This causes some bad interactions with mount namespaces. Take
this sequence for example:

- Create a dataset
- Create a snapshot of the dataset
- Create a clone of the snapshot
- Create a new mount namespace
- Rename the original dataset

The rename results in unmounting and remounting the clone in the
original mount namespace, however the remount fails because the
dataset is still mounted in the new mount namespace. (Note that
this means the mount in the new mount namespace is never being
unmounted, so perhaps the unmount/remount of the clone isn't
actually necessary.)

The problem here is a result of the way mounting is implemented
in the kernel module. Since it is not mounting block devices it
uses mount_nodev() instead of the usual mount_bdev(). However,
mount_nodev() is written for filesystems for which each mount is
a new instance (i.e. a new super block), and zfs should be able
to detect when a mount request can be satisfied using an existing
super block.

Change zpl_mount() to call sget() directly with its own test
callback. Passing the objset_t object as the fs data allows
checking if a superblock already exists for the dataset, and in
that case we just need to return a new reference for the sb's
root dentry.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Seth Forshee <seth.forshee@canonical.com>
Closes #5796
Closes #7207
2018-04-13 10:44:05 -07:00
bunder2015 1e37dee03f ZTS: clean up leftover ibackup_trunc files
zfs_receive_raw_incremental did not clean up ibackup_trunc.* files
left over from running the test.

Also changing the path of the ibackup files so they can be placed
in the correct directories when /var/tmp is not the temporary
directory.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #7430
2018-04-13 10:35:55 -07:00
Brian Behlendorf d6bb22171b Linux compat 4.16: blk_queue_flag_{set,clear}
The HAVE_BLK_QUEUE_WRITE_CACHE_GPL_ONLY case was overlooked in
the original 10f88c5c commit because blk_queue_write_cache()
was available for the in-kernel builds.

Update the blk_queue_flag_{set,clear} wrappers to call the locked
versions to avoid confusion.  This is safe for all existing callers.

The blk_queue_set_write_cache() function has been updated to use
these wrappers.  This means setting/clearing both QUEUE_FLAG_WC
and QUEUE_FLAG_FUA is no longer atomic but this is only done early
in zvol_alloc() prior to any requests so there is no issue.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Kash Pande <kash@tripleback.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7428 
Closes #7431
2018-04-12 19:46:14 -07:00
LOLi 7fab636188 Add 'zpool split' coverage to the ZFS Test Suite
This change adds five new tests to the ZTS:

 * zpool_split_cliargs: verify command line options and arguments
 * zpool_split_devices: verify zpool split accepts a device list
 * zpool_split_encryption: verify zpool can split encrypted pools
 * zpool_split_props: verify zpool split can set property values
 * zpool_split_vdevs: verify vdev layout when splitting the pool

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7409
2018-04-12 10:57:24 -07:00
Tomohiro Kusumi 8111eb4abc Fix calloc(3) arguments order
calloc(3) takes `nelem` (or `nmemb` in glibc) first, and then size of
elements.  No behavioral difference is expected from passing these in
reverse order, but the calls should follow the standard.

http://pubs.opengroup.org/onlinepubs/009695399/functions/calloc.html
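
As a minimal illustration of the argument order (the helper name
alloc_table() is hypothetical and not taken from the patch), the element
count goes first and the element size second:

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper: allocate a zeroed table of `nelem` 64-bit entries.
 * calloc(3) takes the number of elements first, then the size of each.
 */
uint64_t *
alloc_table(size_t nelem)
{
	return (calloc(nelem, sizeof (uint64_t)));
}
```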

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #7405
2018-04-12 10:50:39 -07:00
beren12 7403d0743e Fix zfs_arc_max minimum tuning
When setting `zfs_arc_max` its minimum value is allowed
to be 64 MiB.  There was an off-by-1 error which can matter
on tiny systems.
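
The boundary condition can be sketched as follows.  This is an
illustration only: ARC_MAX_FLOOR and arc_max_is_valid() are hypothetical
names, and the real check in the ARC tuning code involves further
conditions that are not shown here.

```c
#include <stdint.h>

/* Hypothetical constant: the smallest accepted zfs_arc_max, 64 MiB. */
#define	ARC_MAX_FLOOR	(64ULL << 20)

/*
 * Return nonzero if the requested value is acceptable.  Comparing with
 * ">=" rather than ">" is what lets a value of exactly 64 MiB through.
 */
int
arc_max_is_valid(uint64_t requested)
{
	return (requested >= ARC_MAX_FLOOR);
}
```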

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Zubrzycki <github@mid-earth.net>
Closes #7417
2018-04-12 10:47:32 -07:00
Mike Gerdts d22f3a8244 OpenZFS 9286 - want refreservation=auto
Authored by: Mike Gerdts <mike.gerdts@joyent.com>
Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Don Brady <don.brady@delphix.com>

Porting Notes:
* Adopted destroy_dataset in ZTS test cleanup
* Use ksh shebang instead of bash for new tests

OpenZFS-issue: https://www.illumos.org/issues/9286
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/723d0c85
Closes #7387
2018-04-11 14:52:13 -07:00
LOLi 9966754ac5 Fix zpool set feature@<feature>=disabled
Commit e4010f2 accidentally allows zpool to set pool features to
"disabled"; this should only be allowed at pool creation. This commit
adds additional checks and test coverage to 'zpool set'.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7402
2018-04-11 14:45:58 -07:00
DeHackEd dfb1ad027f zfs(8): fix dedup omission during mdoc overhaul
The property description has been updated with new algorithms as well.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: DHE <git@dehacked.net>
Closes #7377
2018-04-10 17:19:01 -07:00
Tom Caputi edc1e713c2 Fix race in dnode_check_slots_free()
Currently, dnode_check_slots_free() works by checking dn->dn_type
in the dnode to determine if the dnode is reclaimable. However,
there is a small window of time between dnode_free_sync() in the
first call to dsl_dataset_sync() and when the useraccounting code
is run when the type is set to DMU_OT_NONE, but the dnode is not yet
evictable, leading to crashes. This patch adds the ability for
dnodes to track which txg they were last dirtied in and adds a
check for this before performing the reclaim.

This patch also corrects several instances when dn_dirty_link was
treated as a list_node_t when it is technically a multilist_node_t.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7147 
Closes #7388
2018-04-10 11:15:05 -07:00
Giuseppe Di Natale 10f88c5cd5 Linux compat 4.16: blk_queue_flag_{set,clear}
queue_flag_{set,clear}_unlocked are now private interfaces in
the Linux kernel (https://github.com/torvalds/linux/commit/8a0ac14).
Use blk_queue_flag_{set,clear} interfaces which were introduced as
of https://github.com/torvalds/linux/commit/8814ce8.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #7410
2018-04-10 10:32:14 -07:00
Tom Caputi 74df0c5e25 Correct swapped keylocation error messages
This patch corrects a small issue where two error messages
in the code that checks for invalid keylocations were
swapped.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7418
2018-04-09 21:11:17 -07:00
Giuseppe Di Natale 9125f8f5bd Linux compat 4.16: SECTOR_SIZE
As of https://github.com/torvalds/linux/commit/233bde21,
SECTOR_SIZE is defined in linux/blkdev.h. Define SECTOR_SIZE
in sunldi.h only if it's not already defined.
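
A guarded definition of this kind usually looks like the sketch below.
The value 512 is an assumption made for the example rather than a quote
from the patch.

```c
/*
 * Define SECTOR_SIZE only when no earlier header has provided it, so
 * that the two definitions cannot collide.  512 is assumed here.
 */
#ifndef SECTOR_SIZE
#define	SECTOR_SIZE	512
#endif
```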

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #697
2018-04-09 17:20:06 -07:00
Tony Hutter 4f301661df Revert "Handle zap_add() failures in mixed ... "
This reverts commit cc63068e95.

Under certain circumstances this change can result in an ENOSPC
error when adding new files to a directory.  See #7401 for full
details.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Issue #7401 
Closes #7416
2018-04-09 14:24:46 -07:00
Brian Behlendorf 3b0d99289a Fix 'zfs send/recv' hang with 16M blocks
When using 16MB blocks the send/recv queues aren't quite big
enough.  This change leaves the default 16M queue size which is a
good value for most pools.  But it additionally ensures that the
queue sizes are at least twice the allowed zfs_max_recordsize.
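
The sizing rule can be sketched as a simple maximum.  The names
DEFAULT_QUEUE_LENGTH and queue_length() are hypothetical and chosen for
this example; only the 16M default and the "at least twice the maximum
record size" rule come from the description above.

```c
#include <stdint.h>

/* Default queue length from the commit message: 16 MiB. */
#define	DEFAULT_QUEUE_LENGTH	(16ULL * 1024 * 1024)

/*
 * Hypothetical helper: pick a queue length that is never smaller than
 * twice the largest record size the pool is allowed to use.
 */
uint64_t
queue_length(uint64_t max_recordsize)
{
	uint64_t floor = 2 * max_recordsize;

	return (DEFAULT_QUEUE_LENGTH > floor ? DEFAULT_QUEUE_LENGTH : floor);
}
```

With a 1 MiB maximum record size the default of 16 MiB is kept; with a
16 MiB maximum the queue grows to 32 MiB.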

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7365 
Closes #7404
2018-04-08 19:41:15 -07:00
Giuseppe Di Natale 7b47628acb Clean up (k)shlib and cfg file shebangs
Most kshlib files are imported by other scripts
and do not have a shebang at the top of their files.
Make all kshlib follow this convention.

Remove shebangs from cfg files as well.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Close #7406
2018-04-08 19:37:22 -07:00
Tony Hutter 6c9af9e8f4 Fix "file is executable, but no shebang" warnings
Fedora 28's RPM build checks warn when executable files don't have a
shebang line.  These warnings are caused when we (incorrectly)
include data & config files in the _SCRIPTS automake lines. Files in
_SCRIPTS are marked executable by automake. This patch fixes the
issue by including non-executable scripts in a _DATA line instead.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7359 
Closes #7395
2018-04-06 16:34:21 -07:00
Tony Hutter 812323bb03 Exclude python scripts from RPM shebang check
The newest Fedora packaging rules print warnings for scripts using the
/usr/bin/python shebang:

    *** WARNING: mangling shebang in /usr/bin/arc_summary.py from
    #!/usr/bin/python to #!/usr/bin/python2. This will become an ERROR,
    fix it manually!

Fedora wants all cross compatible scripts to pick python3.  Since we
don't want our users to have to pick a specific version of python, we
exclude our scripts from the RPM build check.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7360 
Closes #7399
2018-04-06 16:32:58 -07:00
Antonio Russo 55d80e651a systemd mount generator and tracking ZEDLET
zfs-mount-generator implements the "systemd generator" protocol,
producing systemd.mount units from the cached outputs of zfs list,
during early boot, integrating with systemd.

Each pool has an independent cache of the command

  zfs list -H -oname,mountpoint,canmount -tfilesystem -r $pool

which is kept synchronized by the ZEDLET

  history_event-zfs-list-cacher.sh

Datasets not in the cache will be loaded later in the boot process by
zfs-mount.service, including pools without a cache.

Among other things, this allows for complex mount hierarchies.

Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com>
Closes #7329
2018-04-06 14:11:09 -07:00
Matthew Ahrens 5c27ec1088 Fixes for SNPRINTF_BLKPTR with encrypted BP's
mdb doesn't have dmu_ot[], so we need a different mechanism for its
SNPRINTF_BLKPTR() to determine if the BP is encrypted vs authenticated.

Additionally, since it already relies on BP_IS_ENCRYPTED (etc),
SNPRINTF_BLKPTR might as well figure out the "crypt_type" on its own,
rather than making the caller do so.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7390
2018-04-06 13:30:26 -07:00
Olaf Faaland 0ba106e75c Fix divide-by-zero in mmp_delay_update()
vdev_count_leaves() in the denominator may return 0, caught by Coverity.
Introduced by

* 533ea04 Update mmp_delay on sync or skipped, failed write
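
One way to guard such a division is to clamp the denominator to at least
one, as in the sketch below.  interval_per_leaf() is a hypothetical
helper written for this example and is not necessarily how the patch
itself is structured.

```c
#include <stdint.h>

/*
 * Hypothetical helper: divide an interval across the leaf vdevs while
 * treating a leaf count of zero as one, so the division is always
 * defined.
 */
uint64_t
interval_per_leaf(uint64_t interval_ns, uint64_t leaves)
{
	if (leaves == 0)
		leaves = 1;
	return (interval_ns / leaves);
}
```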

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7391
2018-04-06 13:29:11 -07:00
Tom Caputi 1bf9a552bb Make encrypted "zfs mount -a" failures consistent
Currently, "zfs mount -a" will print a warning and fail to mount
any encrypted datasets that do not have a key loaded. This patch
makes the behavior of this failure consistent with other failure
modes ("zfs mount -a" will silently continue, explicit "zfs mount"
will print a message and return an error code).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7382
2018-04-06 13:28:15 -07:00
Olaf Faaland 533ea0415b Update mmp_delay on sync or skipped, failed write
When an MMP write is skipped, or fails, and time since
mts->mmp_last_write is already greater than mts->mmp_delay, increase
mts->mmp_delay.  The original code only updated mts->mmp_delay when a
write succeeded, but this results in the write(s) after delays and
failed write(s) reporting an ub_mmp_delay which is too low.

Update mmp_last_write and mmp_delay if a txg sync was successful.  At
least one uberblock was written, thus extending the time we can be sure
the pool will not be imported by another host.

Do not allow mmp_delay to go below (MSEC2NSEC(zfs_multihost_interval) /
vdev_count_leaves()) so that a period of frequent successful MMP writes,
e.g. due to frequent txg syncs, does not result in an import activity
check so short it is not reliable based on mmp thread writes alone.

Remove unnecessary local variable, start.  We do not use the start time
of the loop iteration.

Add a debug message in spa_activity_check() to allow verification of the
import_delay value and to prove the activity check occurred.

Alter the tests that import pools and attempt to detect an activity
check.  Calculate the expected duration of spa_activity_check() based on
module parameters at the time the import is performed, rather than a
fixed time set in mmp.cfg.  The fixed time may be wrong.  Also, use the
default zfs_multihost_interval value so the activity check is longer and
easier to recognize.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7330
2018-04-04 16:38:44 -07:00
Tony Hutter 21a4f5cc86 Fedora 28: Fix misc bounds check compiler warnings
Fix a bunch of (mostly) sprintf/snprintf truncation compiler
warnings that show up on Fedora 28 (GCC 8.0.1).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7361 
Closes #7368
2018-04-04 10:16:47 -07:00
Brian Behlendorf 581bc01a07 Remove sysevents
These headers are provided in the ZFS repository and never used
by the SPL.  Remove them to ensure the right ones are included.

Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #696
2018-04-04 09:54:20 -07:00
LOLi 1724eb62de Fix spa reference leak in zfs_ioc_pool_scan
zfs_ioc_pool_scan leaks a spa reference when zc->zc_flags is not a
valid pool_scrub_cmd_t: this could happen if the userland binaries
and ZFS kernel module differ in version and would prevent the pool from
being exported.

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7380
2018-04-03 17:31:30 -07:00
LOLi f119d00c1f Fix add_nested_replacing_spare test case
Use 'zpool reopen' instead of 'zpool scrub' to kick in the spare device:
this is required to avoid spurious failures caused by a race condition
in events processing by the ZFS Event Daemon:

P1 (zpool scrub)                            P2 (zed)
---
zfs_ioc_pool_scan()
 -> dsl_scan()
  -> vdev_reopen()
   -> vdev_set_state(VDEV_STATE_CANT_OPEN)
                                            zfs_ioc_vdev_attach()
                                             -> spa_vdev_attach()
                                              -> dsl_resilver_restart()
  -> dsl_sync_task()
   -> dsl_scan_setup_check()
   <- dsl_scan_setup_check(): EBUSY

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7247 
Closes #7342
2018-04-03 17:30:14 -07:00
Tim Chase 10adee27ce Remove ASSERT() in l2arc_apply_transforms()
The ASSERT was erroneously copied from the next section of code.
The buffer's size should be expanded from "psize" to "asize"
if necessary.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #7375
2018-03-31 15:14:21 -07:00
Tom Caputi a2c2ed1bd4 Decryption error handling improvements
Currently, the decryption and block authentication code in
the ZIO / ARC layers is a bit inconsistent with regards to
the ereports that are produced and the error codes that are
passed to calling functions. This patch ensures that all of
these errors (which begin as ECKSUM) are converted to EIO
before they leave the ZIO or ARC layer and that ereports
are correctly generated on each decryption / authentication
failure.

In addition, this patch fixes a bug in zio_decrypt() where
ECKSUM never gets written to zio->io_error.

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7372
2018-03-31 11:12:51 -07:00
Tom Caputi 4515b1d01c Encrypted dnode blocks should be prefetched raw
Encrypted dnode blocks are always initially read as raw data and
converted to decrypted data when an encrypted bonus buffer is
needed. This allows the DMU to be used for things like fetching
the DMU master node without requiring keys to be loaded. However,
dbuf_issue_final_prefetch() does not currently read the data as
raw. The end result of this is that prefetched dnode blocks are
read twice from disk: once decrypted and then again as raw data.
This patch corrects the issue by adding the flag when appropriate.

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7362
2018-03-31 11:11:48 -07:00
LOLi 77d8a0f1a4 Fix hung z_zvol tasks during 'zfs receive'
During a receive operation zvol_create_minors_impl() can wait
needlessly for the prefetch thread because both share the same tasks
queue.  This results in hung tasks:

<3>INFO: task z_zvol:5541 blocked for more than 120 seconds.
<3>      Tainted: P           O  3.16.0-4-amd64
<3>"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

The first z_zvol:5541 (zvol_task_cb) is waiting for the long running
traverse_prefetch_thread:260

root@linux:~# cat /proc/spl/taskq
taskq                       act  nthr  spwn  maxt   pri  mina
spl_system_taskq/0            1     2     0    64   100     1
	active: [260]traverse_prefetch_thread [zfs](0xffff88003347ae40)
	wait: 5541
spl_delay_taskq/0             0     1     0     4   100     1
	delay: spa_deadman [zfs](0xffff880039924000)
z_zvol/1                      1     1     0     1   120     1
	active: [5541]zvol_task_cb [zfs](0xffff88001fde6400)
	pend: zvol_task_cb [zfs](0xffff88001fde6800)

This change adds a dedicated, per-pool, prefetch taskq to prevent the
traverse code from monopolizing the global (and limited) system_taskq by
inappropriately scheduling long running tasks on it.

Reviewed-by: Albert Lee <trisk@forkgnu.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6330 
Closes #6890 
Closes #7343
2018-03-30 12:10:01 -07:00
Georgy Yakovlev 2f291ebaed zfs-functions: skip lines where mntpnt is 'none'
This fixes zfs-mount initscript trying to mount swap volumes
as filesystems or anything that has 'none' as a mountpoint
in /etc/fstab.  Additionally, fixes trying to mount swap volumes
as a filesystem on RHEL.  RHEL defines mountpoint for swap
as `swap`.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Georgy Yakovlev <ya@sysdump.net>
Closes #7346
Closes #7347
2018-03-30 12:05:24 -07:00
Andriy Gapon 5e00213e43 OpenZFS 9164 - assert: newds == os->os_dsl_dataset
Authored by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Don Brady <don.brady@delphix.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

Porting Notes:
* Re-enabled and tweaked the zpool_upgrade_007_pos test case
  to successfully run in under 5 minutes.

OpenZFS-issue: https://www.illumos.org/issues/9164
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0e776dc06a
Closes #6112
Closes #7336
2018-03-30 12:00:40 -07:00
Don Brady 99f505a4d7 Add support for nvme based devids
Adds a devid for nvme devices. This is very similar to how the
other 'bus' (scsi|sata|usb) devids are generated. The devid 
resides in a name/value pair in the leaf vdevs in a zpool config.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #7356
2018-03-29 17:43:25 -07:00
Tom Caputi 32dce2da0c Resolve QAT issues with incompressible data
Currently, when ZFS wants to accelerate compression with QAT, it
passes a destination buffer of the same size as the source buffer.
Unfortunately, if the data is incompressible, QAT can actually
"compress" the data to be larger than the source buffer. When this
happens, the QAT driver will return a FAILED error code and print
warnings to dmesg. This patch fixes these issues by providing the
QAT driver with an additional buffer to work with so that even
completely incompressible source data will not cause an overflow.

This patch also resolves an error handling issue where
incompressible data attempts compression twice: once by QAT and
once in software. To fix this issue, a new (and fake) error code
CPA_STATUS_INCOMPRESSIBLE has been added so that the calling code
can correctly account for the difference between a hardware
failure and data that simply cannot be compressed.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Weigang Li <weigang.li@intel.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7338
2018-03-29 17:40:34 -07:00
Tom Caputi 13a2ff2727 Fix ASSERT in dsl_scan_fini() and cleanup comments
This patch fixes an issue where dsl_scan_prefetch_cb() might
add more prefetch I/Os to the prefetch queue after prefetching
has been completed. This was happening because that code was
checking scn->scn_suspending instead of scn->scn_prefetch_stop.
This occasionally triggered an ASSERT during ztest runs in
dsl_scan_fini() when the code attempted to destroy an AVL tree
that still had entries in it. This patch also includes a number
of spelling corrections and comment cleanups throughout
dsl_scan.c

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7353
2018-03-28 18:30:44 -07:00
Tony Hutter d2812de6f7 chmod -x on etc/init.d/zfs-*.in automake files
Clear executable bit on zfs-import.in, zfs-mount.in,
zfs-share.in, and zfs-zed.in.  These are automake files and
should not be marked executable.  This fixes a RPM build error
on Fedora 28.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7355 
Closes #7327
2018-03-28 12:08:21 -07:00
Brian Behlendorf b2ab468dde Fix mmap / libaio deadlock
Calling uiomove() in mappedread() under the page lock can result
in a deadlock if the user space page needs to be faulted in.

Resolve the issue by dropping the page lock before the uiomove().
The inode range lock protects against concurrent updates via
zfs_read() and zfs_write().

Reviewed-by: Albert Lee <trisk@forkgnu.org>
Reviewed-by: Chunwei Chen <david.chen@nutanix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7335 
Closes #7339
2018-03-28 10:19:22 -07:00
DeHackEd 668173b576 Remove libattr requirement
RHEL/CentOS 6 supports sys/xattr.h eliminating the need for
libattr-devel as a dependency.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes #7344 
Closes #7351
2018-03-27 16:51:32 -07:00
Allan Jude 5152a74088 OpenZFS 9321 - arc_loan_compressed_buf() can increment arc_loaned_bytes by the wrong value
Authored by: Allan Jude <allanjude@freebsd.org>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9321
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/92b05f3a18
Closes #7333
2018-03-26 20:40:15 -07:00
Tony Hutter 9ea6c3d39d Fedora 28: Fix "Macro %_dracutdir has empty body"
If you run ./configure --with-config=srpm, it will not trigger
the user m4 scripts to populate the dracut and udev directories.
This causes a build error on Fedora 28.  Make the dracut and
udev lines conditional to get around this.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7326
Closes #7328
2018-03-25 15:00:47 -07:00
Tom Caputi 157ef7f6a5 Don't count embedded bps in read stats
Currently, ZFS tracks statistics about calls to arc_read()
via the /proc/spl/kstat/zfs/<pool>/reads file for debugging.
Unfortunately, this file currently counts embedded bps as
disk reads since they are technically processed by the ZIO
layer. This pollutes the log since the ARC will never cache
embedded bps. This patch corrects this issue by preventing
the logging of embedded bp reads.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7334
2018-03-23 21:35:19 -07:00
Paul Dagnelie 387b6856d6 OpenZFS 9193 - bootcfg -C doesn't work
When given an empty string as a rootds value, bootcfg -C fails with
the error message 'could not set nextboot: '' is an invalid name'.
This should be allowed because it represents clearing the nextboot
configuration.

Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9193
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/504645d227
Closes #7230
2018-03-22 16:16:55 -07:00
Peter Ashford 910f3ce739 Clarify zpool actions for an intent log device
Updated the "Intent Log" section of the "zpool" manual page to
properly reflect the actions that may be performed.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Peter Ashford <ashford@accs.com>
Closes #6938 
Closes #7318
2018-03-22 15:12:08 -07:00
kpande 05747eca5b modprobe zfs during dracut mount
Resolves importing root pool during boot in dracut.  This case was
inadvertently broken with the module autoloading change in #7287.

Reviewed-by: Matthew Thode <prometheanfire@gentoo.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Kash Pande <kash@tripleback.net>
Closes #7322
2018-03-22 10:14:29 -07:00
Matthew Ahrens 2fd92c3d6c enable zfs_dbgmsg() by default, without dprintf()
zfs_dbgmsg() should record a message by default.  As a general
principle, these messages shouldn't be too verbose.  Furthermore, the
amount of memory used is limited to 4MB (by default).

dprintf() should only record a message if this is a debug build, and
ZFS_DEBUG_DPRINTF is set in zfs_flags.  This flag is not set by default
(even on debug builds).  These messages are extremely verbose, and
sometimes nontrivial to compute.

SET_ERROR() should only record a message if ZFS_DEBUG_SET_ERROR is set
in zfs_flags.  This flag is not set by default (even on debug builds).

This brings our behavior in line with illumos.  Note that the message
format is unchanged (including file, line, and function, even though
these are not recorded on illumos).

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #7314
2018-03-21 15:37:32 -07:00
Tom Caputi 8d9e7c8fbe Fix spelling errors in comments
This patch simply corrects some spelling / grammar errors in
the QAT and encryption code comments. No functional changes.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7319
2018-03-21 08:42:13 -07:00
timor c66e54e9dc Add support for nvme disk detection
This treats /dev/nvme.. devices the same way as /dev/sd... devices.  The
motivation behind this is that whole disk detection did not work on nvme
SSDs without that, because DKC_UNKNOWN was returned for such devices.

Perhaps there should be a separate DKC_ type for this, but I don't know
enough about the code to know the implications of that.

Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: timor <timor.dd@googlemail.com>
Closes #7304
2018-03-21 08:35:20 -07:00
Tom Caputi 089fbf313c Add comments for portable dnode / objset flags
This patch adds some comments describing the purpose of "portable"
dnode and objset flags so that it is clear when new flags should
be added to the respective flag masks. This patch includes no
functional changes.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7313
2018-03-20 11:55:21 -07:00
Alek P 272b5d730f Add JSON output support to channel programs
The changes piggyback JSON output support on top of channel programs 
(#6558).  This way the JSON output support is targeted to scripting 
use cases and is easily maintainable since it really only touches 
one function (zfs_do_channel_program()).

This patch ports Joyent's JSON nvlist library from illumos to enable 
easy JSON printing of channel program output nvlist.  To keep the 
delta small I also took advantage of the fact that printing in
zfs_do_channel_program() was almost always done before exiting 
the program.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #7281
2018-03-19 12:40:58 -07:00
Brian Behlendorf a76f3d0437 Fix deadlock in IO pipeline
In vdev_queue_aggregate() the zio_execute() bypass should not be
called under the vdev queue lock.  This can result in a deadlock
as shown in the stack traces below.

Drop the vdev queue lock then walk the parents of the aggregate IO
to determine the list of component IOs to be bypassed.  This can
be done safely without holding the io_lock since the new aggregate
IO has not yet been returned and its parents cannot change.

---  THREAD 1 ---
arc_read()
  zio_nowait()
    zio_vdev_io_start()
      vdev_queue_io() <--- mutex_enter(vq->vq_lock)
        vdev_queue_io_to_issue()
          vdev_queue_aggregate()
            zio_execute()
              zio_vdev_io_assess()
                zio_wait_for_children() <- mutex_enter(zio->io_lock)

--- THREAD 2 --- (inverse order)
arc_read()
  zio_change_priority() <- mutex_enter(zio->io_lock)
    vdev_queue_change_io_priority() <- mutex_enter(vq->vq_lock)

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7307
2018-03-16 16:46:06 -07:00
Tom Caputi 17dd88352e Allow QAT to handle non page-aligned buffers
This patch adds some handling to the QAT acceleration functions
that allows them to handle buffers that are not aligned with the
page cache. At the moment this never happens since callers only
happen to work with page-aligned buffers, but this code should
prevent headaches if this isn't always true in the future. This
patch also adds some cleanups to align the QAT compression code
with the encryption and checksumming code.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Weigang Li <weigang.li@intel.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7305
2018-03-16 16:02:49 -07:00
Olaf Faaland cec3a0a1bb Report pool suspended due to MMP
When the pool is suspended, record whether it was due to an I/O error or
due to MMP writes failing to succeed within the required time.

Change spa_suspended from uint8_t to zio_suspend_reason_t to store the
reason.

When userspace queries pool status via spa_tryimport(), report the
reason the pool was suspended in a new key,
ZPOOL_CONFIG_SUSPENDED_REASON.

In libzfs, when interpreting the returned config nvlist, report
suspension due to MMP with a new pool status enum value,
ZPOOL_STATUS_IO_FAILURE_MMP.

In status_callback(), which generates and emits the message when 'zpool
status' is executed, add a case to print an appropriate message for the
new pool status enum value.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7296
2018-03-15 10:56:55 -07:00
Tom Caputi 3874220932 SHA256 QAT acceleration
This patch enables acceleration of SHA256 checksums using Intel
Quick Assist Technology. This patch also fixes up and refactors
some of the code from QAT encryption to make the behavior
consistent.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chengfeix Zhu <chengfeix.zhu@intel.com>
Signed-off-by: Weigang Li <weigang.li@intel.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7295
2018-03-15 10:53:58 -07:00
Stephen Blinick 8a2a9db8df OpenZFS 9076 - Adjust perf test concurrency settings
ZFS Performance test concurrency should be lowered for better latency

Work by Stephen Blinick.

Nightly performance runs typically consist of two levels of concurrency;
and both are fairly high.

Since the IO runs are to a ZFS filesystem, within a zpool, which is
based on some variable number of vdev's, the amount of IO driven to each
device is variable. Additionally, different device types (HDD vs SSD,
etc) can generally handle a different amount of concurrent IO before
saturating.

Nevertheless, in practice, it appears that most tests are well past the
concurrency saturation point and therefore both perform with the same
throughput, the maximum of the device. Because the queue depth to the
device(s) is so high, however, the latency is much higher than the best
possible at that throughput, and increases linearly with the increase in
concurrency.

This means that changes in code that impact latency during normal
operation (before saturation) may not be apparent when a large component
of the measured latency is from the IO sitting in a queue to be
serviced. Therefore, changing the concurrency settings is recommended.

Authored by: Stephen Blinick <stephen.blinick@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: John Wren Kennedy <jwk404@gmail.com>
Ported-by: John Wren Kennedy <jwk404@gmail.com>

OpenZFS-issue: https://www.illumos.org/issues/9076
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/562
Upstream bug: DLPX-45477
Closes #7302
2018-03-15 10:51:00 -07:00
Paul Zuchowski 1a2342784a receive_spill does not byte swap spill contents
In zfs receive, the function receive_spill should account
for spill block endian conversion as a defensive measure.
 
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #7300
2018-03-15 10:29:51 -07:00
Brian Behlendorf de4f8d5d26 OpenZFS 9188 - increase size of dbuf cache to reduce indirect block decompression
With compressed ARC (bug 6950) we use up to 25% of our CPU to decompress
indirect blocks, under a workload of random cached reads. To reduce this
decompression cost, we would like to increase the size of the dbuf cache so
that more indirect blocks can be stored uncompressed.

If we are caching entire large files of recordsize=8K, the indirect blocks
use 1/64th as much memory as the data blocks (assuming they have the same
compression ratio). We suggest making the dbuf cache be 1/32nd of all memory,
so that in this scenario we should be able to keep all the indirect blocks
decompressed in the dbuf cache. (We want it to be more than the 1/64th that
the indirect blocks would use because we need to cache other stuff in the dbuf
cache as well.)
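
The 1/64th figure above follows from simple arithmetic. A minimal sketch,
assuming each data block needs one 128-byte block pointer in its parent
indirect block (the size of blkptr_t), counting only level-1 indirect blocks,
and assuming equal compression ratios; the function name is illustrative only:

```python
from fractions import Fraction

BLKPTR_SIZE = 128  # bytes per block pointer (assumed: sizeof (blkptr_t))


def indirect_to_data_ratio(recordsize):
    """Memory used by level-1 indirect blocks relative to the data blocks."""
    # Every data block of `recordsize` bytes is referenced by exactly one
    # block pointer stored in its parent indirect block.
    return Fraction(BLKPTR_SIZE, recordsize)


ratio = indirect_to_data_ratio(8 * 1024)
print(ratio)                    # 1/64
print(ratio < Fraction(1, 32))  # True: a 1/32 dbuf cache leaves headroom
```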

In real world workloads, this won't help as dramatically as the example above,
but we think it's still worth it because the risk of decreasing performance is
low. The potential negative performance impact is that we will be slightly
reducing the size of the ARC (by ~3%).

Porting Notes:
* Added modules options to zfs-module-parameters.5 man page.
* Preserved scaling based on target ARC size rather than max ARC size.

Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9188
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/564
Upstream bug: DLPX-46942
Closes #7273
2018-03-13 10:52:48 -07:00
Brian Behlendorf a6cc97566c Add kernel module auto-loading
Historically a dynamic misc minor number was registered for the
/dev/zfs device in order to prevent minor number collisions.  This
was fine but it prevented us from being able to use the kernel
module auto-loader, which requires a known reserved value.

Resolve this issue by adding a configure test to find an available
misc minor number which can then be used in MODULE_ALIAS_MISCDEV at
build time.  By adding this alias the zfs kmod is added to the list
of known static-nodes and the systemd-tmpfiles-setup-dev service
will create a /dev/zfs character device at boot time.

This in turn allows us to update the 90-zfs.rules file to make it
aware this is a static node.  The upshot of this is that whenever
a process (zpool, zfs, zed) opens /dev/zfs the kmods will be
automatically loaded.  This even works for unprivileged users so there
is no longer a need to manually load the modules at boot time.

As an additional bonus the zed now no longer needs to start after
the zfs-import.service since it will trigger the module load.

In the unlikely event the minor number we selected conflicts with
another out of tree unregistered minor number the code falls back
to dynamically allocating it.  In this case the modules again
must be manually loaded.

Note that due to the change in the method of registering the minor
number the zimport.sh test case may incorrectly fail when the
static node for the installed packages is created instead of the
dynamic one.  This issue will only transiently impact zimport.sh
for this single commit when we transition and are mixing and
matching methods.

Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
TEST_ZIMPORT_SKIP="yes"
Closes #7287
2018-03-13 10:45:55 -07:00
Tim Chase 02638a30ef Add zfs_scan_ignore_errors tunable
When it's set, a DTL range will be cleared even if its scan/scrub had
errors.  This makes it possible to work around resilver/scrub upon import when the
pool has errors.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #7293
2018-03-13 10:43:14 -07:00
Paul Zuchowski 83362e8e67 Destroy makes full snap list before destroying
Change zfs destroy logic so destroying begins before
the entire list of snapshots is built.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #7271
2018-03-12 15:24:08 -07:00
Chunwei Chen 9470cbd4f9 Fix race in trace point in zrl_add_impl
We hit an illegal memory access in the zrlock trace point. The problem
is that zrl->zr_owner and zrl->zr_caller are assigned locklessly. If
zrl->zr_caller is assigned a longer string between the time __string()
calculates the strlen and the time __assign_str() does the strcpy, the
copy will overflow the buffer.

==
For example:

Initial condition:
zrl->zr_owner = A
zrl->zr_caller = "abc"

Thread A                                 Thread B
-------------------------------------------------
if (zrl->zr_owner == A) {
  DTRACE_PROBE2() {
    __string() {
      strlen(zrl->zr_caller) -> 3
      allocate buf[4]
    }

                                        zrl->zr_owner = B
				        zrl->zr_caller = "abcd"

    __assign_str() {
      strcpy(buf, zrl->zr_caller) <- buffer overflow
==

Dereferencing zrl->zr_owner->pid may also be problematic: zrl->zr_owner
could be changed to another task, and that task could exit, freeing
the task_struct. This should be very unlikely, as the other task needs to
zrl_remove and exit between the dereferences of zrl->zr_owner and
zrl->zr_owner->pid. Nevertheless, we'll deal with it as well.

To fix the zrl->zr_caller issue, instead of copying the string content, we
just copy the pointer. This is safe because it always points to
__func__, which is static. As for the zrl->zr_owner issue, we pass in
curthread instead of using zrl->zr_owner.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #7291
2018-03-12 11:27:02 -07:00
Brian Behlendorf b7eec00f9f Fix MMP write frequency for large pools
When a single pool contains more vdevs than the CONFIG_HZ for
the kernel, the mmp thread will not delay properly.  Switch
to using cv_timedwait_sig_hires() to handle higher resolution
delays.

This issue was reported on Arch Linux where HZ defaults to only
100 and this could be fairly easily reproduced with a reasonably
large pool.  Most distribution kernels set CONFIG_HZ=250 or
CONFIG_HZ=1000 and thus are unlikely to be impacted.
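
To see why a low HZ matters, here is a rough sketch of the arithmetic. It
assumes the delay between MMP writes is the multihost interval divided evenly
across the leaf vdevs, a 1000 ms interval, and a tick-based wait that
truncates to whole ticks. The function and parameter names are hypothetical
and do not come from the ZFS source:

```python
def delay_in_ticks(interval_ms, leaves, hz):
    """Whole scheduler ticks available between MMP writes (truncating)."""
    # Assumed model: per-leaf delay = interval_ms / leaves, and one tick
    # lasts 1000 / hz milliseconds. Integer math mirrors tick truncation.
    return (interval_ms * hz) // (leaves * 1000)


print(delay_in_ticks(1000, 120, 100))   # 0: the thread would not sleep at all
print(delay_in_ticks(1000, 120, 250))   # 2
print(delay_in_ticks(1000, 120, 1000))  # 8
```

With more leaves than ticks per second the tick count truncates to zero,
which is the case a high-resolution wait avoids.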

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7205 
Closes #7289
2018-03-12 11:26:05 -07:00
Olaf Faaland 743253df70 Hold SCL_VDEV when counting leaves
A config lock should be held while vdev_count_leaves() walks the tree,
otherwise the pointers it references may become invalid during the walk.

SCL_VDEV is a minimal lock provided for such use cases.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7286
2018-03-09 15:42:54 -08:00
Olaf Faaland ebed90a598 Handle zio_resume and mmp => off
When multihost is disabled on a pool, and the pool is resumed via zpool
clear, within a single cycle of the mmp thread's loop (e.g.  while it's
in the cv_timedwait call), both mmp_last_write and mmp_delay should be
updated.

The original code mistakenly treated the two cases as if they could not
occur at the same time.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7286
2018-03-09 15:42:11 -08:00
Tomohiro Kusumi 5ee220ba5c Document allowed pool names
PR #7208 was a patch to allow non-reserved pool names which begin with
mirror, raidz, spare (but do not equal), however we'd rather document
it in the man page for compatibility with other OpenZFS implementations,
to avoid pool names that may not work on non-Linux platforms.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #7216
2018-03-09 14:04:15 -08:00
LOLi c45c6d9212 Fix zfs-kmod builds when using rpm >= 4.14
With rpm-software-management/rpm@5e94633 a package version containing
invalid characters (most commonly a double '-') causes the kmod package
generation to terminate with an error.  This change takes advantage of
the newly introduced rpm macro "_wrong_version_format_terminate_build"
to allow kmod packages to be built.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:  loli10K <ezomori.nozomu@gmail.com>
Closes #7284
2018-03-09 13:52:37 -08:00
LOLi 43983eb202 Fix spl-kmod builds when using rpm >= 4.14
With rpm-software-management/rpm@5e94633 a package version containing
invalid characters (most commonly a double '-') causes the kmod package
generation to terminate with an error.  This change takes advantage of
the newly introduced rpm macro "_wrong_version_format_terminate_build"
to allow kmod packages to be built.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:  loli10K <ezomori.nozomu@gmail.com>
Closes #691
2018-03-09 13:51:31 -08:00
Tomohiro Kusumi 6b8655ad3f Change functions which return literals to return const char*
get_format_prompt_string() and zpool_state_to_name() return
a string literal which is read-only, thus they should return
`const char*`.

zpool_get_prop_string() returns a non-const string after
successful nv-lookup, and returns a string literal otherwise.
Since this function is designed to be used for read-only purpose,
the return type should also be `const char*`.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #7285
2018-03-09 13:47:32 -08:00
Tom Caputi cf63739191 QAT support for AES-GCM
This patch adds support for acceleration of AES-GCM encryption
with Intel Quick Assist Technology.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chengfeix Zhu <chengfeix.zhu@intel.com>
Signed-off-by: Weigang Li <weigang.li@intel.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7282
2018-03-09 13:37:15 -08:00
Paul Zuchowski 8e5d14844d zdb and inuse tests don't pass with real disks
Due to zpool create auto-partitioning in Linux (i.e. sdb1),
certain utilities need to use the partition (sdb1) while
others use the whole disk name (sdb).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #6939 
Closes #7261
2018-03-07 17:03:33 -08:00
Wolfgang Bumiller 0e85048f53 Take user namespaces into account in policy checks
Change file related checks to use user namespaces and make
sure involved uids/gids are mappable in the current
namespace.

Note that checks without file ownership information will
still not take user namespaces into account, as some of
these should be handled via 'zfs allow' (otherwise root in a
user namespace could issue commands such as `zpool export`).

This also adds an initial user namespace regression test
for the setgid bit loss, with a user_ns_exec helper usable
in further tests.

Additionally, configure checks for the required user
namespace related features are added for:
  * ns_capable
  * kuid/kgid_has_mapping()
  * user_ns in cred_t

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Closes #6800 
Closes #7270
2018-03-07 15:40:42 -08:00
Brian Behlendorf 434a3375ce ZTS: fix send-c_stream_size_estimate
The test could fail when attempting to write to a newly created
volume which was missing its device node.  Resolve the issue by
calling block_device_wait() which blocks until udev creates the
needed entry.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7276 
Closes #7277
2018-03-07 09:55:54 -08:00
Giuseppe Di Natale a07ad58847 Fix dbufstats_001_pos
Implement a new helper within_tolerance to test if a value
is within range of a target.

Because the dbufstats and dbufs kstat file are being read
at slightly different times, it is possible for stats to be
slightly off. Use within_tolerance to determine if the value
is "close enough" to the target.
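
A minimal sketch of the comparison such a helper performs, written in Python
for illustration only. The real helper lives in the test suite's shell
library; the argument order and the inclusive bound here are assumptions:

```python
def within_tolerance(value, target, tolerance):
    """Return True if value is no more than tolerance away from target."""
    return abs(value - target) <= tolerance


# Two kstat reads taken moments apart may differ by a few entries.
print(within_tolerance(1003, 1000, 5))  # True
print(within_tolerance(1010, 1000, 5))  # False
```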

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #7239 
Closes #7266
2018-03-07 09:53:04 -08:00
Tony Hutter 639b18944a Allow to limit zed's syslog chattiness
Some usage patterns like send/recv of replication streams can
produce a large number of events. In such a case, the current
all-syslog.sh zedlet will live up to its name, and flood the
logs with mostly redundant information. To mitigate this
situation, this changeset introduces two new variables
ZED_SYSLOG_SUBCLASS_INCLUDE and ZED_SYSLOG_SUBCLASS_EXCLUDE
to zed.rc that give more control over which event classes end
up in the syslog.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Daniel Kobras <d.kobras@science-computing.de>
Closes #6886 
Closes #7260
2018-03-06 15:41:52 -08:00
Olaf Faaland d2160d0538 Record skipped MMP writes in multihost_history
Once per pass through the MMP thread's loop, the vdev tree is walked to
find a suitable leaf to write the next MMP block to.  If no such leaf is
found, the thread sleeps for a while and resumes at the top of the loop.

Add an entry to multihost_history when no leaf can be found, and record
the reason in the error column.  The error code for such entries is a
bitfield, displayed in hex:

0x1  At least one vdev (interior or leaf) was not writeable.
0x2  At least one writeable leaf vdev was found, but it had a pending
MMP write.
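
As an illustration, the bitfield above can be decoded as follows. This is a
sketch for reading the kstat by hand, not code from the patch; the names are
hypothetical, and it applies only to skipped-write records (for ordinary
records the error column is not this bitfield):

```python
# Bit meanings as described in this commit message.
MMP_SKIP_REASONS = {
    0x1: "at least one vdev (interior or leaf) was not writeable",
    0x2: "a writeable leaf vdev was found, but it had a pending MMP write",
}


def decode_skip_error(code):
    """Return the list of reasons encoded in a skipped-write error value."""
    return [text for bit, text in sorted(MMP_SKIP_REASONS.items())
            if code & bit]


for reason in decode_skip_error(0x3):
    print(reason)
```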

timestamp = the time in seconds since the epoch when no leaf could be
found originally.

duration = the time (in ns) during which no MMP block was written for
this reason.  This does not include the preceding inter-write period
nor the following inter-write period.

vdev_guid = the number of sequential cycles of the MMP thread loop when
this occurred.

For records of skipped MMP writes the right-most column, vdev_path, is
reported as "-".

Sample output, truncated to fit:

id   txg  timestamp   error  duration    mmp_delay  vdev_guid     ...
936  11   1520036441  0      146264      891422313  1740883117838 ...
937  11   1520036441  0      163956      888356657  7320395061548 ...
938  11   1520036442  0      130690      885314969  7320395061548 ...
939  11   1520036442  0      2001068577  882296582  1740883117838 ...
940  11   1520036443  0      161806      882296582  7320395061548 ...
941  11   1520036443  0x2    0           998020546  1             ...
942  11   1520036444  0      136585      998020546  7320395061548 ...
943  11   1520036444  0x2    0           998020257  1             ...
944  11   1520036445  5      2002662964  994160219  1740883117838 ...
945  11   1520036445  0x2    998073118   994160219  3             ...
946  11   1520036447  0      247136      994160219  7320395061548 ...

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7212
2018-03-06 15:15:15 -08:00
Olaf Faaland 14c240cede Detect long config lock acquisition in mmp
If something holds the config lock as a writer for too long, MMP will
fail to issue MMP writes in a timely manner.  This will result either in
the pool being suspended, or in an extreme case, in the pool not being
protected.

If the time to acquire the config lock exceeds 1/10 of the minimum
zfs_multihost_interval, report it in the zfs debug log.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7212
2018-03-06 15:14:39 -08:00
Giuseppe Di Natale c7b55e71b0 Introduce a destroy_dataset helper
Datasets can be busy when calling zfs destroy. Introduce
a helper function to destroy datasets and use it to destroy
datasets in zfs_allow_004_pos, zfs_promote_008_pos, and
zfs_destroy_002_pos.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #7224 
Closes #7246 
Closes #7249 
Closes #7267
2018-03-06 14:54:57 -08:00
Nasf-Fan 2705ebf0a7 Misc fixes and cleanup for project quota
1) The Coverity Scan reports some issues for the project
   quota patch, including:

1.1) zfs_prop_get_userquota() mistakenly uses the const quota
   type value directly as the condition check.

1.2) dmu_objset_userquota_get_ids() may cause dnode::dn_newgid
   to be overwritten by dnode::dn->dn_oldprojid.

2) This patch fixes related issues. It also enhances the logic
   for zfs_project_item_alloc() to avoid buffer overflow.

3) Skip the project quota ability check if the operation does not
   change anything project quota related (id or flag). Otherwise,
   chattr operations (for other, non project quota flags) will fail
   if project quota is disabled.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Fan Yong <fan.yong@intel.com>
Closes #7251 
Closes #7265
2018-03-05 12:56:27 -08:00
Giuseppe Di Natale dd3e1e3083 Linux 4.16 compat: get_disk_and_module()
As of https://github.com/torvalds/linux/commit/fb6d47a, get_disk()
is now get_disk_and_module(). Add a configure check to determine
if we need to use get_disk_and_module().

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #7264
2018-03-05 12:44:35 -08:00
Tony Hutter 80d52c3919 Change checksum & IO delay ratelimit values
Change checksum & IO delay ratelimit thresholds from 5/sec to 20/sec.
This allows zed to actually trigger if a bunch of these events arrive in
a short period of time (zed has a threshold of 10 events in 10 sec).
Previously, if you had, say, 100 checksum errors in 1 sec, it would get
ratelimited to 5/sec which wouldn't trigger zed to fault the drive.
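
The effect can be sketched with a toy model. It assumes a burst of events
arrives within a single one-second ratelimit window and that zed faults the
drive once 10 events have been delivered; the function name and the
one-window simplification are illustrative only:

```python
def zed_would_fault(events_in_burst, ratelimit_per_sec, zed_threshold=10):
    """True if enough events survive the ratelimit to reach zed's threshold."""
    # Events beyond the per-second limit are dropped before zed sees them.
    delivered = min(events_in_burst, ratelimit_per_sec)
    return delivered >= zed_threshold


print(zed_would_fault(100, 5))   # False: only 5 events reach zed
print(zed_would_fault(100, 20))  # True: 20 events reach zed
```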

Also, convert the checksum and IO delay thresholds to module params for
easy testing.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7252
2018-03-04 17:34:51 -08:00
chrisrd 5666a994f2 Increment zil_itx_needcopy_bytes properly
In zil_lwb_commit() with TX_WRITE, we copy the log write record (lrw)
into the log write block (lwb) and send it off using zil_lwb_add_txg().
If we also have WR_NEED_COPY, we additionally copy the lrw's data into
the lwb to be sent off.  If the lrw + data doesn't fit into the lwb, we
send the lrw and as much data as will fit (dnow bytes), then go back
and do the same with the remaining data.

Each time through this loop we're sending dnow data bytes. I.e.
zil_itx_needcopy_bytes should be incremented by dnow.
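
A simplified model of that loop, assuming every lwb offers the same amount of
free space for data and ignoring the record header. The function and
parameter names are hypothetical and do not mirror the ZIL code:

```python
def needcopy_bytes(data_len, lwb_space):
    """Total bytes counted when data_len bytes are copied in dnow-sized chunks."""
    total = 0
    remaining = data_len
    while remaining > 0:
        dnow = min(remaining, lwb_space)  # as much data as fits in this lwb
        total += dnow                     # the counter grows by dnow each pass
        remaining -= dnow
    return total


print(needcopy_bytes(10000, 4096))  # 10000
```

Incrementing by dnow on each pass makes the counter add up to exactly the
number of data bytes copied, however many lwbs the write spans.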

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes #6988 
Closes #7176
2018-03-02 10:01:53 -08:00
chrisrd d0f6fbaff3 ZTS: fix spurious failures in mv_files
The test could fail because of a race condition between the files being
generated in the background and attempting to move the files. Wait for
all file generation to complete before trying to move the files around.

Also, clean up the waiting: the 'wait' command without arguments waits
for all child pids.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes #7220 
Closes #7242 
Closes #7258
2018-03-02 09:57:29 -08:00
John Wren Kennedy e086e717c3 Add ZFS perf test for dbuf cache
This change adds a test for sequential reads out of the dbuf cache.
It's essentially a copy of sequential_reads_cached, using a smaller
data set. The sequential read tests are renamed to differentiate them.

Authored by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Wren Kennedy <john.kennedy@delphix.com>
Closes #7225
2018-02-28 10:38:37 -08:00
John Eismeier d699aaef09 Fix some typos
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: George Melikov <mail@gmelikov.ru>
Signed-off-by: John Eismeier <john.eismeier@gmail.com>
Closes #7237
2018-02-28 08:57:10 -08:00
Tomohiro Kusumi d72cd017dd Fix zpool(8) list example to match actual format
a05dfd00 (Illumos 5147) has swapped FRAG and EXPANDSZ,
so it's natural to modify these examples.

 # zpool list | head -1
 NAME     SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
                              ^^^^^^^^^^^^^^^

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #7244
2018-02-28 08:54:53 -08:00
Scot W. Stevenson 19528cf949 Add Python 3 rewrite of arc_summary.py
Add new script arc_summary3.py as a complete rewrite of the
arc_summary.py tool (see issue #6873)

Add new options:

        -g/--graph    - Display crude graphic representation
                        of ARC status and quit
        -r/--raw      - Print all available information as
                        minimally formatted list (for grep)
        -s/--section  - Print a single section. This
                        replaces -p/--page, which is kept for
                        backwards compatibility but marked as
                        deprecated

Add new sections with information on ZIL and SPL. Notify user
if sections L2ARC and VDEV are skipped instead of failing
silently. Add warning that -p/--page option is deprecated.

Developed for Python 3.5.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Scot W. Stevenson <scot.stevenson@gmail.com>
Closes #6873 
Closes #6892
2018-02-28 08:52:34 -08:00
Tony Hutter 3e9c9d8a89 Add SMART self-test results to zpool status -c
Add in SMART self-test results to zpool status|iostat -c.  This
works for both SAS and SATA drives.

Also, add plumbing to allow the 'smart' script to take smartctl
output from a directory of output text files instead of running
it against the vdevs.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #7178
2018-02-27 09:31:27 -08:00
Tom Caputi 095495e008 Raw DRR_OBJECT records must write raw data
b1d21733 made it possible for empty metadnode blocks to be
compressed to a hole, fixing a bug that would cause invalid
metadnode MACs when a send stream attempted to free objects
and allowing the blocks to be reclaimed when they were no
longer needed. However, this patch also introduced a race
condition; if a txg sync occurred after a DRR_OBJECT_RANGE
record was received but before any objects were added, the
metadnode block would be compressed to a hole and lose all
of its encryption parameters. This would cause subsequent
DRR_OBJECT records to fail when they attempted to write
their data into an unencrypted block. This patch defers the
DRR_OBJECT_RANGE handling to receive_object() so that the
encryption parameters are set with each object that is
written into that block.

Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7215 
Closes #7236
2018-02-27 09:04:05 -08:00
Tim Chase 8b5814393f Incorrect maximum DVA value in DDE_GET_NDVAS()
The conditional was reversed which caused garbage values to be used when
calculating dds_ref_dsize.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #7234
2018-02-26 14:20:12 -08:00
LOLi 4af6873af6 Fix segfault in zfs_do_bookmark()
When invoked with wrong parameters 'zfs bookmark' fails to gracefully
validate user input and crashes.

This is a regression accidentally introduced in 587e228; this commit
adds additional tests to the ZFS Test Suite to exercise this codepath.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: KireinaHoro <i@jsteward.moe>
Signed-off-by:  loli10K <ezomori.nozomu@gmail.com>
Closes #7228 
Closes #7229
2018-02-26 09:55:18 -08:00
Brian Behlendorf 2a0428f16b ZTS: Fix zfs_share_* test case failures
Prevent false positives when running the zfs_share_* test
cases due to leftover stale /var/lib/nfs/etab entries.  When
starting the test group re-synchronize the /var/lib/nfs/etab
file with /etc/exports.  At this point in the testing there
will be no additional `zfs share` entries to add.

Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7226
2018-02-24 10:07:12 -08:00
Brian Behlendorf 3673d03285 Fix more cstyle warnings
This patch contains no functional changes.  It is solely intended
to resolve cstyle warnings in order to facilitate moving the spl
source code in to the zfs repository.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #687
2018-02-24 10:05:37 -08:00
Kash Pande 41532e5a29 Shellcheck cleanup for initrd scripts
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Co-authored-by: Kash Pande <kash@tripleback.net>
Co-authored-by: Matthew Thode <mthode@mthode.org>
Signed-off-by: Kash Pande <kash@tripleback.net>
Signed-off-by: Matthew Thode <mthode@mthode.org>
Closes #7214
2018-02-23 12:57:41 -08:00
Kash Pande 7280d58197 Enable booting from nested encrypted datasets
- enable booting from nested encrypted datasets
- fix plymouth boot splash passphrase entry
- optimize unlock process

Co-authored-by: Kash Pande <kash@tripleback.net>
Co-authored-by: Matthew Thode <mthode@mthode.org>
Signed-off-by: Kash Pande <kash@tripleback.net>
Signed-off-by: Matthew Thode <mthode@mthode.org>
Closes #7214
2018-02-23 12:57:28 -08:00
Tony Hutter bf95a000c4 Add scrub after resilver zed script
* Add a zed script to kick off a scrub after a resilver.  The script is
disabled by default.

* Add an optional $PATH (-P) option to zed to allow it to use a custom
$PATH for its zedlets.  This is needed when you're running zed under
the ZTS in a local workspace.

* Update test scripts to not copy in all-debug.sh and all-syslog.sh by
default.  They can be optionally copied in as part of zed_setup().
These scripts slow down zed considerably under heavy event loads and
can cause events to be dropped or their delivery delayed. This was
causing some sporadic failures in the 'fault' tests.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #4662 
Closes #7086
2018-02-23 11:38:05 -08:00
chrisrd e9a7729008 Fix free memory calculation on v3.14+
Provide infrastructure to auto-configure to enum and API changes in the
global page stats used for our free memory calculations.

arc_free_memory has been broken since an API change in Linux v3.14:

2016-07-28 v4.8 599d0c95 mm, vmscan: move LRU lists to node
2016-07-28 v4.8 75ef7184 mm, vmstat: add infrastructure for per-node
  vmstats

These commits moved some of global_page_state() into
global_node_page_state(). The API change was particularly egregious as,
instead of breaking the old code, it silently did the wrong thing and we
continued using global_page_state() where we should have been using
global_node_page_state(), thus indexing into the wrong array via
NR_SLAB_RECLAIMABLE et al.

There have been further API changes along the way:

2017-07-06 v4.13 385386cf mm: vmstat: move slab statistics from zone to
  node counters
2017-09-06 v4.14 c41f012a mm: rename global_page_state to
  global_zone_page_state

...and various (incomplete, as it turns out) attempts to accommodate
these changes in ZoL:

2017-08-24 2209e409 Linux 4.8+ compatibility fix for vm stats
2017-09-16 787acae0 Linux 3.14 compat: IO acct, global_page_state, etc
2017-09-19 661907e6 Linux 4.14 compat: IO acct, global_page_state, etc

The config infrastructure provided here resolves these issues going back
to the original API change in v3.14 and is robust against further Linux
changes in this area.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes #7170
2018-02-23 08:50:06 -08:00
Olaf Faaland 7088545d01 Report duration and error in mmp_history entries
After an MMP write completes, update the relevant mmp_history entry
with the time between submission and completion, and the error
status of the write.

[faaland1@toss3a zfs]$ cat /proc/spl/kstat/zfs/pool/multihost
39 0 0x01 100 8800 69147946270893 72723903122926
id       txg     timestamp  error  duration   mmp_delay    vdev_guid
10607    1166    1518985089 0      138301     637785455    4882...
10608    1166    1518985089 0      136154     635407747    1151...
10609    1166    1518985089 0      803618560  633048078    9740...
10610    1166    1518985090 0      144826     633048078    4882...
10611    1166    1518985090 0      164527     666187671    1151...

Where duration = gethrtime_in_done_fn - gethrtime_at_submission, and
error = zio->io_error.
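
The sample output above can also be read programmatically. The following is
a minimal sketch, assuming the column layout shown in the sample and
assuming that duration is in nanoseconds (inferred from the use of
gethrtime; the commit does not state the unit). The function names
parse_mmp_history and slow_writes are hypothetical, and vdev_guid is kept as
a string because it is truncated in the sample.

```python
SAMPLE = """\
39 0 0x01 100 8800 69147946270893 72723903122926
id       txg     timestamp  error  duration   mmp_delay    vdev_guid
10607    1166    1518985089 0      138301     637785455    4882...
10608    1166    1518985089 0      136154     635407747    1151...
10609    1166    1518985089 0      803618560  633048078    9740...
"""

INT_COLUMNS = ("id", "txg", "timestamp", "error", "duration", "mmp_delay")


def parse_mmp_history(text):
    """Parse the kstat text into a list of dicts, one per MMP write."""
    lines = text.splitlines()
    # Line 0 is the kstat header; line 1 names the columns.
    columns = lines[1].split()
    rows = []
    for line in lines[2:]:
        fields = line.split()
        if len(fields) != len(columns):
            continue  # skip anything that does not look like a data row
        row = dict(zip(columns, fields))
        for key in INT_COLUMNS:
            row[key] = int(row[key])
        rows.append(row)
    return rows


def slow_writes(rows, threshold_ns):
    """Return the ids of writes whose duration exceeded threshold_ns."""
    return [r["id"] for r in rows if r["duration"] > threshold_ns]


rows = parse_mmp_history(SAMPLE)
print(slow_writes(rows, 100_000_000))
```

With a threshold of 100 ms, only entry 10609 in the sample is reported.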

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7190
2018-02-22 15:34:34 -08:00
Olaf Faaland 0d398b2564 Do not initiate MMP writes while pool is suspended
While the pool is suspended on host A, it may be imported on host B.
If host A continued to write MMP blocks, it would be blindly
overwriting MMP blocks written by host B, and the blocks written by
host A would have outdated txg information.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7182
2018-02-22 09:14:46 -08:00
Tony Hutter a5369b61a2 Linux 4.16 compat: use correct *_dec_and_test()
Use refcount_dec_and_test() on 4.16+ kernels, atomic_dec_and_test()
on older kernels.  https://lwn.net/Articles/714974/

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes: #7179 
Closes: #7211
2018-02-22 09:02:06 -08:00
Tom Caputi f8478fc2ca Fix bounds check in zio_crypt_do_objset_hmacs
The current bounds check in zio_crypt_do_objset_hmacs() does not
properly handle the possible sizes of the objset_phys_t and
can therefore read outside the buffer's memory. If that memory
happened to match what the check was actually looking for, the
objset would fail to be owned, complaining that the MAC was
invalid.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7210
2018-02-22 08:50:14 -08:00
Tomohiro Kusumi 378c6ed549 Fix function name typos
vn_init() and vn_fini() had been renamed by 12ff95ff in 2011.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #686
2018-02-21 15:12:10 -08:00
Tomohiro Kusumi 68386b0503 Staticize kstat_default_update()
This is only used via ->ks_update of `kstat_t *`.
This isn't exported nor do headers have its prototype.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #686
2018-02-21 15:11:38 -08:00
Toomas Soome 09302a4ca8 OpenZFS 9035 - zfs: this statement may fall through
Authored by: Toomas Soome <tsoome@me.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Reviewed by: Andy Fiddaman <omnios@citrus-it.co.uk>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9035
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/46ac8fdfc5
Closes #7206
2018-02-21 14:55:34 -08:00
DeHackEd 2b5cd5990f Fix multiple evaluations of VERIFY() and ASSERT() on failures
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes #684 
Closes #685
2018-02-21 14:54:26 -08:00
Matthew Thode a2819058f5 Allow modprobe to fail when called within systemd
This allows for systems with zfs built into the kernel manually to run
these services.  Otherwise the service will fail to start.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Thode <mthode@mthode.org>
Closes #7174
2018-02-21 14:45:35 -08:00
bunder2015 ca0b376604 Add SMART attributes for SSD and NVMe
This adds the SMART attributes required to probe Samsung SSD and NVMe
(and possibly others) disks when using the "zpool status -c" command.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #7183 
Closes #7193
2018-02-21 13:52:47 -08:00
chrisrd 26cb4b8791 Allow make checkstyle and paxscript in build dir
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes #7202
2018-02-21 12:35:59 -08:00
LOLi faa97c1619 Want 'zfs send -b'
This change implements 'zfs send -b' which can be used to send only
received property values whether or not they are overridden by local
settings.

This can be very useful during "restore" operations from a backup pool
because it allows to send only the property values originally sent
from the backup source, even though they were later modified on the
destination either by a 'zfs set' operation, explicit 'zfs inherit' or
overridden during the receive process via 'zfs receive -o|-x'.
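
The distinction between received and locally overridden values can be
pictured with a toy model. This is only a sketch: properties_to_send is a
hypothetical helper, the two dictionaries stand in for the received and
local property sources, and the non-backup branch is a simplification of
what a property-carrying send includes.

```python
def properties_to_send(received, local, backup=False):
    """Pick property values for a send stream in a toy model.

    received: values that arrived with an earlier receive
    local:    values set afterwards on the destination (e.g. via 'zfs set')
    backup:   when True, mimic the idea of 'zfs send -b' and ignore
              local overrides entirely
    """
    if backup:
        return dict(received)
    effective = dict(received)
    effective.update(local)  # local settings win in the normal case
    return effective


received = {"compression": "lz4", "atime": "off"}
local = {"compression": "gzip"}

print(properties_to_send(received, local))
print(properties_to_send(received, local, backup=True))
```

In this model the backup variant returns the original compression=lz4 even
though the destination later changed it to gzip.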

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7156
2018-02-21 12:32:06 -08:00
Tom Caputi b0918402dc Raw receive should change key atomically
Currently, raw zfs sends transfer the encrypted master keys and
objset_phys_t encryption parameters in the DRR_BEGIN payload of
each send file. Both of these are processed as soon as they are
read in dmu_recv_stream(), meaning that the new keys are set
before the new snapshot is received. In addition to the fact that
this changes the user's keys for the dataset earlier than they
might expect, the keys were never reset to what they originally
were in the event that the receive failed. This patch splits the
processing into objset handling and key handling, the latter of
which is moved to dmu_recv_end() so that the key change can be
done atomically.
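
The idea of deferring the key change to the end of the receive can be
sketched as follows. This is an illustrative toy, not the ZFS code: the
Dataset class, the receive function, and the use of None as a stand-in for
a bad record are all hypothetical.

```python
class Dataset:
    def __init__(self, key):
        self.key = key
        self.snapshots = []


def receive(dataset, stream_key, records):
    """Toy receive: stage the records first, commit data and key last."""
    staged = []
    for record in records:
        if record is None:  # stand-in for a corrupt or truncated record
            raise ValueError("receive failed")
        staged.append(record)
    # Only reached when every record was processed, so a failed receive
    # never leaves the dataset with the new key.
    dataset.snapshots.append(staged)
    dataset.key = stream_key


ds = Dataset(key="old-key")
try:
    receive(ds, "new-key", ["rec1", None])
except ValueError:
    pass
print(ds.key)  # the failed receive left the key untouched

receive(ds, "new-key", ["rec1", "rec2"])
print(ds.key)
```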

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7200
2018-02-21 12:31:03 -08:00
Tom Caputi 4a385862b7 Prevent raw zfs recv -F if dataset is unencrypted
The current design of ZFS encryption only allows a dataset to
have one DSL Crypto Key at a time. As a result, it is important
that the zfs receive code ensures that only one key can be in use
at a time for a given DSL Directory. zfs receive -F complicates
this, since the new dataset is received as a clone of the existing
one so that an atomic switch can be done at the end. To prevent
confusion about which dataset is actually encrypted a check was
added to ensure that encrypted datasets cannot use zfs recv -F to
completely replace existing datasets. Unfortunately, the check did
not take into account unencrypted datasets being overridden by
encrypted ones as a case.

Along the same lines, the code also failed to ensure that raw
receives could not be done on top of existing unencrypted
datasets, which causes many problems since the new stream cannot
be decrypted.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7199
2018-02-21 12:30:11 -08:00
Tom Caputi b1d217338a Raw receives must compress metadnode blocks
Currently, the DMU relies on ZIO layer compression to free L0
dnode blocks that no longer have objects in them. However,
raw receives disable all compression, meaning that these blocks
can never be freed. In addition to the obvious space concerns,
this could also cause incremental raw receives to fail to mount
since the MAC of a hole is different from that of a completely
zeroed block.

This patch corrects this issue by adding a special case in
zio_write_compress() which will attempt to compress these blocks
to a hole even if ZIO_FLAG_RAW_ENCRYPT is set. This patch also
removes the zfs_mdcomp_disable tunable, since tuning it could
cause these same issues.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7198
2018-02-21 12:28:52 -08:00
Tom Caputi 5121c4fb0c Remove unnecessary txg syncs from receive_object()
1b66810b introduced several changes which improved the reliability
of zfs sends when large dnodes were involved. However, these fixes
required adding a few calls to txg_wait_synced() in the DRR_OBJECT
handling code. Although most of them are currently necessary, this
patch allows the code to continue without waiting in some cases
where it doesn't have to.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7197
2018-02-21 12:26:51 -08:00
Tom Caputi 478b3150de Add omitted set for os->os_next_write_raw
This one line patch adds a set to os->os_next_write_raw
that was omitted when the code was updated in 1b66810. Without
it, the code (in some instances) could attempt to write raw
encrypted data as regular unencrypted data without the keys
being loaded, triggering an ASSERT in zio_encrypt().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7196
2018-02-21 12:24:37 -08:00
Giuseppe Di Natale f2c0dee23b Correct count_uberblocks in mmp.kshlib
A log_must call was causing count_uberblocks to return more
than just the uberblock count. Remove the log_must since it
was only logging a sleep.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #7191
2018-02-20 16:28:52 -08:00
Tom Caputi 163a8c28dd ZIL claiming should not start user accounting
Currently, ZIL claiming dirties objsets which causes
dsl_pool_sync() to attempt to perform user accounting on
them. This causes problems for encrypted datasets that were
raw received before the system went offline since they
cannot perform user accounting until they have their keys
loaded. This triggers an ASSERT in zio_encrypt(). Since
encryption was added, the code now depends on the fact that
data should only be written when objsets are owned. This
patch adds a check in dmu_objset_do_userquota_updates()
to ensure that user accounting is only done when the objsets
are actually owned for write. As part of this work, the
zfsvfs and zvol code was updated so that it no longer lies
about owning objsets readonly.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6916 
Closes #7163
2018-02-20 16:27:31 -08:00
Don Brady cbce581353 Fix coverity defects: zfs channel programs
CID 173243, 173245:  Memory - corruptions  (OVERRUN)
 Added size argument to lcompat_sprintf() to avoid use of INT_MAX

CID 173244:  Integer handling issues  (OVERFLOW_BEFORE_WIDEN)
 Added cast to uint64_t to avoid a 32 bit overflow warning

CID 173242:  Integer handling issues  (CONSTANT_EXPRESSION_RESULT)
 Conditionally removed unused luai_numisnan() floating point check

CID 173241:  Resource leaks  (RESOURCE_LEAK)
 Added missing close(fd) on error path

CID 173240:    (UNINIT)
Fixed uninitialized variable in get_special_prop()

CID 147560:  Null pointer dereferences  (NULL_RETURNS)
Cleaned up bad code merge in dsl_dataset_promote_check()

CID 28475:  Memory - illegal accesses  (OVERRUN)
Fixed lcompat_sprintf() to use a size parameter

CID 28418, 28422:  Error handling issues  (CHECKED_RETURN)
Added function result cast to (void) to avoid warning

CID 23935, 28411, 28412:  Memory - corruptions  (ARRAY_VS_SINGLETON)
Added casts to avoid exposing result as an array

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #7181
2018-02-20 11:19:42 -08:00
Tom Caputi 7b30ee6baf Project dnode should be protected by local MAC
This patch corrects a small security issue with 9c5167d1. When the
project dnode was added to the objset_phys_t, it was not included
in the local MAC for cryptographic protection, allowing an attacker
to modify this data without the consent of the key holder. This
patch does represent an on-disk format change for anyone using
project dnodes on an encrypted dataset.

Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7177
2018-02-20 09:41:07 -08:00
chrisrd e921f6508b Fix config issues: frame size and headers
1. With various (debug and/or tracing?) kernel options enabled it's
possible for 'struct inode' and 'struct super_block' to exceed the
default frame size, leaving errors like this in config.log:

build/conftest.c:116:1: error: the frame size of 1048 bytes is larger
than 1024 bytes [-Werror=frame-larger-than=]

Fix this by removing the frame size warning for config checks

2. Without the correct headers included, it's possible for declarations
to be missed, leaving errors like this in the config.log:

build/conftest.c:131:14: error: ‘struct nameidata’ declared inside
parameter list [-Werror]

Fix this by adding appropriate headers.

Note: Both these issues can result in silent config failures because
the compile failure is taken to mean "this option is not supported by
this kernel" rather than "there's something wrong with the config
test". This can lead to anything from merely annoying (compile
failures) to potentially serious (miscompiled or misused kernel
primitives or functions). E.g. the fixes included here resulted in these
additional defines in zfs_config.h with linux v4.14.19:

Also, drive-by whitespace fixes in config/* files which don't mention
"GNU" (those ones look to be imported from elsewhere so leave them
alone).

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes #7169
2018-02-15 12:58:23 -08:00
Don Brady 62d5c55313 Address objtool check failures in lua module
The use of void __attribute__((noreturn)) in kernel builds
was causing lots of warnings if CONFIG_STACK_VALIDATION
is active. For now we just remove this attribute to achieve
clean builds for the Lua module. There was no significant
increase in the time to run the full channel_program ZTS tests.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #7173
2018-02-15 09:53:04 -08:00
Olaf Faaland ec7c1b914c Clarify zinject(8) explanation of -e
Error injection of EIO or ENXIO simply sets the zio's io_error value,
rather than preventing the read or write from occurring.  This is
important information as it affects how the probes must be used.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #7172
2018-02-15 09:50:06 -08:00
George Wilson ddc751d56b OpenZFS 8857 - zio_remove_child() panic due to already destroyed parent zio
PROBLEM
=======
It's possible for a parent zio to complete even though it has children
which have not completed. This can result in the following panic:
    > $C
    ffffff01809128c0 vpanic()
    ffffff01809128e0 mutex_panic+0x58(fffffffffb94c904, ffffff597dde7f80)
    ffffff0180912950 mutex_vector_enter+0x347(ffffff597dde7f80)
    ffffff01809129b0 zio_remove_child+0x50(ffffff597dde7c58, ffffff32bd901ac0,
    ffffff3373370908)
    ffffff0180912a40 zio_done+0x390(ffffff32bd901ac0)
    ffffff0180912a70 zio_execute+0x78(ffffff32bd901ac0)
    ffffff0180912b30 taskq_thread+0x2d0(ffffff33bae44140)
    ffffff0180912b40 thread_start+8()
    > ::status
    debugging crash dump vmcore.2 (64-bit) from batfs0390
    operating system: 5.11 joyent_20170911T171900Z (i86pc)
    image uuid: (not set)
    panic message: mutex_enter: bad mutex, lp=ffffff597dde7f80
    owner=ffffff3c59b39480 thread=ffffff0180912c40
    dump content: kernel pages only
The problem is that dbuf_prefetch along with l2arc can create a zio tree
which confuses the parent zio and allows it to complete while children
still exist. Here's the scenario:
    zio tree:
        pio
         |--- lio
The parent zio, pio, has entered the zio_done stage and begins to check its
children to see if there are still some that have not completed. In zio_done(),
the children are checked in the following order:
    zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE)
    zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_DONE)
    zio_wait_for_children(zio, ZIO_CHILD_DDT, ZIO_WAIT_DONE)
    zio_wait_for_children(zio, ZIO_CHILD_LOGICAL, ZIO_WAIT_DONE)
If pio finds any child which has not completed then it stops executing and
goes to sleep. Each call to zio_wait_for_children() will grab the io_lock
while checking the particular child.
In this scenario, the pio has completed the first call to
zio_wait_for_children() to check for any ZIO_CHILD_VDEV children. Since
the only zio in the zio tree right now is the logical zio, lio, then it
completes that call and prepares to check the next child type.
In the meantime, the lio completes and in its callback creates a child vdev
zio, cio. The zio tree looks like this:
    zio tree:
        pio
         |--- lio
         |--- cio
The lio then grabs the parent's io_lock and removes itself.
    zio tree:
        pio
         |--- cio
The pio continues to run but has already completed its check for ZIO_CHILD_VDEV
and will erroneously complete. When the child zio, cio, completes it will panic
the system trying to reference the parent zio which has been destroyed.
SOLUTION
========
The fix is to rework the zio_wait_for_children() logic to accept a bitfield
for all the children types that it's interested in checking. The
io_lock is held the entire time we check all the children types. Since
the function now accepts a bitfield, a simple ZIO_CHILD_BIT() macro is provided
to allow for the conversion between a ZIO_CHILD type and the bitfield used by
the zio_wait_for_children logic.
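
The race and the bitfield approach can be modelled in a few lines. This is
a simplified sketch rather than the kernel code: ParentZio, child_bit and
must_wait are hypothetical names, the enum values are illustrative, and a
Python lock stands in for io_lock.

```python
import threading
from enum import IntEnum


class ChildType(IntEnum):
    VDEV = 0
    GANG = 1
    DDT = 2
    LOGICAL = 3


def child_bit(child_type):
    """Convert one child type to its bit, in the spirit of ZIO_CHILD_BIT()."""
    return 1 << int(child_type)


ALL_CHILD_BITS = 0
for _t in ChildType:
    ALL_CHILD_BITS |= child_bit(_t)


class ParentZio:
    def __init__(self):
        self.io_lock = threading.Lock()
        self.pending = {t: 0 for t in ChildType}

    def add_child(self, child_type):
        with self.io_lock:
            self.pending[child_type] += 1

    def remove_child(self, child_type):
        with self.io_lock:
            self.pending[child_type] -= 1

    def must_wait(self, childbits):
        """True if any selected child type is pending.

        The lock is held across the whole scan, so no child can be added
        or removed between the checks of two different types.
        """
        with self.io_lock:
            return any(self.pending[t] > 0
                       for t in ChildType if childbits & child_bit(t))


pio = ParentZio()
pio.add_child(ChildType.LOGICAL)                            # lio
vdev_clear = not pio.must_wait(child_bit(ChildType.VDEV))   # first per-type check
pio.add_child(ChildType.VDEV)                               # lio creates cio
pio.remove_child(ChildType.LOGICAL)                         # lio removes itself
logical_clear = not pio.must_wait(child_bit(ChildType.LOGICAL))

# Checking one type at a time saw "no children" twice, yet cio is pending.
print(vdev_clear, logical_clear, pio.must_wait(ALL_CHILD_BITS))
```

The per-type checks both report no outstanding children, which is the stale
view that let pio complete early; the single combined check still sees cio.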

Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Youzhong Yang <youzhong@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8857
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/862ff6d99c
Issue #5918
Closes #7168
2018-02-14 15:30:09 -08:00
loli10K 4411de2116 OpenZFS 8940 - Sending an intra-pool resumable send stream may result in EXDEV
Because resuming from a token requires "guid" -> "snapshot" mapping
we have to walk the whole dataset hierarchy to find the right snapshot
to send; when both source and destination exist, for an incremental
resumable stream, libzfs gets confused and picks up the wrong snapshot
to send from: this results in attempting to send
   "destination@snap1 -> source@snap2"
instead of
   "source@snap1 -> source@snap2"
which fails with an "Invalid cross-device link" error (EXDEV).
Fix this by adjusting the logic behind dataset traversal in
zfs_iter_children() to pick the right snapshot to send from.
Additionally update dry-run 'zfs send -t' to print its output to
stderr: this is consistent with other dry-run commands.

Patch Notes:
Reconciled differences between OpenZFS and
aee1dd4d98.

Authored by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.home-2000.org>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8940
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9f7867c206
Closes #7171
2018-02-14 14:35:04 -08:00
Nasf-Fan 9c5167d19f Project Quota on ZFS
Project quota is a new ZFS system space/object usage accounting
and enforcement mechanism. Similar to user/group quota, project
quota is another dimension of system quota. It is based on the new
object attribute - project ID.

Project ID is a numerical value that indicates which project an
object belongs to. An object can belong to only one project, though
you (the object owner or privileged user) can change the object
project ID via 'chattr -p' or 'zfs project [-s] -p' explicitly.
The object can also inherit the project ID from its parent when
created if the parent has the project inherit flag (that can be
set via 'chattr +P' or 'zfs project -s [-p]').

By accounting for the space/objects that belong to the same project,
we can know how much space and how many objects the project uses. If
we set an upper limit, we can control the space/objects consumed by
that project. This is useful when multiple groups and users cooperate
on the same project, or when a user/group needs to participate in
multiple projects.
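
The accounting and enforcement idea can be illustrated with a small model.
This is a sketch under stated assumptions, not the ZFS implementation: the
ProjectAccounting class and QuotaExceeded exception are hypothetical, and
the four dictionaries merely mirror the projectused, projectobjused,
projectquota and projectobjquota values.

```python
class QuotaExceeded(Exception):
    pass


class ProjectAccounting:
    def __init__(self):
        self.used = {}       # project ID -> bytes used
        self.objused = {}    # project ID -> number of objects
        self.quota = {}      # project ID -> byte limit
        self.objquota = {}   # project ID -> object limit

    def set_limits(self, projid, quota=None, objquota=None):
        if quota is not None:
            self.quota[projid] = quota
        if objquota is not None:
            self.objquota[projid] = objquota

    def create_object(self, projid, nbytes):
        """Charge one new object of nbytes to projid, or refuse it."""
        new_used = self.used.get(projid, 0) + nbytes
        new_objs = self.objused.get(projid, 0) + 1
        if projid in self.quota and new_used > self.quota[projid]:
            raise QuotaExceeded("space quota exceeded for project %d" % projid)
        if projid in self.objquota and new_objs > self.objquota[projid]:
            raise QuotaExceeded("object quota exceeded for project %d" % projid)
        self.used[projid] = new_used
        self.objused[projid] = new_objs


acct = ProjectAccounting()
acct.set_limits(42, quota=100, objquota=2)
acct.create_object(42, 60)
try:
    acct.create_object(42, 50)   # 60 + 50 would exceed the 100-byte limit
except QuotaExceeded as exc:
    print(exc)
print(acct.used[42], acct.objused[42])
```

Every object is charged to exactly one project ID, so files owned by
different users and groups all count against the same limits.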

Support the following commands and functionalities:

zfs set projectquota@project
zfs set projectobjquota@project

zfs get projectquota@project
zfs get projectobjquota@project
zfs get projectused@project
zfs get projectobjused@project

zfs projectspace

zfs allow projectquota
zfs allow projectobjquota
zfs allow projectused
zfs allow projectobjused

zfs unallow projectquota
zfs unallow projectobjquota
zfs unallow projectused
zfs unallow projectobjused

chattr +/-P
chattr -p project_id
lsattr -p

This patch also supports tree quota based on the project quota via
the following "zfs project" commands:
zfs project [-d|-r] <file|directory ...>
zfs project -C [-k] [-r] <file|directory ...>
zfs project -c [-0] [-d|-r] [-p id] <file|directory ...>
zfs project [-p id] [-r] [-s] <file|directory ...>

For the "df [-i] $DIR" command, if the INHERIT (project ID) flag is set
on $DIR, then the project [obj]quota and [obj]used values for the
$DIR's project ID will be shown as the total/free (avail) resource.
This matches the behavior of EXT4/XFS.

Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Fan Yong <fan.yong@intel.com>
TEST_ZIMPORT_POOLS="zol-0.6.1 zol-0.6.2 master"
Change-Id: Ib4f0544602e03fb61fd46a849d7ba51a6005693c
Closes #6290
2018-02-13 14:54:54 -08:00
LOLi c03f04708c 'zfs receive' fails with "dataset is busy"
Receiving an incremental stream after an interrupted "zfs receive -s"
fails with the message "dataset is busy": this is because we still have
the hidden clone ../%recv from the resumable receive.

Improve the error message suggesting the existence of a partially
complete resumable stream from "zfs receive -s" which can be either
aborted ("zfs receive -A") or resumed ("zfs send -t").

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7129 
Closes #7154
2018-02-12 12:28:59 -08:00
LOLi a893627fac contrib/initramfs: add missing conf.d/zfs
When upgrading from the distribution-provided zfs-initramfs package on
root-on-zfs Ubuntu and Debian the system may fail to boot: this change
adds the missing initramfs configuration file.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7158
2018-02-12 11:40:00 -08:00
sanjeevbagewadi 918dbe35b5 mmp should use a fixed tag for spa_config locks
mmp_write_uberblock() and mmp_write_done() should use the same tag
for spa_config_locks.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com>
Closes #6530 
Closes #7155
2018-02-12 11:30:38 -08:00
John Wren Kennedy ba779f7f71 OpenZFS 9004 - Some ZFS tests used files removed with 32 bit kernel
Authored by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: George Melikov <mail@gmelikov.ru>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/9004
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/fafe9b241f
Closes #7149
2018-02-09 10:28:44 -08:00
Andriy Gapon 1334283225 OpenZFS 8520 - lzc_rollback
8520 lzc_rollback_to should support rolling back to origin
7198 libzfs should gracefully handle EINVAL from lzc_rollback

lzc_rollback_to() should support rolling back to a clone's origin.
The current checks in zfs_ioc_rollback() would not allow that
because the origin snapshot belongs to a different filesystem.
The overly restrictive check was introduced in 7600, but it
was not a regression as none of the existing tools provided a
way to rollback to the origin.

Authored by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8520
OpenZFS-issue: https://www.illumos.org/issues/7198
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/78a5a1a25a
Closes #7150
2018-02-09 10:27:58 -08:00
sanjeevbagewadi cc63068e95 Handle zap_add() failures in mixed case mode
With "casesensitivity=mixed", zap_add() could fail when the number of
files/directories with the same name (varying in case) exceeds the
capacity of the leaf node of a Fatzap. This results in an ASSERT()
failure as zfs_link_create() does not expect zap_add() to fail. The fix
is to handle these failures and roll back the transactions.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Chunwei Chen <david.chen@nutanix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com>
Closes #7011 
Closes #7054
2018-02-09 10:15:53 -08:00
Chunwei Chen eb9c4532dd Fix zdb -ed on objset for exported pool
zdb -ed on objset for exported pool would fail with:
  failed to own dataset 'qq/fs0': No such file or directory

The reason is that zdb passes the objset name to spa_import, which uses
that name to create a spa. Later, when dmu_objset_own tries to look up
the spa using the real pool name, it can't find one.

We fix this by making sure we pass the pool name rather than the objset
name to spa_import.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #7099
Closes #6464
2018-02-09 10:11:34 -08:00
Chunwei Chen 5e3bd0e684 Fix zdb -E segfault
SPA_MAXBLOCKSIZE is too large for stack.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #7099
2018-02-09 10:11:19 -08:00
Chunwei Chen 950e17c215 Fix zdb -R decompression
There are some issues in the zdb -R decompression implementation.

The first is that ZLE can easily decompress non-ZLE streams. So we add
ZDB_NO_ZLE env to make zdb skip ZLE.

The second is the random bytes appended to pabd and the pbuf2 stuff. These
serve no purpose at all; those bytes shouldn't be read during
decompression anyway. Instead, we randomize lbuf2, so that we can make
sure decompression fills exactly to lsize by bcmp of lbuf and lbuf2.

The last is that the condition used to detect failure is wrong.
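
The two-buffer check described above can be sketched roughly as follows.
This is an illustration only, not the zdb code: zlib stands in for the
real decompressors, and decompress_into() and fills_exactly() are
hypothetical helper names.

```python
import os
import zlib


def decompress_into(src: bytes, dst: bytearray) -> bool:
    """Toy decompressor: write the output over the start of dst.

    Bytes of dst past the end of the output are left untouched.
    Returns False if the output does not fit in dst.
    """
    out = zlib.decompress(src)
    if len(out) > len(dst):
        return False
    dst[:len(out)] = out
    return True


def fills_exactly(src: bytes, lsize: int) -> bool:
    # Two destination buffers with different initial contents: one zeroed,
    # one random (the role lbuf2 plays in the commit message).
    lbuf = bytearray(lsize)
    lbuf2 = bytearray(os.urandom(lsize))
    ok = decompress_into(src, lbuf) and decompress_into(src, lbuf2)
    # Any byte the decompressor did not write keeps its initial value, so
    # the buffers only compare equal (barring an unlikely random collision)
    # when all lsize bytes were written.
    return ok and lbuf == lbuf2
```

A guessed lsize that is too large leaves unwritten random bytes in the
second buffer, so the comparison fails instead of silently accepting a
short result.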

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #7099
Closes #4984
2018-02-09 10:11:02 -08:00
Chunwei Chen 0c0b0ad48a Fix racy assignment of zcb.zcb_haderrors
zcb_haderrors will be modified in zdb_blkptr_done, which is
asynchronous. So we must move this assignment after zio_wait.
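
A minimal sketch of the ordering problem, using Python threads as a
stand-in for asynchronous zio completion. The names traverse() and done()
are hypothetical, and joining the threads plays the role of zio_wait.

```python
import threading


def traverse(blocks):
    state = {"haderrors": False}

    def done(block):
        # Asynchronous completion callback: records an error for bad blocks
        # (negative numbers stand in for damaged blocks here).
        if block < 0:
            state["haderrors"] = True

    workers = [threading.Thread(target=done, args=(b,)) for b in blocks]
    for w in workers:
        w.start()
    # Reading state["haderrors"] at this point would race with the callbacks.
    for w in workers:
        w.join()  # wait for all outstanding work first
    # Only now is the flag stable and safe to copy out.
    return state["haderrors"]
```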

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #7099
2018-02-09 10:08:40 -08:00
Chunwei Chen f108a49236 Fix zle_decompress out of bound access
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #7099
2018-02-09 10:08:05 -08:00
Chunwei Chen d3190c5f29 Fix zdb -c traverse stop on damaged objset root
If a corruption happens to be on a root block of an objset, zdb -c will
not correctly report the error, and it will not traverse the datasets
that come after. This is because traverse_visitbp, which does the
callback and resets the error for TRAVERSE_HARD, is skipped when
traversing the zil fails in traverse_impl.

Here's an example of what the 'zdb -eLcc' command output looks like on
a pool with a damaged objset root:

== before patch:

Traversing all blocks to verify checksums ...

Error counts:

	errno  count
block traversal size 379392 != alloc 33987072 (unreachable 33607680)

	bp count:             172
	ganged count:           0
	bp logical:       1678336      avg:   9757
	bp physical:       130560      avg:    759     compression:  12.85
	bp allocated:      379392      avg:   2205     compression:   4.42
	bp deduped:             0    ref>1:      0   deduplication:   1.00
	SPA allocated:   33987072     used:  0.80%

	additional, non-pointer bps of type 0:         71
	Dittoed blocks on same vdev: 101

== after patch:

Traversing all blocks to verify checksums ...

zdb_blkptr_cb: Got error 52 reading <54, 0, -1, 0>  -- skipping

Error counts:

	errno  count
	   52  1
block traversal size 33963520 != alloc 33987072 (unreachable 23552)

	bp count:             447
	ganged count:           0
	bp logical:      36093440      avg:  80745
	bp physical:     33699840      avg:  75391     compression:   1.07
	bp allocated:    33963520      avg:  75981     compression:   1.06
	bp deduped:             0    ref>1:      0   deduplication:   1.00
	SPA allocated:   33987072     used:  0.80%

	additional, non-pointer bps of type 0:         76
	Dittoed blocks on same vdev: 115

==

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #7099
2018-02-09 10:05:25 -08:00
John Wren Kennedy 35e0202fd7 OpenZFS 8965 - zfs_acl_ls_001_pos fails due to no longer supported grep regex
The test used \> to detect the end of a string, but this no longer works,
so use $ which works as well since the string ends the line anyway.

Authored by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Akash Ayare <aayare@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Yuri Pankov <yuripv@icloud.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8965
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/cb1204e444
Closes #7145
2018-02-08 21:29:17 -08:00
Brian Behlendorf f54976dc88 Linux 4.11 compat: avoid refcount_t name conflict
Related to commit 4859fe796, when directly using the kernel's
refcount functions in kernel compatibility code do not map
refcount_t to zfs_refcount_t.  This leads to a type mismatch.

Longer term we should consider renaming refcount_t to
zfs_refcount_t in the zfs code base.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Chunwei Chen <david.chen@nutanix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7148
2018-02-08 21:25:51 -08:00
Brian Behlendorf 18f57327e0 Linux 4.16 compat: inode_set_iversion()
A new interface was added to manipulate the version field of an
inode.  Add an inode_set_iversion() wrapper for older kernels and
use the new interface when available.

The i_version field was dropped from the trace point due to the
switch to an atomic64_t i_version type.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Chunwei Chen <david.chen@nutanix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7148
2018-02-08 21:25:19 -08:00
Serapheim Dimitropoulos 5b72a38d68 OpenZFS 8677 - Open-Context Channel Programs
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Don Brady <don.brady@delphix.com>

We want to be able to run channel programs outside of synching
context. This would greatly improve performance for channel programs
that just gather information, as they won't have to wait for synching
context anymore.

=== What is implemented?

This feature introduces the following:
- A new command line flag in "zfs program" to specify our intention
  to run in open context. (The -n option)
- A new flag/option within the channel program ioctl which selects
  the context.
- Appropriate error handling whenever we try a channel program in
  open-context that contains zfs.sync* expressions.
- Documentation for the new feature in the manual pages.

=== How do we handle zfs.sync functions in open context?

When such a function is found by the interpreter and we are running
in open context we abort the script and we spit out a descriptive
runtime error. For example, given the script below ...

arg = ...
fs = arg["argv"][1]
err = zfs.sync.destroy(fs)
msg = "destroying " .. fs .. " err=" .. err
return msg

if we run it in open context, we will get back the following error:

Channel program execution failed:
[string "channel program"]:3: running functions from the zfs.sync
submodule requires passing sync=TRUE to lzc_channel_program()
(i.e. do not specify the "-n" command line argument)
stack traceback:
            [C]: in function 'destroy'
            [string "channel program"]:3: in main chunk

=== What about testing?

We've introduced new wrappers for all channel program tests that
run each channel program as both (standard & open-context) and
expect the appropriate behavior depending on the program using
the zfs.sync module.

OpenZFS-issue: https://www.illumos.org/issues/8677
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/17a49e15
Closes #6558
2018-02-08 16:05:57 -08:00
Serapheim Dimitropoulos 8d103d8856 OpenZFS 8604 - Simplify snapshots unmounting code
Authored by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Don Brady <don.brady@delphix.com>

Every time we want to unmount a snapshot (happens during snapshot
deletion or renaming) we unnecessarily iterate through all the
mountpoints in the VFS layer (see zfs_get_vfs).

The current patch completely gets rid of that code and changes
the approach while keeping the behavior of that code path the
same. Specifically, it puts a hold on the dataset/snapshot and
gets its vfs resource reference directly, instead of linearly
searching for it. If that reference exists we attempt to unmount
it.

With the above change, it became obvious that the nvlist
manipulations that we do (add_boolean and add_nvlist) take a
significant amount of time ensuring uniqueness of every new
element even though they don't have to. Thus, we updated the
patch so those nvlists are not trying to enforce the uniqueness
of their elements.

A more complete analysis of the problem solved by this patch
can be found below:
https://sdimitro.github.io/post/snap-unmount-perf/

OpenZFS-issue: https://www.illumos.org/issues/8604
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/126118fb
2018-02-08 15:29:44 -08:00
Don Brady fc5d4b6737 Increase code coverage for Lua libraries
Add test coverage for lua libraries
Remove dead code in Lua implementation

Signed-off-by: Don Brady <don.brady@delphix.com>
2018-02-08 15:29:38 -08:00
Don Brady ee00bfb2e6 Add basic functional tests for zcp user properties
Signed-off-by: Don Brady <don.brady@delphix.com>
2018-02-08 15:29:32 -08:00
Chris Williamson 234c91c508 OpenZFS 8600 - ZFS channel programs - snapshot
Authored by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Don Brady <don.brady@delphix.com>

ZFS channel programs should be able to create snapshots.
In addition to the base snapshot functionality, this entails extra
logic to handle edge cases which were formerly not possible, such as
creating then destroying a snapshot in the same transaction sync.

OpenZFS-issue: https://www.illumos.org/issues/8600
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/68089b8b
2018-02-08 15:29:24 -08:00
Brad Lewis af07368986 OpenZFS 8592 - ZFS channel programs - rollback
Authored by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Don Brady <don.brady@delphix.com>

ZFS channel programs should be able to perform a rollback.

OpenZFS-issue: https://www.illumos.org/issues/8592
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d46b5ed6
2018-02-08 15:29:14 -08:00
Chris Williamson 475eca4908 OpenZFS 8605 - zfs channel programs fix zfs.exists
Authored by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Don Brady <don.brady@delphix.com>

zfs.exists() in channel programs doesn't return any result, and should
have a man page entry. This patch corrects zfs.exists so that it
returns a value indicating if the dataset exists or not. It also adds
documentation about it in the man page.

OpenZFS-issue: https://www.illumos.org/issues/8605
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1e85e111
2018-02-08 15:28:52 -08:00
Chris Williamson d99a015343 OpenZFS 7431 - ZFS Channel Programs
Authored by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported-by: Don Brady <don.brady@delphix.com>
Ported-by: John Kennedy <john.kennedy@delphix.com>

OpenZFS-issue: https://www.illumos.org/issues/7431
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/dfc11533

Porting Notes:
* The CLI long option arguments for '-t' and '-m' don't parse on linux
* Switched from kmem_alloc to vmem_alloc in zcp_lua_alloc
* Lua implementation is built as its own module (zlua.ko)
* Lua headers consumed directly by zfs code moved to 'include/sys/lua/'
* There is no native setjmp/longjmp available in stock Linux kernel.
  Brought over implementations from illumos and FreeBSD
* The get_temporary_prop() was adapted due to VFS platform differences
* Use of inline functions in lua parser to reduce stack usage per C call
* Skip some ZFS Test Suite ZCP tests on sparc64 to avoid stack overflow
2018-02-08 15:28:18 -08:00
WHR 8824a7f133 OpenZFS 8966 - Source file zfs_acl.c, function zfs_aclset_common contains a use after end of the lifetime of a local variable
Authored by: WHR <msl0000023508@gmail.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8966
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c95549fcdc
Closes #7141
2018-02-08 10:33:45 -08:00
Matthew Thode 6f259b59cf Only run pre-mount hook zfs-load-key on systemd
Reviewed-by: Kash Pande <kash@tripleback.net>
Reviewed-by: bunder2015 <omfgbunder@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Thode <mthode@mthode.org>
Closes #7136 
Closes #7140
2018-02-07 18:31:54 -08:00
LOLi 9ca25e709b Verify ZFS Test Suite scripts executability
This change adds a make target 'testscheck' which verifies all ksh test
scripts, part of the ZFS Test Suite, have their executable bit set:
this should avoid adding new test scripts that cannot be executed by
the test-runner.py which would result in the following warning message:

   Warning: Test removed from TestGroup because it failed verification.

This also verifies both *.cfg and *.kshlib files used in the ZFS Test
Suite are not executable.
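
The rule being enforced can be sketched as below. This is not the
'testscheck' make target itself, only an illustration of the same check;
check_exec_bits() is a hypothetical helper name.

```python
import stat
from pathlib import Path


def check_exec_bits(root):
    """Return a list of files under root whose executable bit is wrong."""
    problems = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        is_exec = bool(path.stat().st_mode & stat.S_IXUSR)
        if path.suffix == ".ksh" and not is_exec:
            # Test scripts must be executable or the runner drops them.
            problems.append(f"{path.name}: not executable")
        elif path.suffix in (".cfg", ".kshlib") and is_exec:
            # Config and library files are sourced, never executed.
            problems.append(f"{path.name}: should not be executable")
    return problems
```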

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7131
2018-02-07 12:43:24 -08:00
Richard Elling 6b810d04bd Remove deprecated zfs_arc_p_aggressive_disable
zfs_arc_p_aggressive_disable is no more. This PR removes docs
and module parameters for zfs_arc_p_aggressive_disable.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Richard Elling <Richard.Elling@RichardElling.com>
Closes #7135
2018-02-07 11:54:20 -08:00
Brian Behlendorf 48ef8ba070 Split spl-build.m4
Split the kernel interface configure checks in to separate m4
macro files.  This is intended to facilitate moving the spl
source code in to the zfs repository.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #682
2018-02-07 11:50:24 -08:00
Brian Behlendorf 5461eefe50 Fix cstyle warnings
This patch contains no functional changes.  It is solely intended
to resolve cstyle warnings in order to facilitate moving the spl
source code in to the zfs repository.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #681
2018-02-07 11:49:38 -08:00
Brian Behlendorf 6d82b79699 Add zfs-load-key.sh to .gitignore
The generated zfs-load-key.sh file should have been added to
the .gitignore file as part of commit 7da8f8d8.  And the
generated file should not be included in the repo.

Reviewed-by: Matthew Thode <prometheanfire@gentoo.org>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7134
2018-02-06 16:39:18 -08:00
DeHackEd 0cf3430d2d Fix zpool status overflow on fast scrubs
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: DHE <git@dehacked.net>
Closes #7133
2018-02-06 16:35:16 -08:00
Brian Behlendorf 3d25488afb Fix default libdir for Debian/Ubuntu
The distribution provided architecture specific RPM macro files
for x86_64 and other architectures on Debian/Ubuntu specify the
wrong default libdir install location.  When building deb packages
override _lib with the correct location.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7083 
Closes #7101
2018-02-05 20:42:52 -08:00
Tom Caputi 2b84817f66 Adjust ARC prefetch tunables to match docs
Currently, the ARC exposes 2 tunables (zfs_arc_min_prefetch_ms
and zfs_arc_min_prescient_prefetch_ms) which are documented
to be specified in milliseconds. However, the code actually
uses the values as though they were in seconds. This patch
adjusts the code to match the names and documentation of the
tunables.
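
To illustrate the size of the mismatch, here is a small sketch of the two
interpretations. The tick rate HZ and the tunable value used below are
assumptions for the example, not the module defaults.

```python
HZ = 1000  # assumed clock ticks per second


def msec_to_ticks(ms, hz=HZ):
    # Intended interpretation: the tunable is in milliseconds (rounded up).
    return (ms * hz + 999) // 1000


def sec_to_ticks(sec, hz=HZ):
    # Buggy interpretation: the same number treated as seconds.
    return sec * hz


tunable = 1000  # hypothetical setting, documented as milliseconds
intended = msec_to_ticks(tunable)  # one second worth of ticks
actual = sec_to_ticks(tunable)     # one thousand seconds worth of ticks
```

With these assumptions the misread value keeps prefetched buffers 1000
times longer than the documentation suggests.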

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7126
2018-02-05 16:57:53 -08:00
Brian Behlendorf 60b8207496 Set persistent ztest failure mode
In order to reliably detect deadlocks in the create and import
path ztest should set the failure mode property.  This ensures
that the pool is always using the correct failmode behavior.

Removed insidious use of local variable in MAXFAULTS macro.

Converted VERIFY() to VERIFY0() where appropriate.

Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7111
2018-02-05 12:00:26 -08:00
wli5 f6b58faaa6 Bug fix in qat_compress.c for vmalloc addr check
Remove the unused vmalloc address check; the mem_to_page function
handles non-vmalloc addresses when mapping them to a physical
address.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Weigang Li <weigang.li@intel.com>
Closes #7125
2018-02-05 10:26:27 -08:00
Brian Behlendorf 0d23f5e2e4 Fix hash_lock / keystore.sk_dk_lock lock inversion
The keystore.sk_dk_lock should not be held while performing I/O.
Drop the lock when reading from disk and update the code so
that the first successful caller adds the key.

Improve error handling in spa_keystore_create_mapping_impl().

Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed-by: RageLtMan <rageltman@sempervictus>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7112 
Closes #7115
2018-02-04 14:07:13 -08:00
LOLi fbd4254268 Fix systemd_ RPM macros usage on Debian-based distributions
Debian-based distributions do not seem to provide RPM macros for
dealing with systemd pre- and post- (un)install actions: this results
in errors when installing or upgrading .deb packages because the
resulting control scripts contain the following unresolved macros:

 * %systemd_post
 * %systemd_preun
 * %systemd_postun

Fix this by providing default values for postinstall, preuninstall and
postuninstall scripts when these macros are not defined.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7074 
Closes #7100
2018-02-02 13:50:42 -08:00
Tom Caputi 1b66810bad Change os->os_next_write_raw to work per txg
Currently, os_next_write_raw is a single boolean used for determining
whether or not the next call to dmu_objset_sync() should write out
the objset_phys_t as a raw buffer. Since the boolean is not associated
with a txg, the work simply happens during the next txg, which is not
necessarily the correct one. In the current implementation this issue
was misdiagnosed, resulting in a small hack in dmu_objset_sync() which
seemed to resolve the problem.

This patch changes os_next_write_raw to be an array of booleans, one
for each txg in TXG_OFF and removes the hack.
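
A rough model of the per-txg flag, assuming a small ring of slots indexed
by txg & TXG_MASK. The constants and the Objset class below are
illustrative assumptions, not the kernel structures.

```python
TXG_SIZE = 4             # assumed number of per-txg slots
TXG_MASK = TXG_SIZE - 1


class Objset:
    def __init__(self):
        # One flag per in-flight txg instead of a single shared boolean.
        self.next_write_raw = [False] * TXG_SIZE

    def request_raw_write(self, txg):
        self.next_write_raw[txg & TXG_MASK] = True

    def sync(self, txg):
        """Return True if this txg's sync should write the objset raw."""
        slot = txg & TXG_MASK
        raw = self.next_write_raw[slot]
        self.next_write_raw[slot] = False  # consume the request
        return raw
```

With a single boolean, a request made for txg 10 would be consumed by
whichever txg synced next; with one slot per txg it is only seen by the
sync of txg 10.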

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6864
2018-02-02 11:44:53 -08:00
Tom Caputi 047116ac76 Raw sends must be able to decrease nlevels
Currently, when a raw zfs send file includes a DRR_OBJECT record
that would decrease the number of levels of an existing object,
the object is reallocated with dmu_object_reclaim() which
creates the new dnode using the old object's nlevels. For non-raw
sends this doesn't really matter, but raw sends require that
nlevels on the receive side match that of the send side so that
the checksum-of-MAC tree can be properly maintained. This patch
corrects the issue by freeing the object completely before
allocating it again in this case.

This patch also corrects several issues with dnode_hold_impl()
and related functions that prevented dnodes (particularly
multi-slot dnodes) from being reallocated properly due to
the fact that existing dnodes were not being fully cleaned up
when they were freed.

This patch adds a test to make sure that zfs recv functions
properly with incremental streams containing dnodes of different
sizes.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6821
Closes #6864
2018-02-02 11:43:11 -08:00
Tom Caputi d53bd7f524 Fix recovery import (-F) with encrypted pool
When performing zil_claim() at pool import time, it is
important that encrypted datasets set os_next_write_raw
before writing to the zil_header_t. This prevents the code
from attempting to re-authenticate the objset_phys_t when
it writes it out, which is unnecessary because the
zil_header_t is not protected by either objset MAC and
impossible since the keys aren't loaded yet. Unfortunately,
one of the code paths did not set this flag, which causes
failed ASSERTs during 'zpool import -F'. This patch corrects
this issue.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6864
Closes #6916
2018-02-02 11:39:36 -08:00
Tom Caputi ae76f45cda Encryption Stability and On-Disk Format Fixes
The on-disk format for encrypted datasets protects not only
the encrypted and authenticated blocks themselves, but also
the order and interpretation of these blocks. In order to
make this work while maintaining the ability to do raw
sends, the indirect bps maintain a secure checksum of all
the MACs in the block below it along with a few other
fields that determine how the data is interpreted.

Unfortunately, the current on-disk format erroneously
includes some fields which are not portable and thus cannot
support raw sends. It is not possible to easily work around
this issue due to a separate and much smaller bug which
causes indirect blocks for encrypted dnodes to not be
compressed, which conflicts with the previous bug. In
addition, the current code generates incompatible on-disk
formats on big endian and little endian systems due to an
issue with how block pointers are authenticated. Finally,
raw send streams do not currently include dn_maxblkid when
sending both the metadnode and normal dnodes which are
needed in order to ensure that we are correctly maintaining
the portable objset MAC.

This patch zeroes out the offending fields when computing
the bp MAC and ensures that these MACs are always
calculated in little endian order (regardless of the host
system's byte order). This patch also registers an errata
for the old on-disk format, which we detect by adding a
"version" field to newly created DSL Crypto Keys. We allow
datasets without a version (version 0) to only be mounted
for read so that they can easily be migrated. We also now
include dn_maxblkid in raw send streams to ensure the MAC
can be maintained correctly.
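
The byte-order part of the fix can be illustrated with a short sketch.
HMAC-SHA256 and the field layout here are stand-ins chosen for the
example; they are not the actual ZFS MAC algorithm or block pointer
layout, and bp_mac()/bp_mac_native() are hypothetical names.

```python
import hashlib
import hmac
import struct


def bp_mac(key, fields):
    # "<" forces a little-endian layout regardless of the host byte order,
    # so every system feeds identical bytes into the MAC.
    payload = struct.pack("<%dQ" % len(fields), *fields)
    return hmac.new(key, payload, hashlib.sha256).digest()


def bp_mac_native(key, fields, byteorder):
    # What happens when the in-memory (host order) bytes are MACed directly.
    payload = b"".join(f.to_bytes(8, byteorder) for f in fields)
    return hmac.new(key, payload, hashlib.sha256).digest()


key = b"example-key"
fields = [1, 0x1122334455667788]
```

Here bp_mac_native() gives different results for "little" and "big",
which is the incompatibility described above, while bp_mac() always
matches the little-endian result.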

This patch also contains minor bug fixes and cleanups.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6845
Closes #6864
Closes #7052
2018-02-02 11:37:16 -08:00
Dr. András Korn 4c46b99d24 tx_waited -> tx_dirty_delayed in trace_dmu.h
This change was missed in 0735ecb334.

Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: András Korn <korn-github.com@elan.rulez.org>
Closes #7096
2018-01-31 16:13:26 -08:00
Tom Caputi a73c94934f Change movaps to movups in AES-NI code
Currently, the ICP contains accelerated assembly code to be
used specifically on CPUs with AES-NI enabled. This code
makes heavy use of the movaps instruction which assumes that
it will be provided aes keys that are 16 byte aligned. This
assumption seems to hold on Illumos, but on Linux some kernel
options such as 'slub_debug=P' will violate it. This patch
changes all instances of this instruction to movups which is
the same except that it can handle unaligned memory.

This patch also adds a few flags which were accidentally never
given to the assembly compiler, resulting in objtool warnings.

Reviewed by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Nathaniel R. Lewis <linux.robotdude@gmail.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #7065 
Closes #7108
2018-01-31 15:17:56 -08:00
Brian Behlendorf f90a30ad1b Fix txg_sync_thread hang in scan_exec_io()
When scn->scn_maxinflight_bytes has not been initialized it's
possible to hang on the condition variable in scan_exec_io().
This issue was uncovered by ztest and is only possible when
deduplication is enabled through the following call path.

  txg_sync_thread()
    spa_sync()
      ddt_sync_table()
        ddt_sync_entry()
          dsl_scan_ddt_entry()
            dsl_scan_scrub_cb()
              dsl_scan_enqueuei()
                scan_exec_io()
                  cv_wait()

Resolve the issue by always initializing scn_maxinflight_bytes
to a reasonable minimum value.  This value will be recalculated
in dsl_scan_sync() to pick up changes to zfs_scan_vdev_limit
and the addition/removal of vdevs.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7098
2018-01-31 09:33:33 -08:00
Matthew Thode 1d8a71b603 remove pools without a bootfs from BOOTFS variable
Use the same method used in zfs-load-key.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Llewelyn Trahaearn <WoefulDerelict@GMail.com>
Signed-off-by: Matthew Thode <mthode@mthode.org>
Closes #7089
2018-01-30 15:58:19 -08:00
LOLi bee7e4ff12 Fix 'zfs receive -o' when used with '-e|-d'
When used in conjunction with one of '-e' or '-d' zfs receive options
none of the properties requested to be set (-o) are actually applied:
this is caused by a wrong assumption made about the toplevel dataset
in zfs_receive_one().

Fix this by correctly detecting the toplevel dataset.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7088
2018-01-30 15:54:33 -08:00
bunder2015 405ec516ab Fix zpool-features(5) large_block inconsistency
Large_blocks feature activation was not consistent with man page,
which erroneously stated that the feature was active when the
recordsize was increased past the stock 128KB.  It actually
becomes active when data is written to the dataset.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #6275 
Closes #7093
2018-01-29 15:10:32 -08:00
LOLi 63f88c12b4 Fix style issues in man pages and commands help
* Remove 'zfs snap' from zfs help message (OpenZFS sync)
* Update zfs(8) to suggest 'snap' can be used as an alias for 'snapshot'
* Enforce 80 columns limit in help messages
* Remove zfs_disable_dup_eviction from zfs-module-parameters(5)
* Expose zfs_scan_max_ext_gap as a kernel module parameter.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7087
2018-01-29 15:05:03 -08:00
Giuseppe Di Natale 5e021f56d3 Add dbuf hash and dbuf cache kstats
Introduce kstats about the dbuf hash and dbuf cache
to make it easier to inspect state. This should help
with debugging and understanding of these portions
of the codebase.

Correct format of dbuf kstat file.

Introduce a dbc column to dbufs kstat to indicate if
a dbuf is in the dbuf cache.

Introduce field filtering in the dbufstat python script.

Introduce a no header option to the dbufstat python script.

Introduce a test case to test basic mru->mfu list movement
in the ARC.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6906
2018-01-29 10:24:52 -08:00
Prakash Surya 0735ecb334 OpenZFS 8997 - ztest assertion failure in zil_lwb_write_issue
PROBLEM
=======

When `dmu_tx_assign` is called from `zil_lwb_write_issue`, it's possible
for either `ERESTART` or `EIO` to be returned.

If `ERESTART` is returned, this will cause an assertion to fail directly
in `zil_lwb_write_issue`, where the code assumes the return value is
`EIO` if `dmu_tx_assign` returns a non-zero value. This can occur if the
SPA is suspended when `dmu_tx_assign` is called, and most often occurs
when running `zloop`.

If `EIO` is returned, this can cause assertions to fail elsewhere in the
ZIL code. For example, `zil_commit_waiter_timeout` contains the
following logic:

    lwb_t *nlwb = zil_lwb_write_issue(zilog, lwb);
    ASSERT3S(lwb->lwb_state, !=, LWB_STATE_OPENED);

In this case, if `dmu_tx_assign` returned `EIO` from within
`zil_lwb_write_issue`, the `lwb` variable passed in will not be issued
to disk. Thus, its `lwb_state` field will remain `LWB_STATE_OPENED` and
this assertion will fail. `zil_commit_waiter_timeout` assumes that after
it calls `zil_lwb_write_issue`, the `lwb` will be issued to disk, and
doesn't handle the case where this is not true; i.e. it doesn't handle
the case where `dmu_tx_assign` returns `EIO`.

SOLUTION
========

This change modifies the `dmu_tx_assign` function such that `txg_how` is
a bitmask, rather than of the `txg_how_t` enum type. Now, the previous
`TXG_WAITED` semantics can be used via `TXG_NOTHROTTLE`, along with
specifying either `TXG_NOWAIT` or `TXG_WAIT` semantics.

Previously, when `TXG_WAITED` was specified, `TXG_NOWAIT` semantics was
automatically invoked. This was not ideal when using `TXG_WAITED` within
`zil_lwb_write_issue`, leading to the problem described above. Rather, we
want to achieve the semantics of `TXG_WAIT`, while also preventing the
`tx` from being penalized via the dirty delay throttling.

With this change, `zil_lwb_write_issue` can achieve the semantics that
it requires by passing in the value `TXG_WAIT | TXG_NOTHROTTLE` to
`dmu_tx_assign`.

Further, consumers of `dmu_tx_assign` wishing to achieve the old
`TXG_WAITED` semantics can pass in the value `TXG_NOWAIT | TXG_NOTHROTTLE`.
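
As a rough illustration of the bitmask form, the sketch below shows how the
wait and throttle behaviours can be requested and tested independently. The
numeric flag values and the `ex_` helper names are assumptions made for this
example; the real definitions live in the ZFS headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration only; not copied from the ZFS headers. */
#define	TXG_NOWAIT	(0ULL)
#define	TXG_WAIT	(1ULL << 0)
#define	TXG_NOTHROTTLE	(1ULL << 1)

/* Hypothetical helper: does the caller want to block until assignable? */
static bool
ex_tx_wants_wait(uint64_t txg_how)
{
	return ((txg_how & TXG_WAIT) != 0);
}

/* Hypothetical helper: should the dirty delay throttle be skipped? */
static bool
ex_tx_skips_throttle(uint64_t txg_how)
{
	return ((txg_how & TXG_NOTHROTTLE) != 0);
}
```

With these definitions, `TXG_WAIT | TXG_NOTHROTTLE` satisfies both helpers,
while `TXG_NOWAIT | TXG_NOTHROTTLE` (the old `TXG_WAITED` behaviour) only
satisfies the second.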

Authored by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting Notes:
- Additionally updated `zfs_tmpfile` to use `TXG_NOTHROTTLE`

OpenZFS-issue: https://www.illumos.org/issues/8997
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/19ea6cb0f9
Closes #7084
2018-01-26 20:19:46 -08:00
Chunwei Chen 522db29275 zpool import -d to specify device path
When we know which devices have the pool we are looking for, sometimes
it's better if we can directly pass those device paths to zpool import
instead of letting it search through all unrelated stuff, which might
take a lot of time if you have hundreds of disks.

This patch allows option -d <dev_path> to zpool import. You can have
multiple pairs of -d <dev_path>, and zpool import will only search
through those devices. For example:

    zpool import -d /dev/sda -d /dev/sdb

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #7077
2018-01-26 10:49:46 -08:00
Brian Behlendorf f55a8757a6 Update README.initramfs.markdown
Fix several typos and grammar.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Arno van Wyk <avw1987@users.noreply.github.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7080
2018-01-26 09:55:16 -08:00
Brian Behlendorf cd0a89ded9 Extend zloop.sh for automated testing
In order to debug issues encountered by ztest during automated
testing it's important that as much debugging information as
possible be dumped at the time of the failure.  The following
changes extend the zloop.sh script in order to make it easier
to integrate with buildbot.

* Add the `-m <maximum cores>` option to zloop.sh to place a
  limit on the number of core dumps generated.  By default, the
  existing behavior is maintained and no limit is set.

* Add the `-l` option to create a 'ztest.core.N' symlink in the
  current directory to the core directory. This functionality
  is provided primarily for buildbot which expects log files to
  have well known names.

* Rename 'ztest.ddt' to 'ztest.zdb' and extend it to dump
  additional basic information on failure for later analysis.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6999
2018-01-25 13:42:34 -08:00
Brian Behlendorf bb25362553 Prevent zdb(8) from occasionally hanging on I/O
The zdb(8) command may not terminate in the case where the pool
gets suspended and there is a caller in zio_wait() blocking on
an outstanding read I/O that will never complete.  This can in
turn cause ztest(1) to block indefinitely despite the deadman.

Resolve the issue by setting the default failure mode for zdb(8)
to panic.  In user space we always want the command to terminate
when forward progress is no longer possible.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6999
2018-01-25 13:41:51 -08:00
Brian Behlendorf 8fb1ede146 Extend deadman logic
The intent of this patch is to extend the existing deadman code
such that it's flexible enough to be used by both ztest and
on production systems.  The proposed changes include:

* Added a new `zfs_deadman_failmode` module option which is
  used to dynamically control the behavior of the deadman.  It's
  loosely modeled after, but independent from, the pool failmode
  property.  It can be set to wait, continue, or panic.

    * wait     - Wait for the "hung" I/O (default)
    * continue - Attempt to recover from a "hung" I/O
    * panic    - Panic the system

* Added a new `zfs_deadman_ziotime_ms` module option which is
  analogous to `zfs_deadman_synctime_ms` except instead of
  applying to a pool TXG sync it applies to zio_wait().  A
  default value of 300s is used to define a "hung" zio.

* The ztest deadman thread has been re-enabled by default,
  aligned with the upstream OpenZFS code, and then extended
  to terminate the process when it takes significantly longer
  to complete than expected.

* The -G option was added to ztest to print the internal debug
  log when a fatal error is encountered.  This same option was
  previously added to zdb in commit fa603f82.  Update zloop.sh
  to unconditionally pass -G to obtain additional debugging.

* The FM_EREPORT_ZFS_DELAY event which was previously posted
  when the deadman detects a "hung" pool has been replaced by
  a new dedicated FM_EREPORT_ZFS_DEADMAN event.

* The proposed recovery logic attempts to restart a "hung"
  zio by calling zio_interrupt() on any outstanding leaf zios.
  We may want to further restrict this to zios in either the
  ZIO_STAGE_VDEV_IO_START or ZIO_STAGE_VDEV_IO_DONE stages.
  Calling zio_interrupt() is expected to only be useful for
  cases when an IO has been submitted to the physical device
  but for some reason the completion callback hasn't been
  called by the lower layers.  This shouldn't be possible but
  has been observed and may be caused by kernel/driver bugs.

* The 'zfs_deadman_synctime_ms' default value was reduced from
  1000s to 600s.

* Depending on how ztest fails there may be no cache file to
  move.  This should not be considered fatal, collect the logs
  which are available and carry on.

* Add deadman test cases for spa_deadman() and zio_wait().

* Increase default zfs_deadman_checktime_ms to 60s.
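
A minimal sketch of the two new tunables described in the list above,
assuming the failmode is supplied as a string and the zio age is tracked in
milliseconds. All `ex_` names are hypothetical and are not the identifiers
used by this patch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical enum mirroring the three documented failmode values. */
typedef enum {
	EX_FAILMODE_INVALID = -1,
	EX_FAILMODE_WAIT,	/* wait for the "hung" I/O (default) */
	EX_FAILMODE_CONTINUE,	/* attempt to recover from a "hung" I/O */
	EX_FAILMODE_PANIC	/* panic the system */
} ex_failmode_t;

/* Map a module option string to the enum; unknown strings are rejected. */
static ex_failmode_t
ex_parse_failmode(const char *s)
{
	if (strcmp(s, "wait") == 0)
		return (EX_FAILMODE_WAIT);
	if (strcmp(s, "continue") == 0)
		return (EX_FAILMODE_CONTINUE);
	if (strcmp(s, "panic") == 0)
		return (EX_FAILMODE_PANIC);
	return (EX_FAILMODE_INVALID);
}

/* A zio counts as "hung" once it has been outstanding longer than the limit. */
static bool
ex_zio_is_hung(uint64_t now_ms, uint64_t start_ms, uint64_t ziotime_ms)
{
	return ((now_ms - start_ms) > ziotime_ms);
}
```

With the 300s default, `ex_zio_is_hung(now, start, 300000)` only becomes true
after the zio has been outstanding for more than five minutes.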

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6999
2018-01-25 13:40:38 -08:00
Andriy Gapon 1b18c6d791 OpenZFS 8731 - ASSERT3U(nui64s, <=, UINT16_MAX) fails for large blocks
Authored by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8731
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4c08500788
Closes #7079
2018-01-25 10:02:11 -08:00
Giuseppe Di Natale cf232b53d5 Revert "Remove wrong ASSERT in annotate_ecksum"
This reverts commit 093911f194.

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #7079
2018-01-25 10:01:02 -08:00
Brian Behlendorf 23602fdb39 Add cv_timedwait_io()
Add missing helper function cv_timedwait_io(), it should be used
when waiting on IO with a specified timeout.

Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #674
2018-01-24 11:33:47 -08:00
Allan Jude 6bc4a2376c OpenZFS 8972 - zfs holds: In scripted mode, do not pad columns with spaces
Authored by: Allan Jude <allanjude@freebsd.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8972
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3aace5c077
Closes #7063
2018-01-19 09:36:17 -08:00
Alexander Motin 916384729e OpenZFS 8835 - Speculative prefetch in ZFS not working for misaligned reads
In the case of misaligned I/O, sequential requests are not detected as such
due to overlaps in logical block sequence:

    dmu_zfetch(fffff80198dd0ae0, 27347, 9, 1)
    dmu_zfetch(fffff80198dd0ae0, 27355, 9, 1)
    dmu_zfetch(fffff80198dd0ae0, 27363, 9, 1)
    dmu_zfetch(fffff80198dd0ae0, 27371, 9, 1)
    dmu_zfetch(fffff80198dd0ae0, 27379, 9, 1)
    dmu_zfetch(fffff80198dd0ae0, 27387, 9, 1)

This patch makes a single-block overlap count as a stream hit,
improving performance up to several times.
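
The idea can be sketched as follows, assuming the second and third arguments
in the trace above are the starting block id and the block count. The
`ex_stream_t` type and its `next_blkid` field are hypothetical stand-ins for
the real prefetch stream state.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stream state: the block id expected by the next request. */
typedef struct ex_stream {
	uint64_t next_blkid;
} ex_stream_t;

/*
 * Treat a request as sequential if it starts exactly where the stream
 * expects, or one block earlier (a single-block overlap).
 */
static bool
ex_stream_hit(ex_stream_t *zs, uint64_t blkid, uint64_t nblks)
{
	if (blkid == zs->next_blkid || blkid + 1 == zs->next_blkid) {
		zs->next_blkid = blkid + nblks;
		return (true);
	}
	return (false);
}
```

In the trace, the request at 27347 for 9 blocks leaves the stream expecting
27356, so the following request at 27355 only matches through the
one-block-overlap case.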

Authored by: Alexander Motin <mav@FreeBSD.org>
Approved by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8835
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/aab6dd482a
Closes #7062
2018-01-19 09:31:29 -08:00
Brian Behlendorf 31864e3d8c OpenZFS 8652 - Tautological comparisons with ZPROP_INVAL
usr/src/uts/common/sys/fs/zfs.h
	Change ZPROP_INVAL and ZPROP_CONT from macros to enum values.  Clang
	and GCC both prefer to use unsigned ints to store enums.  That was
	causing tautological comparison warnings (and likely eliminating
	error handling code at compile time) whenever a zfs_prop_t or
	zpool_prop_t was compared to ZPROP_INVAL or ZPROP_CONT.  Making the
	error flags be explicitly enum values forces the enum types to be
	signed.

	ZPROP_INVAL was also compared against two different enum types.  I
	had to change its name to ZPOOL_PROP_INVAL whenever it's compared to
	a zpool_prop_t.  There are still some places where ZPROP_INVAL or
	ZPROP_CONT is compared to a plain int, in code that doesn't know
	whether the int is storing a zfs_prop_t or a zpool_prop_t.
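
A small sketch of the effect, using a hypothetical `ex_prop_t` enum rather
than the real property types. Because the enum declares negative enumerators,
its underlying type must be able to represent them, so a comparison against
the invalid marker is meaningful instead of tautological.

```c
#include <string.h>

/* Hypothetical enum; the error markers are members, not separate macros. */
typedef enum {
	EX_PROP_CONT = -2,
	EX_PROP_INVAL = -1,
	EX_PROP_TYPE = 0,
	EX_PROP_CREATION
} ex_prop_t;

/* Look up a property by name, returning EX_PROP_INVAL when unknown. */
static ex_prop_t
ex_name_to_prop(const char *name)
{
	if (strcmp(name, "type") == 0)
		return (EX_PROP_TYPE);
	if (strcmp(name, "creation") == 0)
		return (EX_PROP_CREATION);
	return (EX_PROP_INVAL);
}
```

A caller can then write `if (ex_name_to_prop(name) == EX_PROP_INVAL)` and rely
on the error path being kept by the compiler.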

usr/src/uts/common/fs/zfs/spa.c
	s/ZPROP_INVAL/ZPOOL_PROP_INVAL/

Authored by: Alan Somers <asomers@gmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: George Melikov <mail@gmelikov.ru>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8652
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c2de80dc74
Closes #7061
2018-01-19 09:22:37 -08:00
Brian Behlendorf 1574c73bd0 OpenZFS 8641 - "zpool clear" and "zinject" don't work on "spare" or "replacing" vdevs
Add "spare" and "replacing" to the list of interior vdev types in
zpool_vdev_is_interior(), alongside the existing "mirror" and "raidz".
This fixes running "zinject -d" and "zpool clear" on spare and replacing
vdevs.
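
The check might look roughly like the sketch below, which matches a vdev name
by its type prefix. The function name `ex_vdev_is_interior` and the literal
strings are assumptions for illustration; the real function presumably uses
the existing vdev type constants.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true when the name starts with one of the interior vdev types. */
static bool
ex_vdev_is_interior(const char *name)
{
	static const char *const types[] = {
		"mirror", "raidz", "spare", "replacing"
	};

	for (size_t i = 0; i < sizeof (types) / sizeof (types[0]); i++) {
		if (strncmp(name, types[i], strlen(types[i])) == 0)
			return (true);
	}
	return (false);
}
```

Names such as "spare-0" and "replacing-0" are then treated the same way as
"mirror-0" or "raidz1-0", while a leaf path like "/dev/sda" is not.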

Authored by: Alan Somers <asomers@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8641
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9a36801382
Closes #7060
2018-01-19 09:20:58 -08:00
Matthew Thode 7da8f8d81b Run zfs load-key if needed in dracut
'zfs load-key -a' will only be called if needed.  If a dataset not
needed for boot does not have its key loaded (home directories for
example) boot can still continue.

zfs:AUTO was not working via dracut, so we still need the generator
script to do its thing.

Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Manuel Amador (Rudd-O) <rudd-o@rudd-o.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Matthew Thode <mthode@mthode.org>
Closes #6982 
Closes #7004
2018-01-18 10:20:34 -08:00
LOLi 79c3270476 Fix Debian packaging on ARMv7/ARM64
When building packages on Debian-based systems specify the target
architecture used by 'alien' to convert .rpm packages into .deb: this
avoids detecting an incorrect value which results in the following
errors:

<package>.aarch64.rpm is for architecture aarch64 ; the package cannot be built on this system
<package>.armv7l.rpm is for architecture armel ; the package cannot be built on this system

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #7046 
Closes #7058
2018-01-18 10:15:41 -08:00
LOLi fb79036f28 Fix Debian packaging on ARMv7/ARM64
When building packages on Debian-based systems specify the target
architecture used by 'alien' to convert .rpm packages into .deb: this
avoids detecting an incorrect value which results in the following
errors:

<package>.aarch64.rpm is for architecture aarch64 ; the package cannot be built on this system
<package>.armv7l.rpm is for architecture armel ; the package cannot be built on this system

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes zfsonlinux/zfs#7046 
Closes #678
2018-01-18 10:14:18 -08:00
John L. Hammond 51d1b58ef3 Emit an error message before MMP suspends pool
In mmp_thread(), emit an MMP specific error message before calling
zio_suspend() so that the administrator will understand why the pool
is being suspended.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John L. Hammond <john.hammond@intel.com>
Closes #7048
2018-01-17 12:24:42 -08:00
Sean Eric Fagan 43cb30b3ce OpenZFS 8959 - Add notifications when a scrub is paused or resumed
Authored by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

Porting Notes:
- Brought #defines in eventdefs.h in line with ZFS on Linux format.
- Updated zfs-events.5 with the new events.

OpenZFS-issue: https://www.illumos.org/issues/8959
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c862b93eea
Closes #7049
2018-01-17 10:31:00 -08:00
Brian Behlendorf 3da3488e63 Fix shellcheck v0.4.6 warnings
Resolve new warnings reported after upgrading to shellcheck
version 0.4.6.  This patch contains no functional changes.

* egrep is non-standard and deprecated. Use grep -E instead. [SC2196]
* Check exit code directly with e.g. 'if mycmd;', not indirectly
  with $?.  [SC2181]  Suppressed.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7040
2018-01-17 10:17:16 -08:00
DeHackEd d658b2caa9 Remove l2arc_nocompress from zfs-module-parameters(5)
Parameter was removed in d3c2ae1c08
(OpenZFS 6950 - ARC should cache compressed data)

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes #7043
2018-01-16 10:18:08 -08:00
Matthew Thode c10cdcb55f Fix copy-builtin to work with ASAN patch
Commit fed90353 didn't fully update the copy-builtin script
as needed to perform in-kernel builds.  Add the missing
options and flags.

Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Thode <mthode@mthode.org>
Closes #7033 
Closes #7037
2018-01-12 09:39:36 -08:00
Brian Behlendorf e1a0850c35 Force ztest to always use /dev/urandom
For ztest, which is solely for testing, using a pseudo-random source
is entirely reasonable.  Using /dev/urandom ensures the system
entropy pool doesn't get depleted thus stalling the testing.
This is a particular problem when testing in VMs.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7017 
Closes #7036
2018-01-12 09:36:26 -08:00
Yuri Pankov 6df9f8ebd7 OpenZFS 8899 - zpool list property documentation doesn't match actual behaviour
Authored by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Alexander Pyhalov <alp@rsu.ru>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8899
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b0e142e57d
Closes #7032
2018-01-11 13:54:34 -08:00
Yuri Pankov bcb1a8a25e OpenZFS 8898 - creating fs with checksum=skein on the boot pools fails ungracefully
Authored by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8898
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9fa2266d9a
Closes #7031
2018-01-11 13:53:04 -08:00
Yuri Pankov 8198c57b21 OpenZFS 8897 - zpool online -e fails assertion when run on non-leaf vdevs
Authored by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8897
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9a551dd645
Closes #7030
2018-01-11 13:52:03 -08:00
Andriy Gapon 6a2185660d OpenZFS 8930 - zfs_zinactive: do not remove the node if the filesystem is readonly
Authored by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8930
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/93c618e0f4
Closes #7029
2018-01-11 13:50:08 -08:00
Brian Behlendorf fed90353d7 Support -fsanitize=address with --enable-asan
When --enable-asan is provided to configure then build all user
space components with fsanitize=address.  For kernel support
use the Linux KASAN feature instead.

https://github.com/google/sanitizers/wiki/AddressSanitizer

When using gcc version 4.8 any test case which intentionally
generates a core dump will fail when using --enable-asan.
The default behavior is to disable core dumps and only newer
versions allow this behavior to be controlled at run time with
the ASAN_OPTIONS environment variable.

Additionally, this patch includes some build system cleanup.

* Rules.am updated to set the minimum AM_CFLAGS, AM_CPPFLAGS,
  and AM_LDFLAGS.  Any additional flags should be added on a
  per-Makefile basis.  The --enable-debug and --enable-asan
  options apply to all user space binaries and libraries.

* Compiler checks consolidated in always-compiler-options.m4
  and renamed for consistency.

* -fstack-check compiler flag was removed, this functionality
  is provided by asan when configured with --enable-asan.

* Split DEBUG_CFLAGS into DEBUG_CFLAGS, DEBUG_CPPFLAGS, and
  DEBUG_LDFLAGS.

* Moved default kernel build flags into module/Makefile.in and
  split into ZFS_MODULE_CFLAGS and ZFS_MODULE_CPPFLAGS.  These
  flags are set with the standard ccflags-y kbuild mechanism.

* -Wframe-larger-than checks applied only to binaries or
  libraries which include source files which are built in
  both user space and kernel space.  This restriction is
  relaxed for user space only utilities.

* -Wno-unused-but-set-variable applied only to libzfs and
  libzpool.  The remaining warnings are the result of an
  ASSERT using a variable which is always declared.

* -D_POSIX_PTHREAD_SEMANTICS and -D__EXTENSIONS__ dropped
  because they are Solaris specific and thus not needed.

* Ensure $GDB is defined as gdb by default in zloop.sh.

Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7027
2018-01-10 10:49:27 -08:00
Brian Behlendorf 7e7f513277 Disable history_004_pos
Occasionally observed failure of history_004_pos due to the test
case not being 100% reliable.  In order to prevent false positives
disable this test case until it can be made reliable.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #7026 
Closes #7028
2018-01-10 10:41:30 -08:00
Richard Yao 1d53657bf5 Fix incompatibility with Reiser4 patched kernels
In ZFSOnLinux, our sources and build system are self contained such that
we do not need to make changes to the Linux kernel sources. Reiser4 on
the other hand exists solely as a kernel tree patch and opts to make
changes to the kernel rather than adapt to it. After Linux 4.1 made a
VFS change that replaced new_sync_read with do_sync_read, Reiser4's
maintainer decided to modify the kernel VFS to export the old function.
This caused our autotools check to misidentify the kernel API as
predating Linux 4.1 on kernels that have been patched with Reiser4
support, which breaks our build.

Reiser4 really should be patched to stop doing this, but let's modify our
check to be more strict to help the affected users of both filesystems.

Also, we were not checking the types of arguments and return value of
new_sync_read() and new_sync_write(). Let's fix that too.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #6241 
Closes #7021
2018-01-09 16:18:19 -08:00
Alex Zhuravlev 3910184d9e Use zap_count instead of cached z_size for unlink
As a performance optimization Lustre does not strictly update
the SA_ZPL_SIZE when adding/removing from non-directory entries.
This results in entries which cannot be removed through the ZPL
layer even though the ZAP is empty and safe to remove.

Resolve this issue by checking the zap_count() directly instead
of relying on the cached SA_ZPL_SIZE.  Micro-benchmarks show no
significant performance impact due to the additional overhead
of using zap_count().

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Alex Zhuravlev <alexey.zhuravlev@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #7019
2018-01-09 16:16:07 -08:00
Nathaniel Wesley Filardo cba6fc61a2 Revert raidz_map and _col structure types
As part of the refactoring of ab9f4b0b82,
several uint64_t-s and uint8_t-s were changed to other types.  This
caused ZoL github issue #6981, an overflow of a size_t on a 32-bit ARM
machine.  In absence of any strong motivation for the type changes, this
simply puts them back, modulo the changes accumulated for ABD.

Compile-tested on amd64 and run-tested on armhf.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Nathaniel Wesley Filardo <nwf@cs.jhu.edu>
Closes #6981 
Closes #7023
2018-01-09 14:46:52 -08:00
Brian Behlendorf bfe27ace0d Fix unused variable warnings
Resolved unused variable warnings observed after restricting
-Wno-unused-but-set-variable to only libzfs and libzpool.

Reviewed-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6941
2018-01-09 12:28:03 -08:00
Brian Behlendorf 06401e4222 Fix ztest_verify_dnode_bt() test case
In ztest_verify_dnode_bt the ztest_object_lock must be held in
order to safely verify the unused bonus space.

Reviewed-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6941
2018-01-09 12:27:12 -08:00
DHE 460f239e69 Fix -fsanitize=address memory leak
kmem_alloc(0, ...) in userspace returns a leakable pointer.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Issue #6941
2018-01-09 12:26:25 -08:00
George Amanakis be54a13c3e Fix percentage styling in zfs-module-parameters.5
Replace "percent" with "%", add bold to default values.

Reviewed-by: bunder2015 <omfgbunder@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #7018
2018-01-09 11:51:11 -08:00
Brian Behlendorf b02becaa00 Reduce codecov PR comments
Attempt to reduce the number of comments posted by codecov
to PR requests.  Based on the codecov documentation setting
"require_changes=yes" and "behavior=once" should result in
a single comment under most circumstances.

https://docs.codecov.io/v4.3.6/docs/pull-request-comments

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #7022 
Closes #7025
2018-01-09 11:15:55 -08:00
Nathaniel Wesley Filardo 8b20a9f996 zhack: fix getopt return type
This fixes zhack's command processing on ARM.  On ARM char
is unsigned, and so, in promotion to an int, it will never
compare equal to -1.
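
The sketch below reproduces the pitfall without calling getopt itself. It
uses `unsigned char` explicitly to stand in for plain `char` on ARM, and the
`ex_` function names are hypothetical.

```c
#include <stdbool.h>

/*
 * Mimics `char c = getopt(...)` where plain char is unsigned: the -1
 * end marker is truncated to 255, and promoting it back to int yields
 * 255 rather than -1.
 */
static bool
ex_end_seen_with_uchar(int getopt_result)
{
	unsigned char c = (unsigned char)getopt_result;
	int promoted = c;	/* integer promotion, value in 0..255 */

	return (promoted == -1);
}

/* Storing the result in an int keeps the -1 end marker intact. */
static bool
ex_end_seen_with_int(int getopt_result)
{
	int c = getopt_result;

	return (c == -1);
}
```

With the unsigned variant the option loop never sees the end marker, which
matches the broken command processing described above; the int variant
terminates as expected.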

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Nathaniel Wesley Filardo <nwf@cs.jhu.edu>
Closes #7016
2018-01-09 11:14:45 -08:00
Brian Behlendorf 0873bb6337 Fix ARC hit rate
When the compressed ARC feature was added in commit d3c2ae1
the method of reference counting in the ARC was modified.  As
part of this accounting change the arc_buf_add_ref() function
was removed entirely.

This would have been fine but the arc_buf_add_ref() function
served a second undocumented purpose of updating the ARC access
information when taking a hold on a dbuf.  Without this logic
in place a cached dbuf would not migrate its associated
arc_buf_hdr_t to the MFU list.  This would negatively impact
the ARC hit rate, particularly on systems with a small ARC.

This change reinstates the missing call to arc_access() from
dbuf_hold() by implementing a new arc_buf_access() function.
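
A heavily simplified model of the behaviour being restored is sketched below.
The `ex_` types and functions are hypothetical, and the real arc_access()
logic considers more states and timing than shown here.

```c
#include <stdbool.h>

/* Hypothetical two-list model of the ARC. */
typedef enum { EX_ARC_MRU, EX_ARC_MFU } ex_arc_state_t;

typedef struct ex_hdr {
	ex_arc_state_t state;
	unsigned int hits;
} ex_hdr_t;

/* Record an access; a repeat access promotes the header from MRU to MFU. */
static void
ex_arc_access(ex_hdr_t *hdr)
{
	hdr->hits++;
	if (hdr->state == EX_ARC_MRU)
		hdr->state = EX_ARC_MFU;
}

/* Taking a hold on an already cached dbuf must also record the access. */
static void
ex_dbuf_hold(ex_hdr_t *hdr, bool cached)
{
	if (cached)
		ex_arc_access(hdr);
}
```

If the hold path skips the access step, a frequently held buffer stays on the
MRU list in this model, which is the symptom described above.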

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6171 
Closes #6852 
Closes #6989
2018-01-08 09:52:36 -08:00
LOLi 390d679acd Fix 'zpool add' handling of nested interior VDEVs
When replacing a faulted device which was previously handled by a spare
multiple levels of nested interior VDEVs will be present in the pool
configuration; the following example illustrates one of the possible
situations:

   NAME                          STATE     READ WRITE CKSUM
   testpool                      DEGRADED     0     0     0
     raidz1-0                    DEGRADED     0     0     0
       spare-0                   DEGRADED     0     0     0
         replacing-0             DEGRADED     0     0     0
           /var/tmp/fault-dev    UNAVAIL      0     0     0  cannot open
           /var/tmp/replace-dev  ONLINE       0     0     0
         /var/tmp/spare-dev1     ONLINE       0     0     0
       /var/tmp/safe-dev         ONLINE       0     0     0
   spares
     /var/tmp/spare-dev1         INUSE     currently in use

This is safe and allowed, but get_replication() needs to handle this
situation gracefully to let zpool add new devices to the pool.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6678 
Closes #6996
2017-12-28 10:15:32 -08:00
Prakash Surya 2fe61a7ecc OpenZFS 8909 - 8585 can cause a use-after-free kernel panic
Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: John Kennedy <jwk404@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Prakash Surya <prakash.surya@delphix.com>

PROBLEM
=======

There's a race condition that exists if `zil_free_lwb` races with either
`zil_commit_waiter_timeout` and/or `zil_lwb_flush_vdevs_done`.

Here's an example panic due to this bug:

    > ::status
    debugging crash dump vmcore.0 (64-bit) from ip-10-110-205-40
    operating system: 5.11 dlpx-5.2.2.0_2017-12-04-17-28-32b6ba51fb (i86pc)
    image uuid: 4af0edfb-e58e-6ed8-cafc-d3e9167c7513
    panic message:
    BAD TRAP: type=e (#pf Page fault) rp=ffffff0010555970 addr=60 occurred in module "zfs" due to a NULL pointer dereference
    dump content: kernel pages only

    > $c
    zio_shrink+0x12()
    zil_lwb_write_issue+0x30d(ffffff03dcd15cc0, ffffff03e0730e20)
    zil_commit_waiter_timeout+0xa2(ffffff03dcd15cc0, ffffff03d97ffcf8)
    zil_commit_waiter+0xf3(ffffff03dcd15cc0, ffffff03d97ffcf8)
    zil_commit+0x80(ffffff03dcd15cc0, 9a9)
    zfs_write+0xc34(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
    fop_write+0x5b(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
    write+0x250(42, fffffd7ff4832000, 2000)
    sys_syscall+0x177()

If there's an outstanding lwb that's in `zil_commit_waiter_timeout`
waiting to timeout, waiting on its waiter's CV, we must be sure not to
call `zil_free_lwb`. If we end up calling `zil_free_lwb`, then that LWB
may be freed and can result in a use-after-free situation where the
stale lwb pointer stored in the `zil_commit_waiter_t` structure of the
thread waiting on the waiter's CV is used.

A similar situation can occur if an lwb is issued to disk, and thus in
the `LWB_STATE_ISSUED` state, and `zil_free_lwb` is called while the
disk is servicing that lwb. In this situation, the lwb will be freed by
`zil_free_lwb`, which will result in a use-after-free situation when the
lwb's zio completes, and `zil_lwb_flush_vdevs_done` is called.

This race condition is prevented in `zil_close` by calling `zil_commit`
before `zil_free_lwb` is called, which will ensure all outstanding (i.e.
all lwb's in the `LWB_STATE_OPENED` and/or `LWB_STATE_ISSUED` states)
reach the `LWB_STATE_DONE` state before the lwb's are freed
(`zil_commit` will not return until all the lwb's are
`LWB_STATE_DONE`).

Further, this race condition is prevented in `zil_sync` by only calling
`zil_free_lwb` for lwb's that do not have their `lwb_buf` pointer set.
All lwb's not in the `LWB_STATE_DONE` state will have a non-null value
for this pointer; the pointer is only cleared in
`zil_lwb_flush_vdevs_done`, at which point the lwb's state will be
changed to `LWB_STATE_DONE`.

This race *is* present in `zil_suspend`, leading to this bug.

At first glance, it would appear as though this would not be true
because `zil_suspend` will call `zil_commit`, just like `zil_close`, but
the problem is that `zil_suspend` will set the zilog's `zl_suspend`
field prior to calling `zil_commit`. Further, in `zil_commit`, if
`zl_suspend` is set, `zil_commit` will take a special branch of logic
and use `txg_wait_synced` instead of performing the normal `zil_commit`
logic.

This call to `txg_wait_synced` might be good enough for the data to
reach disk safely before it returns, but it does not ensure that all
outstanding lwb's reach the `LWB_STATE_DONE` state before it returns.
This is because, if there's an lwb "stuck" in
`zil_commit_waiter_timeout`, waiting for its lwb to time out, it will
maintain a non-null value for its `lwb_buf` field and thus `zil_sync`
will not free that lwb. Thus, even though the lwb's data is already on
disk, the lwb will be left lingering, waiting on the CV, and will
eventually time out and be issued to disk even though the write is
unnecessary.

So, after `zil_commit` is called from `zil_suspend`, we incorrectly
assume that there are no outstanding lwb's, and proceed to free all
lwb's found on the zilog's lwb list. As a result, we free the lwb that
will later be used by `zil_commit_waiter_timeout`.

SOLUTION
========

The solution is to ensure all outstanding lwb's complete before
calling `zil_free_lwb` via `zil_destroy` in `zil_suspend`. This patch
accomplishes this goal by forcing the normal `zil_commit` logic when
called from `zil_suspend`.

Now, `zil_suspend` will call `zil_commit_impl` which will always use the
normal logic of waiting/issuing lwb's to disk before it returns. As a
result, any lwb's outstanding when `zil_commit_impl` is called will be
guaranteed to reach the `LWB_STATE_DONE` state by the time it returns.

Further, no new lwb's will be created via `zil_commit` since the zilog's
`zl_suspend` flag will be set. This will force all new callers of
`zil_commit` to use `txg_wait_synced` instead of creating and issuing
new lwb's.

Thus, all lwb's left on the zilog's lwb list when `zil_destroy` is
called will be in the `LWB_STATE_DONE` state, and we'll avoid this race
condition.
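
The difference between the two commit paths can be illustrated with a
small Python model. This is only a sketch of the reasoning above, not
the kernel code: the `Zilog` class, its fields, and the
`force_normal_commit` flag are hypothetical names chosen for the
illustration.

```python
from enum import Enum


class LwbState(Enum):
    OPEN = "LWB_STATE_OPEN"
    ISSUED = "LWB_STATE_ISSUED"
    DONE = "LWB_STATE_DONE"


class Zilog:
    """Toy stand-in for a zilog; it only tracks the state of each lwb."""

    def __init__(self, lwb_states):
        self.lwbs = list(lwb_states)
        self.suspended = False

    def commit_impl(self):
        # Normal path: every outstanding lwb is issued/waited on until DONE.
        self.lwbs = [LwbState.DONE for _ in self.lwbs]

    def commit(self):
        # While suspended, callers only wait for the txg to sync, which
        # leaves the lwb states untouched in this model.
        if self.suspended:
            return
        self.commit_impl()


def suspend(zilog, force_normal_commit):
    """Return the lwb states that the destroy step would go on to free."""
    zilog.suspended = True
    if force_normal_commit:
        zilog.commit_impl()  # behavior described for the patched code
    else:
        zilog.commit()       # behavior described for the old code
    return list(zilog.lwbs)
```

In this model the old path can hand lwb's that are still open or issued
to the destroy step, while the forced path only ever hands over lwb's
that are done.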

OpenZFS-issue: https://www.illumos.org/issues/8909
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ece62b6f8d
Closes #6940
2017-12-28 10:18:04 -08:00
lidongyang 823d48bfb1 Call commit callbacks from the tail of the list
Our ZFS-backed Lustre MDT experienced soft lockups under heavy metadata
workloads while handling transaction callbacks from osd_zfs.

The problem is zfs is not taking advantage of the fast path in
Lustre's trans callback handling, where Lustre will skip the calls
to ptlrpc_commit_replies() when it already saw a higher transaction
number.
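
A rough Python model of why the call order matters is given below. It
assumes the callbacks are queued in ascending transaction-number order
and that the expensive step is skipped once a higher number has been
seen; the function and variable names are hypothetical.

```python
def run_callbacks(transnos, from_tail):
    """Count how many callbacks have to take the expensive path.

    transnos  -- transaction numbers in the order they were queued
    from_tail -- True to call from the tail of the list, False from the head
    """
    last_committed = 0
    expensive_calls = 0
    order = reversed(transnos) if from_tail else transnos
    for transno in order:
        if transno > last_committed:
            last_committed = transno
            expensive_calls += 1  # stands in for ptlrpc_commit_replies()
        # Otherwise a higher number was already seen: fast path, no work.
    return expensive_calls
```

With the list `[1, 2, 3, 4]`, calling from the head takes the expensive
path for every entry, while calling from the tail takes it only once.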

This patch corrects this. It also has a positive impact on metadata
performance on Lustre with osd_zfs, and includes some cleanup in the
headers.

A similar issue for ext4/ldiskfs is described on:
https://jira.hpdd.intel.com/browse/LU-6527

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Li Dongyang <dongyang.li@anu.edu.au>
Closes #6986
2017-12-22 10:19:51 -08:00
Tom Caputi 44b61ea506 Remove empty files accidentally added by a8b2e306
This patch simply removes 2 empty files that were accidentally
added as part of the scrub priority patch.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6990
2017-12-22 10:17:48 -08:00
Tony Hutter c9821f1ccc Linux 4.15 compat: timer updates
Use timer_setup() macro and new timeout function definition.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #670
Closes #671
2017-12-21 10:56:32 -08:00
Tom Caputi a8b2e30685 Support re-prioritizing asynchronous prefetches
When sequential scrubs were merged, all calls to arc_read()
(including prefetch IOs) were given ZIO_PRIORITY_ASYNC_READ.
Unfortunately, this behaves badly with an existing issue where
prefetch IOs cannot be re-prioritized after they are issued. The
result is that synchronous reads end up in the same vdev_queue
as the scrub IOs and can have (in some workloads) multiple
seconds of latency.

This patch incorporates 2 changes. The first ensures that all
scrub IOs are given ZIO_PRIORITY_SCRUB to allow the vdev_queue
code to differentiate between these I/Os and user prefetches.
Second, this patch introduces zio_change_priority() to provide
the missing capability to upgrade a zio's priority.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6921 
Closes #6926
2017-12-21 09:13:06 -08:00
Simon Guest 993669a7bf vdev_id: new slot type ses
This extends vdev_id to support a new slot type, ses, for SCSI Enclosure
Services.  With slot type ses, the disk slot numbers are determined by
using the device slot number reported by sg_ses for the device with
matching SAS address, found by querying all available enclosures.

This is primarily of use on systems with a deficient driver omitting
support for bay_identifier in /sys/devices.  In my testing, I found that
the existing slot types of port and id were not stable across disk
replacement, so an alternative was required.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Simon Guest <simon.guest@tesujimath.org>
Closes #6956
2017-12-20 09:42:07 -08:00
Giuseppe Di Natale 89a66a0457 Handle broken pipes in arc_summary
Using a command similar to 'arc_summary.py | head' causes
a broken pipe exception. Gracefully exit in the case of a
broken pipe in arc_summary.py.
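
One common way to handle this in Python is to catch the I/O error and
check for EPIPE. The sketch below is not taken from arc_summary.py; the
`print_lines` helper is a hypothetical name used for illustration.

```python
import errno


def print_lines(lines, stream):
    """Write lines to stream; return False if the reader has gone away."""
    try:
        for line in lines:
            stream.write(line + "\n")
        stream.flush()
    except IOError as e:
        # On Python 3, BrokenPipeError is a subclass of IOError (OSError).
        if e.errno == errno.EPIPE:
            return False
        raise
    return True
```

A caller could exit quietly when this returns False, for example with
`sys.exit(0)`, instead of printing a traceback.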

Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6965 
Closes #6969
2017-12-19 13:19:24 -08:00
LOLi c4ba46dead Handle invalid options in arc_summary
If an invalid option is provided to arc_summary.py we handle any error
thrown from the getopt Python module and print the usage help message.
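
A minimal sketch of this pattern follows. The option letters and the
usage string are illustrative assumptions, not the actual arc_summary.py
interface.

```python
import getopt

# Hypothetical usage text for the sketch.
USAGE = "usage: arc_summary_sketch [-h] [-a] [-p PAGE]"


def parse_args(argv):
    """Return (options, None) on success or (None, message) on bad input."""
    try:
        opts, _args = getopt.getopt(argv, "ahp:")
    except getopt.GetoptError as e:
        # Invalid option or missing argument: report it and show the usage.
        return None, "{0}\n{1}".format(e, USAGE)
    return dict(opts), None
```

Returning the message rather than printing it keeps the helper easy to
exercise; a real script would print it and exit with a non-zero status.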

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6983
2017-12-19 13:02:40 -08:00
Dominik Hassler 2e7c1bb35a OpenZFS 8794 - cstyle generates warnings with recent perl
Authored by: Dominik Hassler <hadfl@omniosce.org>
Reviewed by: Andy Fiddaman <andy@omniosce.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8794
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/578f67364c
Closes #6973
2017-12-19 12:54:08 -08:00
LOLi c30e34faa1 ZTS: Fix create-o_ashift test case
The function that fills the uberblock ring buffer on every device label
has been reworked to avoid occasional failures caused by a race
condition that prevents 'zpool sync' from writing some uberblocks
sequentially: this happens when the pool sync ioctl dispatch code calls
txg_wait_synced() while we're already waiting for a TXG to sync.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6924 
Closes #6977
2017-12-19 10:49:33 -08:00
Brian Behlendorf bbffb59efc Fix multihost stale cache file import
When the multihost property is enabled it should be impossible to
import an active pool even using the force (-f) option.  This patch
prevents a forced import from succeeding when importing with a
stale cache file.

The root cause of the problem is that the kernel modules trusted
the hostid provided in configuration.  This is always correct when
the configuration is generated by scanning for the pool.  However,
when using an existing cache file the hostid could be stale which
would result in the activity check being skipped.

Resolve the issue by always using the hostid read from the label
configuration where the best uberblock was found.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6933 
Closes #6971
2017-12-18 10:28:27 -08:00
LOLi e2d936e0f8 Honor --with-mounthelperdir where applicable
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6962
2017-12-17 14:14:07 -08:00
LOLi ee410eefc2 Fix --with-systemd on Debian-based distributions (#6963)
These changes propagate the "--with-systemd" configure option to the
RPM spec file, allowing Debian-based distributions to package
systemd-related files.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6591 
Closes #6963
2017-12-17 14:08:48 -08:00
Brian Behlendorf 516c09d0d5 Remove lib/libspl/include/sys/frame.h
The functionality provided by this header is not required by any
of the ZFS user space code.  Minimal functionality was provided
in commit c28a677 which added include/sys/frame.h.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6960 
Closes #6972
2017-12-17 14:02:29 -08:00
David Qian f6940bb9ea Enable QAT support in zfs-dkms RPM
Enable QAT accelerated gzip compression in the zfs-dkms RPM package
when the environment variable ICP_ROOT is set to the QAT driver source
code folder and QAT hardware is present.  Otherwise, use default gzip
compression.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: David Qian <david.qian@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6932
2017-12-14 12:44:27 -05:00
Lalufu 9920950ccb Add zfs-import.target services in spec file
Add missing zfs-import.target to list of systemd services in zfs
RPM spec file.

Reviewed-by: Niklas Wagner <Skaro@Skaronator.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ralf Ertzinger <ralf@skytale.net>
Issue #6953 
Closes #6955
2017-12-13 17:09:22 -08:00
LOLi 4e9b156960 Various ZED fixes
 * Teach ZED to handle spares using the configured ashift: if the zpool
   'ashift' property is set then ZED should use its value when kicking
   in a hotspare; with this change 512e disks can be used as spares
   for VDEVs that were created with ashift=9, even if ZFS natively
   detects them as 4K block devices.

 * Introduce an additional auto_spare test case which verifies that in
   the face of multiple device failures an appropriate number of spares
   are kicked in.

 * Fix zed_stop() in "libtest.shlib" which did not correctly wait for
   the target pid.

 * Fix ZED crashing on startup caused by a race condition in libzfs
   when used in multi-threaded context.

 * Convert ZED over to using the tpool library which is already present
   in the Illumos FMA code.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #2562 
Closes #6858
2017-12-08 16:58:41 -08:00
Brian Behlendorf 3ab3166347 Disable vdev_zaps_004_pos
Occasionally observed failure of vdev_zaps_004_pos due to the test
case not being 100% reliable.  In order to prevent false positives
disable this test case until it can be made reliable.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #6935
Closes #6936
2017-12-07 16:43:59 -08:00
Brian Behlendorf c28a67733c Suppress incorrect objtool warnings
Suppress incorrect warnings from versions of objtool which are not
aware of x86 EVEX prefix instructions used for AVX512.

  module/zfs/vdev_raidz_math_avx512bw.o: warning:
  objtool: <func+offset>: can't find jump dest instruction at .text

Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6928
2017-12-07 10:28:50 -08:00
Tony Hutter 674b89342e Fix segfault in zpool iostat when adding VDEVs
Fix a segfault when running 'zpool iostat -v 1' while adding
a VDEV.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #6748 
Closes #6872
2017-12-06 11:43:07 -08:00
Prakash Surya 1b2b0acab5 OpenZFS 8603 - rename zilog's "zl_writer_lock" to "zl_issuer_lock"
This is a purely cosmetic change. The zilog's "zl_writer_lock" field is
being renamed to "zl_issuer_lock" to try and make the code easier to
understand; no other changes are made.

Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: C Fraire <cfraire@me.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8603
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/2daf06546b
Closes #6927
2017-12-06 11:38:10 -08:00
Brian Behlendorf 0c415a93d2 Disable create-o_ashift
Occasionally observed failure of create-o_ashift due to the test
case not being 100% reliable.  In order to prevent false positives
disable this test case until it can be made reliable.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #6924
Closes #6925
2017-12-06 10:13:54 -08:00
Prakash Surya 1ce23dcaff OpenZFS 8585 - improve batching done in zil_commit()
Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Prakash Surya <prakash.surya@delphix.com>

Problem
=======

The current implementation of zil_commit() can introduce significant
latency, beyond what is inherent due to the latency of the underlying
storage. The additional latency comes from two main problems:

 1. When there are outstanding ZIL blocks being written (i.e. there's
    already a "writer thread" in progress), then any new calls to
    zil_commit() will block waiting for the currently outstanding ZIL
    blocks to complete. The blocks written for each "writer thread" are
    coined a "batch", and there can only ever be a single "batch" being
    written at a time. When a batch is being written, any new ZIL
    transactions will have to wait for the next batch to be written,
    which won't occur until the current batch finishes.

    As a result, the underlying storage may not be used as efficiently
    as possible. While "new" threads enter zil_commit() and are blocked
    waiting for the next batch, it's possible that the underlying
    storage isn't fully utilized by the current batch of ZIL blocks. In
    that case, it'd be better to allow these new threads to generate
    (and issue) a new ZIL block, such that it could be serviced by the
    underlying storage concurrently with the other ZIL blocks that are
    being serviced.

 2. Any call to zil_commit() must wait for all ZIL blocks in its "batch"
    to complete, prior to zil_commit() returning. The size of any given
    batch is proportional to the number of ZIL transactions in the queue
    at the time that the batch starts processing the queue; which
    doesn't occur until the previous batch completes. Thus, if there's a
    lot of transactions in the queue, the batch could be composed of
    many ZIL blocks, and each call to zil_commit() will have to wait for
    all of these writes to complete (even if the thread calling
    zil_commit() only cared about one of the transactions in the batch).

To further complicate the situation, these two issues result in the
following side effect:

 3. If a given batch takes longer to complete than normal, this results
    in larger batch sizes, which then take longer to complete and
    further drive up the latency of zil_commit(). This can occur for a
    number of reasons, including (but not limited to): transient changes
    in the workload, and storage latency irregularities.

Solution
========

The solution attempted by this change has the following goals:

 1. no on-disk changes; maintain current on-disk format.
 2. modify the "batch size" to be equal to the "ZIL block size".
 3. allow new batches to be generated and issued to disk, while there's
    already batches being serviced by the disk.
 4. allow zil_commit() to wait for as few ZIL blocks as possible.
 5. use as few ZIL blocks as possible, for the same amount of ZIL
    transactions, without introducing significant latency to any
    individual ZIL transaction. i.e. use fewer, but larger, ZIL blocks.

In theory, with these goals met, the new algorithm will allow the
following improvements:

 1. new ZIL blocks can be generated and issued, while there's already
    outstanding ZIL blocks being serviced by the storage.
 2. the latency of zil_commit() should be proportional to the underlying
    storage latency, rather than the incoming synchronous workload.

Porting Notes
=============

Due to the changes made in commit 119a394ab0, the lifetime of an itx
structure differs from OpenZFS. Specifically, the itx structure is
kept around until the data associated with the itx is considered to be
safe on disk; this is so that the itx's callback can be called after the
data is committed to stable storage. Since OpenZFS doesn't have this itx
callback mechanism, it's able to destroy the itx structure immediately
after the itx is committed to an lwb (before the lwb is written to
disk).

To support this difference, and to ensure the itx's callbacks can still
be called after the itx's data is on disk, a few changes had to be made:

  * A list of itxs was added to the lwb structure. This list contains
    all of the itxs that have been committed to the lwb, such that the
    callbacks for these itxs can be called from zil_lwb_flush_vdevs_done(),
    after the data for the itxs is committed to disk.

  * A list of itxs was added on the stack of the zil_process_commit_list()
    function; the "nolwb_itxs" list. In some circumstances, an itx may
    not be committed to an lwb (e.g. if allocating the "next" ZIL block
    on disk fails), so this list is used to keep track of which itxs
    fall into this state, such that their callbacks can be called after
    the ZIL's writer pipeline is "stalled".

  * The logic to actually call the itx's callback was moved into the
    zil_itx_destroy() function. Since all consumers of zil_itx_destroy()
    were effectively performing the same logic (i.e. if callback is
    non-null, call the callback), it seemed like useful code cleanup to
    consolidate this logic into a single function.
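
The consolidation described in the last item can be pictured with a
short Python sketch. The `Itx` class and its attribute names are
hypothetical; only the "call the callback if one is set, then destroy"
shape is taken from the text above.

```python
class Itx:
    """Toy itx carrying an optional completion callback."""

    def __init__(self, callback=None, callback_data=None):
        self.callback = callback
        self.callback_data = callback_data
        self.destroyed = False


def itx_destroy(itx):
    """Single place that runs the callback (if any) before teardown."""
    if itx.callback is not None:
        itx.callback(itx.callback_data)
    itx.destroyed = True
```

Every consumer then calls the one destroy function and no longer needs
its own copy of the null check.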

Additionally, the existing Linux tracepoint infrastructure dealing with
the ZIL's probes and structures had to be updated to reflect these code
changes. Specifically:

  * The "zil__cw1" and "zil__cw2" probes were removed, so they had to be
    removed from "trace_zil.h" as well.

  * Some of the zilog structure's fields were removed, which affected
    the tracepoint definitions of the structure.

  * New tracepoints had to be added for the following 3 new probes:
      * zil__process__commit__itx
      * zil__process__normal__itx
      * zil__commit__io__error

OpenZFS-issue: https://www.illumos.org/issues/8585
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5d95a3a
Closes #6566
2017-12-05 09:39:16 -08:00
Brian Behlendorf 7b3407003f Fix NFS sticky bit permission denied error
When zfs_sticky_remove_access() was originally adapted for Linux
a typo was made which altered the intended behavior.  As described
in the block comment, the intended behavior is that permission
should be granted when the entry is a regular file and you have
write access.  That is, S_ISREG should have been used instead of
S_ISDIR.
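
The intended check can be sketched in Python with the standard `stat`
module. This is a simplified model, not the ZFS function: the owner
checks reflect the usual sticky-directory rules and are an assumption
here, the privilege fallback is omitted, and all parameter names are
hypothetical.

```python
import stat


def sticky_remove_allowed(dir_mode, file_mode, uid, dir_owner, file_owner,
                          has_write_access):
    """Return True if uid may remove the entry from the directory."""
    if not dir_mode & stat.S_ISVTX:
        return True  # directory is not sticky, nothing extra to check
    if uid in (dir_owner, file_owner):
        return True  # owners may always remove the entry
    # Intended rule: a regular file the caller can write to.
    # The typo described above amounts to testing S_ISDIR here instead.
    return stat.S_ISREG(file_mode) and has_write_access
```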

Restricting permission to regular files made good sense for older
systems where setting the bit on executable files would instruct
the system to save the program's text segment on the swap device.

On modern systems this behavior has been replaced by the sticky
bit acting as a restricted deletion flag and the plain file
restriction has been relaxed.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6889 
Closes #6910
2017-12-04 11:55:57 -08:00
JKDingwall 9717fe052b Add /usr/bin/env to COPY_EXEC_LIST initramfs hook
5dc1ff29 changed the user space program to mount a zfs snapshot
from /bin/sh to /usr/bin/env.  If the executable is not present
in the initramfs then snapshots cannot be automounted.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: James Dingwall <james.dingwall@zynstra.com>
Closes #5360 
Closes #6913
2017-12-04 11:53:57 -08:00
Brian Behlendorf ea39f75f64 Fix 'zpool create|add' replication level check
When the pool configuration contains a hole due to a previous device
removal ignore this top level vdev.  Failure to do so will result in
the current configuration being assessed to have a non-uniform
replication level and the expected warning will be disabled.

The zpool_add_010_pos test case was extended to cover this scenario.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6907 
Closes #6911
2017-12-04 11:50:35 -08:00
Brian Behlendorf 72841b9fd9 Preserve itx alloc size for zio_data_buf_free()
Using zio_data_buf_alloc() to allocate the itx's may be unsafe
because the itx->itx_lr.lrc_reclen field is not constant from
allocation to free.  Using a different itx->itx_lr.lrc_reclen
size in zio_data_buf_free() can result in the allocation being
returned to the wrong kmem cache.

This issue can be avoided entirely by storing the allocation size
in itx->itx_size and using that for zio_data_buf_free().
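
The idea can be modelled in a few lines of Python. The allocator below
is a toy with one counter per size, standing in for per-size caches;
the class and attribute names are hypothetical and the size calculation
is simplified.

```python
class SizeClassAllocator:
    """Toy allocator that tracks live allocations per size."""

    def __init__(self):
        self.outstanding = {}  # size -> number of live allocations

    def alloc(self, size):
        self.outstanding[size] = self.outstanding.get(size, 0) + 1

    def free(self, size):
        if self.outstanding.get(size, 0) == 0:
            raise ValueError("free of %d bytes matches no allocation" % size)
        self.outstanding[size] -= 1


class Itx:
    def __init__(self, allocator, reclen):
        self.allocator = allocator
        self.reclen = reclen  # record length, may change after allocation
        self.size = reclen    # allocation size, recorded once
        allocator.alloc(self.size)

    def destroy(self):
        # Free with the recorded size, never with the mutable reclen.
        self.allocator.free(self.size)
```

Because `destroy` uses the recorded size, a later change to the record
length cannot send the buffer back to the wrong size class.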

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6912
2017-12-04 11:44:39 -08:00
Tom Caputi d4677269f2 Unbreak the scan status ABI
When d4a72f23 was merged, pss_pass_issued was incorrectly
added to the middle of the pool_scan_stat_t structure
instead of the end. This patch simply corrects this issue.
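
The effect on the ABI can be shown with Python's `ctypes`. The
structures below are illustrative two-field stand-ins, not the real
pool_scan_stat_t layout; only the name pss_pass_issued comes from the
text above.

```python
import ctypes


class ScanStatOld(ctypes.Structure):
    _fields_ = [("first", ctypes.c_uint64),
                ("second", ctypes.c_uint64)]


class ScanStatMiddle(ctypes.Structure):
    # New field inserted in the middle: "second" moves to a new offset.
    _fields_ = [("first", ctypes.c_uint64),
                ("pss_pass_issued", ctypes.c_uint64),
                ("second", ctypes.c_uint64)]


class ScanStatEnd(ctypes.Structure):
    # New field appended: existing fields keep their offsets.
    _fields_ = [("first", ctypes.c_uint64),
                ("second", ctypes.c_uint64),
                ("pss_pass_issued", ctypes.c_uint64)]
```

A consumer built against the old layout still finds `second` at the
same offset in `ScanStatEnd`, but not in `ScanStatMiddle`.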

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6909
2017-11-30 09:40:13 -08:00
LOLi ed15d54481 Fix 'zfs get {user|group}objused@' functionality
Fix a regression accidentally introduced in 1b81ab4 that prevents
'zfs get {user|group}objused@' from correctly reporting the requested
value.

Update "userspace_003_pos.ksh" and "groupspace_003_pos.ksh" to verify
this functionality.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6908
2017-11-29 11:59:22 -08:00
Mark Wright 56d8d8ace4 Linux 4.14 compat: CONFIG_GCC_PLUGIN_RANDSTRUCT
Fix build errors with gcc 7.2.0 on Gentoo with kernel 4.14
built with CONFIG_GCC_PLUGIN_RANDSTRUCT=y such as:

module/nvpair/nvpair.c:2810:2:error:
positional initialization of field in 'struct' declared with
'designated_init' attribute [-Werror=designated-init]
  nvs_native_nvlist,
  ^~~~~~~~~~~~~~~~~

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mark Wright <gienah@gentoo.org>
Closes #5390 
Closes #6903
2017-11-28 17:33:48 -06:00
Richard Laager 48ac22d855 initramfs: Honor canmount=off
The initramfs script was not honoring canmount=off.  With this change,
it does.  If the administrator has asked that a filesystem not be
mounted, that should be honored.

As an exception, the initramfs script ignores canmount=off on the
rootfs.  The rootfs should not have canmount=off set either.  However,
mounting it anyway seems harmless because it is being asked for
explicitly.  The point of this exception is to avoid the risk of
breaking existing systems, just in case someone has canmount=off set on
their rootfs.

The initramfs still mounts filesystems with canmount=noauto.  This is
necessary because it is typical to set that on the rootfs so that it can
be cloned.  Without canmount=noauto, the clones' duplicate mountpoints
would conflict.
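
The resulting canmount rules can be summarized as a small decision
function. This is a Python sketch of the behavior described above, not
the initramfs shell script; the function name and arguments are
hypothetical.

```python
def initramfs_should_mount(canmount, is_rootfs):
    """Decide whether the initramfs mounts a filesystem.

    canmount  -- value of the canmount property: "on", "off" or "noauto"
    is_rootfs -- True for the explicitly requested root filesystem
    """
    if is_rootfs:
        # Explicitly requested, so canmount=off is ignored here.
        return True
    # canmount=noauto is still mounted; only canmount=off is skipped.
    return canmount != "off"
```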

This is the remainder of the fix for:
https://github.com/zfsonlinux/pkg-zfs/issues/221

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #6897
2017-11-28 09:38:13 -08:00
Richard Laager bd2958dea0 initramfs: Honor mountpoint=none/legacy
For filesystems that are children of the rootfs, when mountpoint=none or
mountpoint=legacy, the initramfs script would assume a mountpoint based
on the dataset path.  Given that the rootfs should have mountpoint=/ and
mountpoint inheritance is the default behavior of ZFS, this behavior
seems unnecessary.  In any event, it turns mountpoint=none into a no-op.
That removes this option from the administrator, and if someone uses it,
it does not work as expected.  Worse yet, if the mountpoint directory
does not exist (which is the typical case for mountpoint=none), the
mounting and thus the boot process will fail.  For the case of
mountpoint=legacy, the assumed mountpoint may not be the correct value
set in /etc/fstab.

This change makes the initramfs script not mount the filesystem in
either case.  For mountpoint=none, this means we are correctly honoring
the setting.  For mountpoint=legacy, there are two scenarios:  If
canmount=on, the filesystem will be mounted by the normal mechanisms
later in the boot process.  If canmount=noauto, the filesystem will not
be mounted at all, unless the administrator has done something special.
If they're not doing something special and they want it mounted by the
initramfs, they can simply not set mountpoint=legacy.

This is part of the fix for:
https://github.com/zfsonlinux/pkg-zfs/issues/221

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #6897
2017-11-28 09:38:00 -08:00
DeHackEd 1c68856bca zpool(8): Fix "zpool import -t"
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: DHE <git@dehacked.net>
Closes #6894
2017-11-28 11:10:52 -06:00
Brian Behlendorf 94183a9d8a Update for cppcheck v1.80
Resolve new warnings and errors from cppcheck v1.80.

* [lib/libshare/libshare.c:543]: (warning)
  Possible null pointer dereference: protocol
* [lib/libzfs/libzfs_dataset.c:2323]: (warning)
  Possible null pointer dereference: srctype
* [lib/libzfs/libzfs_import.c:318]: (error)
  Uninitialized variable: link
* [module/zfs/abd.c:353]: (error) Uninitialized variable: sg
* [module/zfs/abd.c:353]: (error) Uninitialized variable: i
* [module/zfs/abd.c:385]: (error) Uninitialized variable: sg
* [module/zfs/abd.c:385]: (error) Uninitialized variable: i
* [module/zfs/abd.c:553]: (error) Uninitialized variable: i
* [module/zfs/abd.c:553]: (error) Uninitialized variable: sg
* [module/zfs/abd.c:763]: (error) Uninitialized variable: i
* [module/zfs/abd.c:763]: (error) Uninitialized variable: sg
* [module/zfs/abd.c:305]: (error) Uninitialized variable: tmp_page
* [module/zfs/zpl_xattr.c:342]: (warning)
   Possible null pointer dereference: value
* [module/zfs/zvol.c:208]: (error) Uninitialized variable: p

Convert the following suppression to inline.

* [module/zfs/zfs_vnops.c:840]: (error)
  Possible null pointer dereference: aiov

Exclude HAVE_UIO_ZEROCOPY and HAVE_DNLC from analysis since
these macros will never be defined until this functionality
is implemented.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6879
2017-11-18 14:08:00 -08:00
Scot W. Stevenson 8d18776973 Fix data on evict_skips in arc_summary.py
Display correct data from kstat arcstats for evict_skips,
which is currently repeating the data from mutex_misses.
Fixes #6882

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Scot W. Stevenson <scot.stevenson@gmail.com>
Closes #6882 
Closes #6883
2017-11-18 14:07:04 -08:00
DeHackEd da5d4697a8 Fix ARC pointer overrun
Only access the `b_crypt_hdr` field of an ARC header if the content
is encrypted.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes #6877
2017-11-17 15:11:39 -08:00
Tom Caputi d4a72f2386 Sequential scrub and resilvers
Currently, scrubs and resilvers can take an extremely
long time to complete. This is largely due to the fact
that zfs scans process pools in logical order, as
determined by each block's bookmark. This makes sense
from a simplicity perspective, but blocks in zfs are
often scattered randomly across disks, particularly
due to zfs's copy-on-write mechanisms.

This patch improves performance by splitting scrubs
and resilvers into a metadata scanning phase and an IO
issuing phase. The metadata scan reads through the
structure of the pool and gathers an in-memory queue
of I/Os, sorted by size and offset on disk. The issuing
phase will then issue the scrub I/Os as sequentially as
possible, greatly improving performance.
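
The two phases can be sketched in Python as follows. This is a
simplified model that sorts by on-disk offset only and measures the
gaps between consecutive I/Os; the names are hypothetical and the real
queue is more elaborate.

```python
from collections import namedtuple

ScanIO = namedtuple("ScanIO", ["offset", "size"])


def scan_metadata(blocks):
    """Phase 1: walk blocks in logical order and queue their I/Os."""
    return [ScanIO(offset, size) for offset, size in blocks]


def issue_order(queue):
    """Phase 2: issue the queued I/Os in ascending on-disk offset."""
    return sorted(queue, key=lambda io: io.offset)


def seek_distance(ios):
    """Total gap between the end of one I/O and the start of the next."""
    return sum(abs(b.offset - (a.offset + a.size))
               for a, b in zip(ios, ios[1:]))
```

For blocks that are contiguous on disk but scattered in logical order,
the sorted issue order brings the total seek distance down to zero.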

This patch also updates and cleans up some of the scan
code which has not been updated in several years.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Authored-by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Authored-by: Alek Pinchuk <apinchuk@datto.com>
Authored-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #3625 
Closes #6256
2017-11-15 17:27:01 -08:00
Brian Behlendorf ed19bccfb6 Linux 4.14 compat: vfs_read & vfs_write
The kernel_read & kernel_write functions have always wrapped the
vfs_read & vfs_write functions respectively.  However, they could
not be used by vn_rdwr() since the offset wasn't passed as a
pointer.  This prevented us from being able to properly update
the file offset.

Linux 4.14 unexported vfs_read & vfs_write but also changed the
signature of kernel_read & kernel_write to provide the needed
functionality.  Use these updated functions when available.

Reviewed-by: Pritam Baral <pritam@pritambaral.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #656 
Closes #667
2017-11-15 17:19:23 -08:00
Scot W. Stevenson e301113c17 Minor code cleanups in arc_summary.py
Remove unused library re and associated variable kstat_pobj. Add note
to documentation at start of program about required support for old
versions of Python. Change variable "format" (which is a built-in
function) to "fmt".

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Scot W. Stevenson <scot.stevenson@gmail.com>
Closes #6869
2017-11-15 10:28:11 -08:00
Brian Behlendorf 454365bbaa Fix dirty check in dmu_offset_next()
The correct way to determine if a dnode is dirty is to check
if any of the dn->dn_dirty_link's are active.  Relying solely
on the dn->dn_dirtyctx can result in the dnode being mistakenly
reported as clean.

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3125 
Closes #6867
2017-11-15 10:19:32 -08:00
Brian Behlendorf 13589da974 Disable automatic dependencies in zfs-test package
All of the ZTS test scripts specify /bin/ksh as the interpreter.
Unfortunately, as of Fedora 27 only /usr/bin/ksh is provided by
the package manager.  Rather than change all the scripts to
accommodate the latest Fedora, disable automatic dependencies
for the zfs-test package.  Functionally this will not cause
any problems since /bin is a symlink to /usr/bin.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6868
2017-11-15 09:12:52 -08:00
Brian Behlendorf 71788d91f4 Disable zvol_ENOSPC_001_pos on 32-bit systems
Occasionally observed failure of zvol_ENOSPC_001_pos due to the
test case taking too long to complete.  Disable the test case until
it can be improved.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #5848 
Closes #6862
2017-11-13 16:26:15 -08:00
LOLi 99834d1950 Fix truncate(2) mtime and ctime handling
On Linux, ftruncate(2) always changes the file timestamps, even if the
file size is not changed. However, in case of a successful
truncate(2), the timestamps are updated only if the file size changes.
This translates to the VFS calling the ZFS Posix Layer "setattr"
function (zpl_setattr) with ATTR_MTIME and ATTR_CTIME unconditionally
set on the iattr mask only when doing a ftruncate(2), while the
truncate(2) is left to the filesystem implementation to be dealt with.

This behaviour is consistent with POSIX:2004/SUSv3 specifications
where there's no explicit requirement for file size changes to update
the timestamps only for ftruncate(2):

http://pubs.opengroup.org/onlinepubs/009695399/functions/truncate.html
http://pubs.opengroup.org/onlinepubs/009695399/functions/ftruncate.html

This has been later updated in POSIX:2008/SUSv4 where, for both
truncate(2)/ftruncate(2), there's no mention of this size change
requirement:

http://austingroupbugs.net/view.php?id=489
http://pubs.opengroup.org/onlinepubs/9699919799/functions/truncate.html
http://pubs.opengroup.org/onlinepubs/9699919799/functions/ftruncate.html

Unfortunately the Linux VFS is still calling into the ZPL without
ATTR_MTIME/ATTR_CTIME set in the truncate(2) case: we fix this by
explicitly updating the timestamps when detecting the ATTR_SIZE bit,
which is always set in do_truncate(), on the iattr mask.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6811 
Closes #6819
2017-11-13 09:24:26 -08:00
George Melikov 13042a6ccd Add .travis.yml
Travis builders have a maximum work time of ~49 minutes,
so we have to use 5 builders and spread the ZTS over
them using test group tags.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #6829
2017-11-13 09:18:18 -08:00
Scot W. Stevenson 5277f208f2 Fix arc_summary.py -d crash with Python3
Prevents arc_summary.py crashing when called with parameter -d or
long form --description with Python3.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Scot W. Stevenson <scot.stevenson@gmail.com>
Closes #6849 
Closes #6850
2017-11-11 20:27:43 -08:00
benrubson 7c351e31d5 OpenZFS 7531 - Assign correct flags to prefetched buffers
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Authored by: abraunegg <alex.braunegg@gmail.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7531
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/468008cb
2017-11-11 20:24:34 -08:00
Arkadiusz Bubała c0daec32f8 Long hold the dataset during upgrade
If a receive or rollback is performed while a filesystem is upgrading,
the objset may be evicted in `dsl_dataset_clone_swap_sync_impl`. This
leads to a NULL pointer dereference when the upgrade tries to access
the evicted objset.

This commit adds a long hold on the dataset for the whole upgrade
process. Receive and rollback will return an EBUSY error until the
upgrade has finished.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Arkadiusz Bubała <arkadiusz.bubala@open-e.com>
Closes #5295 
Closes #6837
2017-11-10 13:37:10 -08:00
Tom Caputi 62df1bc813 Fix encryption root hierarchy issue
After doing a recursive raw receive, zfs userspace performs
a final pass to adjust the encryption root hierarchy as
needed. Unfortunately, the FORCE_INHERIT ioctl had a bug
which caused the encryption root to always be assigned to
the direct parent instead of the inheriting parent. This
patch simply fixes this issue.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6847 
Closes #6848
2017-11-08 15:25:30 -08:00
Tim Chase 71a24c3c52 Handle compressed buffers in __dbuf_hold_impl()
In __dbuf_hold_impl(), if a buffer is currently syncing and is still
referenced from db_data, a copy is made in case it is dirtied again in
the txg.  Previously, the buffer for the copy was simply allocated with
arc_alloc_buf() which doesn't handle compressed or encrypted buffers
(which are a special case of a compressed buffer).  The result was
typically an invalid memory access because the newly-allocated buffer
was of the uncompressed size.

This commit fixes the problem by handling the 2 compressed cases,
encrypted and unencrypted, respectively, with arc_alloc_raw_buf() and
arc_alloc_compressed_buf().

Although using the proper allocation functions fixes the invalid memory
access by allocating a buffer of the compressed size, another unrelated
issue made it impossible to properly detect compressed buffers in the
first place.  The header's compression flag was set to ZIO_COMPRESS_OFF
in arc_write() when it was possible that an attached buffer was actually
compressed.  This commit adds logic to only set ZIO_COMPRESS_OFF in
the non-ZIO_RAW case which will handle both cases of compressed buffers
(encrypted or unencrypted).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #5742 
Closes #6797
2017-11-08 13:32:15 -08:00
Giuseppe Di Natale eef005d882 Refresh TEST file to include new variables
Add any missing variables for CI control to TEST.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6846
2017-11-08 11:09:30 -08:00
Antonio Russo 80b485246a Cleanup systemd dependencies
Some redundancy is present in the systemd dependencies, as
noticed in PR#6764. Existing setups might rely on these quirks,
so these cleanups have been moved to the development branch.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com>
Closes #6822
2017-11-08 09:39:15 -08:00
LOLi 011ef12c7a Fix undefined %{systemd_svcs} in RPM scriptlets
This allows RPM-based systems to properly control package installation
and removal when using systemd.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6838 
Closes #6841
2017-11-08 09:16:37 -08:00
Brian Behlendorf d8fdfc2d65 OpenZFS 8607 - variable set but not used
Reviewed by: Yuri Pankov <yuripv@gmx.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Authored by: Toomas Soome <tsoome@me.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8607
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b852c2f5
Closes #6842
2017-11-08 09:09:45 -08:00
Giuseppe Di Natale 87fbf43636 Provide tags in perf-regression.run
A prior commit changed test-runner to enable tagging
of testgroups within a test suite runfile. They must
be specified in each runfile. Update the runfile for
performance regressions so it is properly tagged.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6830
2017-11-07 15:06:27 -08:00
LOLi 271955da3e Fix zfs-tests.sh single test functionality
Without any tag specified in the runtime-generated runfile, the
test-runner will not execute the test provided on the command line.
Fix this by adding tag information to the custom runfile.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6826
2017-11-07 14:55:31 -08:00
LOLi cb3b0419ba contrib/initramfs: switch to automake
Use automake to build initramfs scripts and hooks.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6761
2017-11-07 14:53:57 -08:00
wli5 a3df7fa79d Bug fix in qat_compress.c when compressed size is < 4KB
When a 128KB block is compressed to less than 4KB, the pointer
to the Footer is not at the end of the compressed buffer because
the Header offset was added twice in this case, leaving a gap
between the compressed buffer and the Footer.
1. Always compute the Footer pointer address from the start of the
last page.
2. Remove the unused workaround code; the underlying issue has been
verified fixed with the latest driver and this fix.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Weigang Li <weigang.li@intel.com>
Closes #6827
2017-11-07 14:51:30 -08:00
Scot W. Stevenson 681957fe2e Sort output of tunables in arc_summary.py
Sort list of tunables printed by _tunable_summary()
alphabetically

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Scot W. Stevenson <scot.stevenson@gmail.com>
Closes #6828
2017-11-07 14:50:15 -08:00
Brian Behlendorf 12954494a1 Disable automatic dependencies in DKMS package
By default additional dependencies are generated automatically for
packages.  This is normally a good thing because it helps ensure
things just work.  It doesn't make sense for the DKMS package which
requires minimal dependencies that can be easily listed.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6467 
Closes #6835
2017-11-07 10:59:27 -08:00
Don Brady 31df97cdab Build regression in c89 cleanups
Fixed build regression in non-debug builds from recent cleanups of
c89 workarounds.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #6832
2017-11-07 10:42:15 -08:00
George Melikov b58b73ce74 Disable zpool_import_missing_003_pos
Rarely observed failure of zpool_import_missing_003_pos during
automated testing due to timeout.  Disable the test case until
it can be improved.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Issue #6839 
Closes #6840
2017-11-07 10:32:04 -08:00
Scot W. Stevenson 23ea00a1fe Add documentation strings to arc_summary.py
Include docstrings (PEP8, PEP257) for module and all functions.
Separately, remove outdated section in comment at start of
module. Separately, remove unused global constant "usetunable".

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Scot W. Stevenson <scot.stevenson@gmail.com>
Closes #6818
2017-11-05 13:11:37 -08:00
George G 2df9ad1c07 Fix column alignment with long zpool names
`zpool status` normally aligns NAME/STATE/etc columns:

    NAME                       STATE     READ WRITE CKSUM
    dummy                      ONLINE       0     0     0
      mirror-0                 ONLINE       0     0     0
        /tmp/dummy-long-1.bin  ONLINE       0     0     0
        /tmp/dummy-long-2.bin  ONLINE       0     0     0
      mirror-1                 ONLINE       0     0     0
        /tmp/dummy-long-3.bin  ONLINE       0     0     0
        /tmp/dummy-long-4.bin  ONLINE       0     0     0

However, if the zpool name is longer than the vdev names, alignment
issues arise:

    NAME                  STATE     READ WRITE CKSUM
    dummy-very-very-long-zpool-name  ONLINE       0     0     0
      mirror-0            ONLINE       0     0     0
        /tmp/dummy-1.bin  ONLINE       0     0     0
        /tmp/dummy-2.bin  ONLINE       0     0     0
      mirror-1            ONLINE       0     0     0
        /tmp/dummy-3.bin  ONLINE       0     0     0
        /tmp/dummy-4.bin  ONLINE       0     0     0

`zpool iostat` and `zpool import` are also affected:

                  capacity     operations     bandwidth
    pool        alloc   free   read  write   read  write
    ----------  -----  -----  -----  -----  -----  -----
    dummy        104K  1.97G      0      0    152  9.84K
    dummy-very-very-long-zpool-name   152K  1.97G      0      1    144  13.1K
    ----------  -----  -----  -----  -----  -----  -----

    dummy-very-very-long-zpool-name  ONLINE
      mirror-0            ONLINE
        /tmp/dummy-1.bin  ONLINE
        /tmp/dummy-2.bin  ONLINE
      mirror-1            ONLINE
        /tmp/dummy-3.bin  ONLINE
        /tmp/dummy-4.bin  ONLINE
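
A minimal sketch of one way to keep the columns aligned, assuming the
name column is sized from the longest indented name in the whole tree
(pool name included). This is not the zpool implementation; the
(name, children) tuples, the two-space indent per level and both
function names are assumptions made for the example.

```python
def name_column_width(tree, depth=0, minimum=len("NAME")):
    """Return the widest indented name in a (name, children) tree."""
    name, children = tree
    width = max(minimum, len(name) + 2 * depth)
    for child in children:
        width = max(width, name_column_width(child, depth + 1, minimum))
    return width


def render(tree, width, depth=0):
    """Yield 'NAME  STATE' rows padded to one common name width."""
    name, children = tree
    yield "{0:<{1}}  ONLINE".format("  " * depth + name, width)
    for child in children:
        for row in render(child, width, depth + 1):
            yield row
```

Because the pool name takes part in the width calculation, a long pool
name widens the column for every row instead of pushing only its own
row out of line.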

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Gaydarov <git@gg7.io>
Closes #6786
2017-11-05 13:09:56 -08:00
Scot W. Stevenson cd1813d36e Rewrite fHits() in arc_summary.py with SI units
Complete rewrite of fHits(). Move units from non-standard English
abbreviations to SI units, thereby avoiding confusion because of
"long scale" and "short scale" numbers. Remove unused parameter
"Decimal". Add function string. Aim to conform to PEP8.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Scot W. Stevenson <scot.stevenson@gmail.com>
Closes #6815
2017-11-04 13:33:28 -07:00
Don Brady 1c27024e22 Undo c89 workarounds to match with upstream
With PR 5756 the zfs module now supports c99 and the
remaining past c89 workarounds can be undone.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Don Brady <don.brady@delphix.com>
Closes #6816
2017-11-04 13:25:13 -07:00
Scot W. Stevenson df1f129bc4 Minor code cleanup in arc_summary.py
Simplify and inline single-use function div1(); inline twice-used
function div2(); add function comment to zfs_header(); replace
variable "unused" in get_Kstat() with "_" following convention.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Scot W. Stevenson <scot.stevenson@gmail.com>
Closes #6802
2017-11-03 15:43:53 -07:00
Brian Behlendorf 34c2b3680b Initramfs fixes
* initramfs: Fix inconsistent whitespace
* initramfs: Fix a spelling error
* initramfs: Set elevator=noop on the rpool's disks

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes #6807
2017-11-03 15:38:52 -07:00
Giuseppe Di Natale 9a810efb02 Allow test-runner to filter test groups by tag
Enable test-runner to accept a list of tags to identify
which test groups the user wishes to run.

Also allow test-runner to perform multiple iterations
of a test run.
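
The behaviour described above can be sketched roughly as follows. This
is an assumption-laden illustration, not the test-runner code: the
function names, the (name, tags) pairs and the group names in the usage
note are hypothetical, and the real option names and runfile syntax are
not shown here.

```python
def select_groups(groups, wanted_tags):
    """Return names of test groups sharing at least one tag with wanted_tags."""
    wanted = set(wanted_tags)
    return [name for name, tags in groups if wanted & set(tags)]


def run_order(groups, wanted_tags, iterations=1):
    """Return the ordered list of groups to execute across all iterations."""
    selected = select_groups(groups, wanted_tags)
    return [name for _ in range(iterations) for name in selected]
```

For example, with one group tagged "functional" and another tagged
"perf", asking for "perf" with iterations=2 lists the second group
twice and never lists the first.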

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6788
2017-11-03 09:53:32 -07:00
Richard Laager 4fc411f7a3 initramfs: Set elevator=noop on the rpool's disks
ZFS already sets elevator=noop for wholedisk vdevs (for all pools), but
typical root-on-ZFS installations use partitions.  This sets
elevator=noop on the disks in the root pool.

Ubuntu 16.04 and 16.10 had this.  It was lost in 17.04 due to Debian
switching to this upstream initramfs script.

Signed-off-by: Richard Laager <rlaager@wiktel.com>
2017-11-01 21:54:56 -05:00
Richard Laager 11b9dcfb2d initramfs: Fix a spelling error
This fixes a typo in a comment.

Signed-off-by: Richard Laager <rlaager@wiktel.com>
2017-11-01 21:54:28 -05:00
Richard Laager 4767c7a14e initramfs: Fix inconsistent whitespace
This fixes one instance of inconsistent whitespace.

Signed-off-by: Richard Laager <rlaager@wiktel.com>
2017-11-01 21:53:22 -05:00
Brian Behlendorf c9427c4696 Add scan.coverity.com badge to README
Include the scan.coverity.com status badge in the top level README.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6801
2017-10-30 16:21:24 -07:00
Jason King f3c8c9e6f0 OpenZFS 640 - number_to_scaled_string is duplicated in several commands
Porting Notes:
- The OpenZFS patch added nicenum_scale() and nicenum() to a
  library not used by ZFS.  Rather than pull in a new dependency
  the version of nicenum in lib/libzpool/util.c was simply
  replaced with the new one.

Reviewed by: Sebastian Wiedenroth <wiedi@frubar.net>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Yuri Pankov <yuripv@gmx.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Authored by: Jason King <jason.brian.king@gmail.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/640
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0a055120
Closes #6796
2017-10-30 14:47:20 -07:00
Scot W. Stevenson 47c8e7fd97 Rewrite of function fBytes() in arc_summary.py
Replace if-elif-else construction with shorter loop;
remove unused parameter "Decimal"; centralize format
string; add function documentation string; conform to
PEP8.
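
As an illustration of the loop-based approach (a sketch, not the code
from arc_summary.py): the function name format_bytes, the unit list and
the format string are assumptions made for this example.

```python
def format_bytes(value, fmt="{0:.1f} {1}"):
    """Return a human-readable string for a byte count (binary prefixes)."""
    units = ["Bytes", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB", "ZiB", "YiB"]
    size = float(value)
    # Walk up the units until the value fits below 1024 of the current one.
    for unit in units[:-1]:
        if abs(size) < 1024.0:
            return fmt.format(size, unit)
        size /= 1024.0
    # Anything larger is reported in the biggest unit available.
    return fmt.format(size, units[-1])
```

A single format string and a single list of units replace one branch
per unit, so adding a unit means adding one list entry.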

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Scot W. Stevenson <scot.stevenson@gmail.com>
Closes #6784
2017-10-30 14:44:35 -07:00
Antonio Russo 5c2552c564 systemd zfs-import.target and documentation
zfs-import-{cache,scan}.service must complete before any mounting of
filesystems can occur. To simplify this dependency, create a target
that is reached After (in the systemd sense) the pool is imported.

Additionally, recommend that legacy zfs mounts use the option

x-systemd.requires=zfs-import.target

to codify this requirement.

Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com>
Closes #6764
2017-10-30 13:18:26 -07:00
abraunegg ca85d69097 Update zfs module parameters man5
Update zfs module parameters man5 with missing parameter details
for multiple tunings.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Alex Braunegg <alex.braunegg@gmail.com>
Closes #6785
2017-10-30 13:15:10 -07:00
James Cowgill 35a44fcb8d Remove all spin_is_locked calls
On systems with CONFIG_SMP turned off, spin_is_locked always returns
false causing these assertions to fail. Remove them as suggested in
zfsonlinux/zfs#6558.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: James Cowgill <james.cowgill@mips.com>
Closes #665
2017-10-30 11:16:56 -07:00
Brian Behlendorf f4ae39a19d Fix status command options in zpool(8)
The 'zpool status' command supports the -P option for printing full
path names.  It does not support the -p parsable option for printing
exact values.
    
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6792 
Closes #6794
2017-10-27 15:52:03 -07:00
Brian Behlendorf 8be3688999 Remove vn_rename and vn_remove
Both vn_rename and vn_remove have been historically problematic
to implement reliably.  Rather than fixing them yet again they
are being removed.

Reviewed-by: Arkadiusz Bubala <arkadiusz.bubala@open-e.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #648 
Closes #661
2017-10-27 15:49:14 -07:00
abraunegg 8fc533725f Update spl module parameters man5 with missing parameter details
Update the spl module parameters man5 page with the missing parameter
details for spl_panic_halt.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alex Braunegg <alex.braunegg@gmail.com>
Closes #664
2017-10-27 15:46:34 -07:00
Brian Behlendorf 867959b588 OpenZFS 8081 - Compiler warnings in zdb
Fix compiler warnings in zdb.  With these changes, FreeBSD can compile
zdb with all compiler warnings enabled save -Wunused-parameter.

usr/src/cmd/zdb/zdb.c
usr/src/cmd/zdb/zdb_il.c
usr/src/uts/common/fs/zfs/sys/sa.h
usr/src/uts/common/fs/zfs/sys/spa.h
	Fix numerous warnings, including:
	* const-correctness
	* shadowing global definitions
	* signed vs unsigned comparisons
	* missing prototypes, or missing static declarations
	* unused variables and functions
	* Unreadable array initializations
	* Missing struct initializers

usr/src/cmd/zdb/zdb.h
	Add a header file to declare common symbols

usr/src/lib/libzpool/common/sys/zfs_context.h
usr/src/uts/common/fs/zfs/arc.c
usr/src/uts/common/fs/zfs/dbuf.c
usr/src/uts/common/fs/zfs/spa.c
usr/src/uts/common/fs/zfs/txg.c
	Add a function prototype for zk_thread_create, and ensure that every
	callback supplied to this function actually matches the prototype.

usr/src/cmd/ztest/ztest.c
usr/src/uts/common/fs/zfs/sys/zil.h
usr/src/uts/common/fs/zfs/zfs_replay.c
usr/src/uts/common/fs/zfs/zvol.c
	Add a function prototype for zil_replay_func_t, and ensure that
	every function of this type actually matches the prototype.

usr/src/uts/common/fs/zfs/sys/refcount.h
	Change FTAG so it discards any constness of __func__, necessary
	since existing APIs expect it passed as void *.

Porting Notes:
- Many of these fixes have already been applied to Linux.  For
  consistency the OpenZFS version of a change was applied if the
  warning was addressed in an equivalent but different fashion.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Authored by: Alan Somers <asomers@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8081
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/843abe1b8a
Closes #6787
2017-10-27 12:46:35 -07:00
Giuseppe Di Natale a94d38c0f3 Correct make mancheck recipe
The current make recipe for mancheck silently ignores errors. Correct
the recipe so errors cause the mancheck recipe to fail.

The zpool reopen command in the zpool.8 manpage had a bullet list
without an .El.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6790
2017-10-27 09:52:18 -07:00
LOLi ee45fbd894 ZFS send fails to dump objects larger than 128PiB
When dumping objects larger than 128PiB it's possible for do_dump() to
miscalculate the FREE_RECORD offset due to an integer overflow
condition: this prevents the receiving end from correctly restoring
the dumped object.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6760
2017-10-26 16:58:38 -07:00
LOLi 88f9c9396b Allow 'zpool events' filtering by pool name
Additionally add four new tests:

 * zpool_events_clear: verify 'zpool events -c' functionality
 * zpool_events_cliargs: verify command line options and arguments
 * zpool_events_follow: verify 'zpool events -f'
 * zpool_events_poolname: verify events filtering by pool name

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #3285 
Closes #6762
2017-10-26 16:49:33 -07:00
Brian Behlendorf a032ac4b38 OpenZFS 8558, 8602 - lwp_create() returns EAGAIN
8558 lwp_create() returns EAGAIN on system with more than 80K ZFS filesystems

On a system with more than 80K ZFS filesystems, we've seen cases
where lwp_create() will start to fail by returning EAGAIN. The
problem is that, for each of those 80K ZFS filesystems, a taskq
is created as part of that dataset's ZIL.

Porting Notes:
- The new nomem taskq kstat was dropped.
- Added module options and documentation for new tunings
  zfs_zil_clean_taskq_nthr_pct, zfs_zil_clean_taskq_minalloc,
  zfs_zil_clean_taskq_maxalloc, and zfs_sync_taskq_batch_pct.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8558
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/216d772

8602 remove unused "dp_early_sync_tasks" field from "dsl_pool" structure

Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Authored by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8602
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/2bcb545
Closes #6779
2017-10-26 12:57:53 -07:00
Arkadiusz Bubała d3f2cd7e3b Added no_scrub_restart flag to zpool reopen
Added -n flag to zpool reopen that allows a running scrub
operation to continue if there is a device with a Dirty Time Log (DTL).

By default, if a component device has a DTL and zpool reopen
is executed, all running scan operations will be restarted.

Added functional tests for `zpool reopen`.

The tests cover the following scenarios:
* `zpool reopen` without arguments,
* `zpool reopen` with pool name as argument,
* `zpool reopen` while scrubbing,
* `zpool reopen -n` while scrubbing,
* `zpool reopen -n` while resilvering,
* `zpool reopen` with bad arguments.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Arkadiusz Bubała <arkadiusz.bubala@open-e.com>
Closes #6076 
Closes #6746
2017-10-26 12:26:09 -07:00
Fabian-Gruenbichler 3ad59c015d arcstat: flush stdout / outfile after each line
Otherwise, if arcstat gets interrupted before the desired number of
iterations is reached, the output file will be empty (both if set via
'-o' or via shell redirection).
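
A minimal sketch of the idea, assuming a helper that is called once per
output line (write_line is a hypothetical name, not a function taken
from arcstat):

```python
import sys


def write_line(line, out=sys.stdout):
    """Write one line of statistics and flush it out of the buffer."""
    out.write(line + "\n")
    # Flushing here means an interrupt after this point cannot lose the line.
    out.flush()
```

Flushing per line costs a little throughput, which is negligible for
output produced once per sampling interval.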

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Closes #6775
2017-10-26 12:18:49 -07:00
Giuseppe Di Natale 69b229bd60 commitcheck: Multiple OpenZFS ports in commit
Allow commitcheck.sh to handle multiple OpenZFS ports in
a single commit. This is useful in the cases when a change
upstream has bug fixes and it makes sense to port them with
the original patch.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6780
2017-10-26 10:23:58 -07:00
Giuseppe Di Natale 8dcaf243d7 Add Coverity defect fix commit checker support
Enable commitcheck.sh to test if a commit message is
in the expected format for a coverity defect fix.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6777
2017-10-26 10:17:00 -07:00
Giuseppe Di Natale 64b8c58e3e Ensure arc_size_break is filled in arc_summary.py
Use mfu_size and mru_size pulled from the arcstats
kstat file to calculate the mfu and mru percentages
for arc size breakdown.
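
A rough sketch of such a calculation, under the assumption that each
percentage is taken relative to the sum of the two sizes. The function
name, the dictionary keys and the result layout are hypothetical and do
not come from arc_summary.py.

```python
def arc_size_breakdown(kstat):
    """Return MFU and MRU sizes as percentages of their combined size."""
    mfu_size = int(kstat["mfu_size"])
    mru_size = int(kstat["mru_size"])
    total = mfu_size + mru_size
    if total == 0:
        # Avoid dividing by zero when both lists are empty.
        return {"mfu_perc": 0.0, "mru_perc": 0.0}
    return {
        "mfu_perc": 100.0 * mfu_size / total,
        "mru_perc": 100.0 * mru_size / total,
    }
```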

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: AndCycle <andcycle@andcycle.idv.tw>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #5526 
Closes #6770
2017-10-23 14:18:12 -07:00
Giuseppe Di Natale 63e5e960ba Correct flake8 errors after STYLE builder update
Fix new flake8 errors related to bare excepts and ambiguous
variable names due to a STYLE builder update.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6776
2017-10-23 14:01:43 -07:00
David Quigley d9daa7abcf ZTS: Add auto-spare tests
The ZED is expected to automatically kick in a hot spare device
when there's one available in the pool and a sufficient number of
read errors have been encountered.  Use zinject to simulate the
failure condition and verify the hot spare is used.

auto_spare_001_pos.ksh: read IO errors, the vdev is FAULTED
auto_spare_002_pos.ksh: read CHECKSUM errors, the vdev is DEGRADED

Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: David Quigley <david.quigley@intel.com>
Closes #6280
2017-10-23 11:42:37 -07:00
adisbladis f8cd871a01 Use ashift=12 by default on SSDSC2BW48 disks
Currently the 480GB models of this disk do not use ashift=12 by
default.  SSDSC2BW48 is also optimized for 4k blocks.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: adisbladis <adis@blad.is>
Closes #6774
2017-10-23 11:00:45 -07:00
Giuseppe Di Natale 70c8a79446 Provide commit message format for Coverity defects
Provide details about the commit message format for Coverity defect
fixes submitted.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6771
2017-10-23 09:47:16 -07:00
Brian Behlendorf d5e024cba2 Emit history events for 'zpool create'
History commands and events were being suppressed for the
'zpool create' command since the history object did not
yet exist.  Create the object earlier so this history
doesn't get lost.

Split the pool_destroy event into pool_destroy and
pool_export so they may be distinguished.

Updated events_001_pos and events_002_pos test cases.  They
now check for the expected history events and were reworked
to be more reliable.

Reviewed-by: Nathaniel Clark <nathaniel.l.clark@intel.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6712 
Closes #6486
2017-10-23 09:45:59 -07:00
wli5 1cfdb0e6e4 Support integration with new QAT products
Support integration with new QAT products: Intel(R) C62x Chipset,
or Atom(R) C3000 Processor Product Family SoC:
1. Detect new file name in auto-conf.
2. Change MAX_INSTANCES to 48.
3. Change "num_inst" to U16 to clean a build warning.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Weigang Li <weigang.li@intel.com>
Closes #6767
2017-10-20 11:11:25 -07:00
John 6044cf59cd Add convenience 'zfs_get' functions
Add get functions to match existing ones.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Ramsden <johnramsden@riseup.net>
Closes #6308
2017-10-19 11:18:42 -07:00
Brian Behlendorf bbf1ad67cd Remove vn_rename and vn_remove dependency
The only place vn_rename and vn_remove are used is when writing
out an updated pool configuration file.  By truncating the file
instead of renaming and removing it we can avoid having to implement
these interfaces entirely.  Functionally an empty cache file is
treated the same as a missing cache file.  This is particularly
advantageous because the Linux kernel has never provided a way
to reliably implement vn_rename and vn_remove.

The cachefile_004_pos.ksh test case was updated to understand
that an empty cache file is the same as a missing one.

The zfs-import-* systemd service files were not updated to use
ConditionFileNotEmpty in place of ConditionPathExists.  This
means that after exporting all pools and rebooting new pools
will not be scanned for on the next boot.  This small change
should not impact normal usage since pools are not exported
as part of a normal shutdown.

Documentation was updated accordingly.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Arkadiusz Bubała <arkadiusz.bubala@open-e.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/spl#648 
Closes #6753
2017-10-19 10:06:55 -07:00
Tom Caputi 35df0bb556 Fix ASSERT in dmu_free_long_object_raw()
This small patch fixes an issue where dmu_free_long_object_raw()
calls dnode_hold() after freeing the dnode a line above.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6766
2017-10-18 10:08:36 -07:00
Brian Behlendorf ca9b8e8797 Update codecov.io behavior
Update the codecov.yml included in the repository to behave as
originally intended.  This can be refined as needed.

* Always post coverage results to the GitHub PR after two builds
  have been uploaded.  This is the normal case since there will
  be a build uploaded for both kernel and user coverage results.

* Adjust red -> yellow -> green coloring in the web interface.
  Due to the number of unlikely error conditions which are hard
  to force consider 90% coverage an excellent level of coverage.

* Allow a 1% variance in coverage between test runs.  This is
  approximately 10x larger than the typical variance observed
  which leaves us a reasonable margin to prevent false positives.

* Always post a new smaller comment to PRs which does not include
  a file list.  Old coverage reports are removed.

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6765
2017-10-18 10:07:02 -07:00
Tobin Harding c721ba435f Fix coverity defects: CID 161388
CID 161388: Resource Leak (RESOURCE_LEAK)

Jump to errout so that file descriptor gets closed before returning
from function.
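
The cleanup pattern described above can be sketched as follows. This is
only an illustration: read_first_byte() is a hypothetical helper, not the
function touched by this commit, and the regular-file check is an assumed
failure point added to show an early exit.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: read the first byte of a regular file.
 * Every failure after open() goes through errout, so the file
 * descriptor is closed on all paths.
 */
static int
read_first_byte(const char *path, char *out)
{
	struct stat st;
	int error = 0;
	int fd;

	assert(path != NULL && out != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return (-1);	/* nothing to clean up yet */

	if (fstat(fd, &st) != 0 || !S_ISREG(st.st_mode)) {
		error = -1;
		goto errout;	/* a plain return here would leak fd */
	}

	if (read(fd, out, 1) != 1)
		error = -1;

errout:
	(void) close(fd);
	return (error);
}
```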

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tobin C. Harding <me@tobin.cc>
Closes #6755
2017-10-17 09:37:50 -07:00
Tobin Harding ced28193b0 Fix coverity defects: 147480, 147584
CID 147480: Logically dead code (DEADCODE)

Remove non-null check and subsequent function call. Add ASSERT to future
proof the code.

usage label is only jumped to before `zhp` is initialized.

CID 147584: Out-of-bounds access (OVERRUN)

Subtract length of current string from buffer length for `size` argument
to `snprintf`.

Starting address for the write is the start of the buffer + the current
string length. We need to subtract this string length else risk a buffer
overflow.
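
A minimal sketch of the size calculation described above. append_str() is
a hypothetical helper used only for illustration; it is not the code
changed by this commit.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: append src to the NUL-terminated string in buf.
 * The write starts at buf + len, so the size handed to snprintf() has to
 * be buflen - len.  Passing buflen would let snprintf() write up to len
 * bytes past the end of the buffer.
 */
static void
append_str(char *buf, size_t buflen, const char *src)
{
	size_t len = strlen(buf);

	assert(len < buflen);
	(void) snprintf(buf + len, buflen - len, "%s", src);
}
```

With an 8-byte buffer that already holds "abc", appending a longer string
is truncated to fit and the result stays NUL-terminated.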

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tobin C. Harding <me@tobin.cc>
Closes #6745
2017-10-16 15:32:48 -07:00
Neal Gompa (ニール・ゴンパ) 7670f721fc Add DKMS package on Debian-based distributions
* config/deb.am: Enable building DKMS packages for Debian
* rpm/generic/zfs-dkms.spec.in: Adjust spec to be Debian-compatible
  * Condition kernel-devel Req to RPM distros
  * Adjust the DKMS Req to have a minimum of a version only
  * Ensure that --rpm_safe_upgrade isn't used on non-RPM distros
* config/deb.am: Drop CONFIG_KERNEL and CONFIG_USER guards
* Makefile.am: Add pkg-dkms target

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Neal Gompa <ngompa@datto.com>
Closes #6044 
Closes #6731
2017-10-15 13:00:44 -07:00
Neal Gompa (ニール・ゴンパ) 28920ea334 Add DKMS package on Debian-based distributions
* config/deb.am: Enable building DKMS packages for Debian
* rpm/generic/spl-dkms.spec.in: Adjust spec to be Debian-compatible
  * Condition kernel-devel Requires to RPM distros
  * Ensure that --rpm_safe_upgrade isn't used on non-RPM distros
* config/deb.am: Drop CONFIG_KERNEL and CONFIG_USER guards
* Makefile.am: Add pkg-dkms target

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Neal Gompa <ngompa@datto.com>
Closes #657
2017-10-15 12:58:12 -07:00
Tobin Harding c616dcf8bc Fix function documentation to correctly mirror code
Currently the function documentation states that two strings are 
allocated, this is outdated. Only one char ** parameter is passed 
into the function now, clearly only a pointer to a single string 
is returned and needs to be free'd.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tobin C. Harding <me@tobin.cc>
Closes #6754
2017-10-13 12:42:04 -07:00
Brian Behlendorf aea899a6fa Increase default zloop.sh vdev size
The default 128M vdev size used by zloop.sh isn't always large
enough and can result in ENOSPC failures which suspend the pool.
Increase the default size to 512M and provide a -s option which
can be used to specify an alternate size.

This does increase the free space requirements to run zloop.sh.
However, since the vdevs are sparse 4x the space is not required.

Reviewed-by: Don Brady <don.brady@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6758
2017-10-13 12:39:39 -07:00
Brian Behlendorf 21a932b83c Post-Encryption Followup
This PR includes fixes for bugs and documentation issues found 
after the encryption patch was merged and general code improvements 
for long-term maintainability.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue #6526
Closes #6639
Closes #6703
Closes #6706
Closes #6714
Closes #6595
2017-10-13 10:02:39 -07:00
Damian Wojsław cdc15a7604 Typo in dsl_dataset.h
The parameters dsl_dataset_t *os in function prototype should be
renamed to dsl_dataset_t *ds.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Damian Wojsław <damian@wojslaw.pl>
Closes #6756 
Closes #6273
2017-10-12 17:10:38 -07:00
Brian Behlendorf e0922b0421 Fixes for SPARC support
The current code base almost compiles on SPARC, but a few fixes are
required for the code to compile (and work efficiently). The code in this
PR comes from the OpenZFS project and was initially dropped when porting
the crypto framework.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pengcheng Xu <i@jsteward.moe>
Closes #6733 
Closes #6738 
Closes #6750
2017-10-12 09:51:56 -07:00
Antonio Russo 085b501fb8 Explicitly depend on icp module in initramfs hook
Automatic dependency resolution is unreliable on many systems.
Follow suit with existing code, and explicitly include icp
in module dependencies.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Antonio Russo <antonio.e.russo@gmail.com>
Closes #6751
2017-10-12 09:39:45 -07:00
Tom Caputi 9bae371ce6 Fix for #6714
This 2 line patch fixes a possible integer overflow reported by grsec.

Signed-off-by: Tom Caputi <tcaputi@datto.com>
2017-10-11 16:59:42 -04:00
Tom Caputi 2637dda8f8 Fix for #6706
This patch resolves an issue where raw sends would fail to send
encryption parameters if the wrapping key was unloaded and reloaded
before the data was sent and the dataset was not an encryption root.
The code attempted to lookup the values from the wrapping key which
was not being initialized upon reload. This change forces the code to
lookup the correct value from the encryption root's DSL Crypto Key.
Unfortunately, this issue led to the on-disk DSL Crypto Key for some
non-encryption root datasets being left with zeroed out encryption
parameters. However, this should not present a problem since these
values are never looked at and are overwritten upon changing keys.

This patch also fixes an issue where raw, resumable sends were not
being cleaned up appropriately if an invalid DSL Crypto Key was
received.

Signed-off-by: Tom Caputi <tcaputi@datto.com>
2017-10-11 16:58:39 -04:00
Tom Caputi b135b9f11a Fix for #6703
This patch resolves an issue where spa_keystore_change_key_sync_impl()
incorrectly recursed into clone DSL Directories while recursively
rewrapping encryption keys. Clones share keys with their origins, so
this logic was incorrect.

Signed-off-by: Tom Caputi <tcaputi@datto.com>
2017-10-11 16:57:22 -04:00
Tom Caputi 440a3eb939 Fixes for #6639
Several issues were uncovered by running stress tests with zfs
encryption and raw sends in particular. The issues and their
associated fixes are as follows:

* arc_read_done() has the ability to chain several requests for
  the same block of data via the arc_callback_t struct. In these
  cases, the ARC would only use the first request's dsobj from
  the bookmark to decrypt the data. This is problematic because
  the first request might be a prefetch zio which is able to
  handle the key not being loaded, while the second might use a
  different key that it is sure will work. The fix here is to
  pass the dsobj with each individual arc_callback_t so that each
  request can attempt to decrypt the data separately.

* DRR_FREE and DRR_FREEOBJECT records in a send file were not
  having their transactions properly tagged as raw during raw
  sends, which caused a panic when the dbuf code attempted to
  decrypt these blocks.

* traverse_prefetch_metadata() did not properly set
  ZIO_FLAG_SPECULATIVE when issuing prefetch IOs.

* Added a few asserts and code cleanups to ensure these issues
  are more detectable in the future.

Signed-off-by: Tom Caputi <tcaputi@datto.com>
2017-10-11 16:55:50 -04:00
Tom Caputi 4807c0badb Encryption patch follow-up
* PBKDF2 implementation changed to OpenSSL implementation.

* HKDF implementation moved to its own file and tests
  added to ensure correctness.

* Removed libzfs's now unnecessary dependency on libzpool
  and libicp.

* Ztest can now create and test encrypted datasets. This is
  currently disabled until issue #6526 is resolved, but
  otherwise functions as advertised.

* Several small bug fixes discovered after enabling ztest
  to run on encrypted datasets.

* Fixed coverity defects added by the encryption patch.

* Updated man pages for encrypted send / receive behavior.

* Fixed a bug where encrypted datasets could receive
  DRR_WRITE_EMBEDDED records.

* Minor code cleanups / consolidation.

Signed-off-by: Tom Caputi <tcaputi@datto.com>
2017-10-11 16:54:48 -04:00
Tom Caputi 94d49e8f9b Relax ASSERT for #6526
This patch resolves a minor issue where an ASSERT in
metaslab_passivate() that only applies to non weight-based
metaslabs was erroneously applied to all metaslabs.

Signed-off-by: Tom Caputi <tcaputi@datto.com>
2017-10-11 16:53:37 -04:00
KireinaHoro d9ee0e2621 Remove useless DEFAULT_INCLUDES in AM_CCASFLAGS
CPPASCOMPILE and LTCPPASCOMPILE all include DEFAULT_INCLUDES,
hence it's unnecessary to add the includes again.

Signed-off-by: Pengcheng Xu <i@jsteward.moe>
2017-10-12 01:42:05 +08:00
KireinaHoro e102b1b515 Fix libspl assembler flags to respect cpu type
It's important to respect the user's CFLAGS as mismatched -mcpu
will directly result in the assembler being unable to produce correct
code. Fixes #6733.

Signed-off-by: Pengcheng Xu <i@jsteward.moe>
2017-10-12 01:36:16 +08:00
KireinaHoro a7ec8c47e2 SPARC optimizations for Encode()
Normally a SPARC processor runs in big endian mode. Save the extra labor
needed for little endian machines when the target is a big endian one
(sparc).

Signed-off-by: Pengcheng Xu <i@jsteward.moe>
2017-10-12 01:36:16 +08:00
KireinaHoro 46d4fe880e SPARC optimizations for SHA1Transform()
Passing arguments explicitly into SHA1Transform() increases the number of
registers available to the compiler, hence leaving more local and out registers
available. The missing symbol of sha1_consts[], which prevents compiling on
SPARC, is added back, which speeds up the process of utilizing the relative
constants.
This should fix #6738.

Signed-off-by: Pengcheng Xu <i@jsteward.moe>
2017-10-12 01:36:11 +08:00
aun d4404c3fdb Fix boot from ZFS issues
* Correct ZFS snapshot listing
* Disable "lvm is not available" message on quiet boot

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alar Aun <spamtoaun@gmail.com>
Closes #6700 
Closes #6747
2017-10-11 10:06:20 -07:00
Brian Behlendorf 29e07af5ae Fix chattr/cleanup failure
The chattr cleanup step may fail to delete the user if there is still
an active process running as that user.  Retry the userdel when this
occurs to eliminate spurious false positives.

  ERROR: userdel quser1 exited 8
  userdel: user quser1 is currently used by process 26814

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6749
2017-10-11 09:15:44 -07:00
Tobin Harding 523d5ce0f4 Fix coverity defects: CID 147474
CID 147474: Logically dead code (DEADCODE)

Remove ternary operator and return `error` directly.

Currently return value is derived from a ternary operator. The
conditional is always true. The ternary operator is therefore
redundant i.e dead code.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tobin C. Harding <me@tobin.cc>
Closes #6723
2017-10-10 16:41:47 -07:00
Fabian Grünbichler 829e95c4dc Skip FREEOBJECTS for objects which can't exist
When sending an incremental stream based on a snapshot, the receiving
side must have the same base snapshot.  Thus we do not need to send
FREEOBJECTS records for any objects past the maximum one which exists
locally.

This allows us to send incremental streams (again) to older ZFS
implementations (e.g. ZoL < 0.7) which actually try to free all objects
in a FREEOBJECTS record, instead of bailing out early.

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Closes #5699
Closes #6507
Closes #6616
2017-10-10 15:35:49 -07:00
Fabian Grünbichler 48fbb9ddbf Free objects when receiving full stream as clone
All objects after the last written or freed object are not supposed to
exist after receiving the stream.  Free them accordingly, as if a
freeobjects record for them had been included in the stream.

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Closes #5699
Closes #6507
Closes #6616
2017-10-10 15:30:51 -07:00
LOLi aee1dd4d98 Fix intra-pool resumable 'zfs send -t <token>'
Because resuming from a token requires "guid" -> "snapshot" mapping
we have to walk the whole dataset hierarchy to find the right snapshot
to send; when both source and destination exist, for an incremental
resumable stream, libzfs gets confused and picks up the wrong snapshot
to send from: this results in attempting to send

   "destination@snap1 -> source@snap2"

instead of

   "source@snap1 -> source@snap2"

which fails with an "Invalid cross-device link" error (EXDEV).

Fix this by adjusting the logic behind dataset traversal in
zfs_iter_children() to pick the right snapshot to send from.

Additionally update dry-run 'zfs send -t' to print its output to
stderr: this is consistent with other dry-run commands.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6618
Closes #6619
Closes #6623
2017-10-10 15:22:05 -07:00
Brian Behlendorf 70f02287f8 Fix ARC behavior on 32-bit systems
With the addition of the ABD changes consumption of the virtual
address space has been greatly reduced.  This exposed an issue on
CONFIG_HIGHMEM systems where free memory was being calculated
incorrectly.  Functionally this didn't cause any major problems
prior to ABD because a lack of available virtual address space
was used as an indicator of low memory.

This patch makes the following changes to address the issue and
in the process realigns the code further with OpenZFS.  There
are no substantive changes in behavior for 64-bit systems.

* Added CONFIG_HIGHMEM case to the arc_all_memory() and
  arc_free_memory() functions to only consider low memory pages
  on CONFIG_HIGHMEM systems.

* The arc_free_memory() function was updated to return bytes
  instead of pages to be consistent with the other helper
  functions.  In user space we make up some reasonable values
  since currently only testing is performed in this context.

* Adds three new values to the arcstats kstat to provide visibility
  in to the ARC's assessment of the memory situation:
  memory_all_bytes, memory_free_bytes, and memory_available_bytes.

* Added kmem_reap() call to arc_available_memory() for 32-bit
  builds to realign code with OpenZFS.

* Reduced size of test file in /async_destroy_001_pos.ksh to
  speed up test case.  Multiple txgs are still required.

* Move vdevs used by zpool_clear_001_pos and zpool_upgrade_002_pos
  to TEST_BASE_DIR location to speed up test cases.

Reviewed-by: David Quigley <david.quigley@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #5352
Closes #6734
2017-10-10 15:19:19 -07:00
Brian Behlendorf 0cefc9dbcd Add parenthesis to btop and ptob macros
Add missing parentheses around btop and ptob macros to ensure
operation ordering is preserved after expansion.
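
The kind of ordering problem that parentheses prevent can be sketched as
follows. The macro names and PAGE_SHIFT_SKETCH (assumed to be 12) are
hypothetical; they are not the SPL definitions changed by this commit.

```c
#include <assert.h>

#define	PAGE_SHIFT_SKETCH	12	/* assumed 4K pages */

/* Unparenthesized: the argument's operators mix with '>>'. */
#define	BTOP_UNSAFE(bytes)	bytes >> PAGE_SHIFT_SKETCH
/* Parenthesized argument and result: expansion keeps the intent. */
#define	BTOP_SAFE(bytes)	((bytes) >> PAGE_SHIFT_SKETCH)

static unsigned long
pages_unsafe(unsigned long bytes, unsigned long mask)
{
	/* Expands to: bytes & (mask >> 12), since '>>' binds tighter. */
	return (BTOP_UNSAFE(bytes & mask));
}

static unsigned long
pages_safe(unsigned long bytes, unsigned long mask)
{
	/* Expands to: ((bytes & mask) >> 12). */
	return (BTOP_SAFE(bytes & mask));
}

static void
btop_demo(void)
{
	assert(pages_safe(0x3000UL, 0xF000UL) == 3);
	assert(pages_unsafe(0x3000UL, 0xF000UL) == 0);
}
```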

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #660
2017-10-10 08:59:17 -07:00
privb0x23 4f23c5d0c4 Fix inclusion of libgcc_s.so on Void
On Void Linux (x86_64 musl) libgcc_s.so is located in "/usr/lib"
so it is not found by dracut and it produces an error.

Add a simple additional path check for "/usr/lib/libgcc_s.so*"
and install it in the initramfs.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: privb0x23 <privb0x23@users.noreply.github.com>
Closes #6715
2017-10-09 14:34:26 -07:00
Olaf Faaland ce319db57b Make include/linux/ conform to ZFS style standard
No semantic changes.

Fix the following types of style issues:
	blank after preprocessor #
	#define followed by space instead of tab
	improper first line of block comment
	indent by spaces instead of tabs
	last line in file is blank
	missing blank after open comment
	missing space before left brace
	non-continuation indented 4 spaces
	spaces instead of tabs
	unparenthesized return expression

Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
2017-10-09 14:27:27 -07:00
Olaf Faaland 4b393c50ae Make file headers conform to ZFS style standard
No semantic changes.

Change
 /************\
and
 \************/

to

 /*
and
  */

Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
2017-10-09 14:27:27 -07:00
Brian Behlendorf 57f4ef2e81 Fix abdstats kstat on 32-bit systems
When decrementing the struct_size and scatter_chunk_waste kstats
the value needs to be cast to an int on 32-bit systems.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6721
2017-10-06 11:23:12 -07:00
Tobin Harding a0430cc5a9 Use bitwise '&' instead of logical '&&'
Make two instances of the same change. Change bitwise AND (&) to logical
AND (&&).

Currently the code uses a bitwise AND between two boolean values.

In the first instance;

The first operand is a flag that has been bitwise combined with a bit
mask to get a boolean value as to whether a file has group write
permissions set.

The second operand used is a struct member that is intended as a
boolean flag not a bit mask.

In the second instance the argument is the same except with world write
permissions instead of group write (S_IWOTH, S_IWGRP).
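
Why the two operators differ here can be sketched as follows. The function
names and the flag parameter are hypothetical stand-ins for the struct
member mentioned above, not the code changed by this commit.

```c
#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * (mode & S_IWGRP) is either 0 or S_IWGRP (octal 020), while flag is
 * 0 or 1.  A bitwise AND of 020 and 1 is 0, so the bitwise form never
 * reports "both set".
 */
static int
group_write_bitwise(mode_t mode, int flag)
{
	assert(flag == 0 || flag == 1);
	return (((mode & S_IWGRP) & flag) != 0);
}

/* The logical form only asks whether each operand is nonzero. */
static int
group_write_logical(mode_t mode, int flag)
{
	assert(flag == 0 || flag == 1);
	return ((mode & S_IWGRP) && flag);
}
```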

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Tobin C. Harding <me@tobin.cc>
Closes #6684 
Closes #6722
2017-10-05 19:38:55 -07:00
Tobin Harding d95a59805f Remove unnecessary equality check
Currently `if` statement includes an assignment (from a function return
value) and an equality check. The parentheses are in the incorrect place,
currently the code clobbers the function return value because of this.

We can fix this by simplifying the `if` statement.

`if (foo != 0)`

can be more succinctly expressed as

`if (foo)`

Remove the equality check, add parentheses to correct the statement.
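
One plausible shape of the problem, as a sketch only. fail_with() is a
hypothetical callee that always returns the given nonzero error code; the
actual statement fixed by this commit is not shown in this message.

```c
#include <assert.h>

/* Hypothetical callee: always fails with the given error code. */
static int
fail_with(int code)
{
	assert(code != 0);
	return (code);
}

/*
 * Misplaced parentheses: '!=' binds tighter than '=', so error is
 * assigned the result of the comparison (1), not the return value.
 */
static int
misplaced(int code)
{
	int error;

	if ((error = fail_with(code) != 0))
		return (error);
	return (0);
}

/* Simplified form: error keeps the function's return value. */
static int
corrected(int code)
{
	int error;

	if ((error = fail_with(code)))
		return (error);
	return (0);
}
```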

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Tobin C. Harding <me@tobin.cc>
Closes #6685 
Close #6719
2017-10-05 19:33:44 -07:00
Isaac Huang eea2e24132 Use linear abd in vdev_copy_uberblocks()
The vdev_copy_uberblocks() function should use abd_alloc_linear() to
allocate ub_abd, because abd_to_buf(ub_abd) is used later.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Isaac Huang <he.huang@intel.com>
Closes #6718 
Closes #6713
2017-10-05 19:30:02 -07:00
Brian Behlendorf c11f1004d1 Remove dead code from AVL tree
The avl_update_* functions are never used by ZFS and are therefore
being removed.  They're barely even used in Illumos.  Additionally,
simplify avl_add() by using a VERIFY which produces exactly the same
behavior under Linux.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6716
2017-10-05 19:28:00 -07:00
Ned Bass 39f56627ae receive_freeobjects() skips freeing some objects
When receiving a FREEOBJECTS record, receive_freeobjects()
incorrectly skips a freed object in some cases. Specifically, this
happens when the first object in the range to be freed doesn't exist,
but the second object does. This leaves an object allocated on disk
on the receiving side which is unallocated on the sending side, which
may cause receiving subsequent incremental streams to fail.

The bug was caused by an incorrect increment of the object index
variable when the current object being freed doesn't exist.  The
increment is incorrect because incrementing the object index is
handled by a call to dmu_object_next() in the increment portion of
the for loop statement.
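
The double increment can be reproduced in a small model. This is a sketch
under stated assumptions: objects are modelled as an array of flags, and
next_allocated() is a hypothetical stand-in for dmu_object_next() that
advances to the next allocated object strictly after the current one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define	NOBJ	8

/* Hypothetical stand-in for dmu_object_next(). */
static int
next_allocated(const bool *alloc, size_t *obj)
{
	for (size_t i = *obj + 1; i < NOBJ; i++) {
		if (alloc[i]) {
			*obj = i;
			return (0);
		}
	}
	return (-1);
}

/* Free allocated objects in [first, first + num); return the count. */
static int
free_range(bool *alloc, size_t first, size_t num, bool extra_increment)
{
	int freed = 0;
	int next_err = 0;
	size_t obj;

	assert(first + num <= NOBJ);
	for (obj = first; obj < first + num && next_err == 0;
	    next_err = next_allocated(alloc, &obj)) {
		if (!alloc[obj]) {
			if (extra_increment)
				obj++;	/* the bug: loop also advances */
			continue;
		}
		alloc[obj] = false;
		freed++;
	}
	return (freed);
}
```

With object 1 unallocated and objects 2 and 3 allocated, the extra
increment moves to object 2 and the loop's own step then jumps to
object 3, so object 2 is never freed.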

Add test case that exposes this bug.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #6694 
Closes #6695
2017-10-02 15:36:04 -07:00
Alek P 01ff0d7540 Update the default for zfs_txg_history
It's often useful to have access to txg history for debugging
purposes. This patch changes the default from 0 to 100 TXGs
worth of history preserved.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #6691
2017-09-29 15:58:52 -07:00
chrisrd e71cade67d Scale the dbuf cache with arc_c
Commit d3c2ae1 introduced a dbuf cache with a default size of the
minimum of 100M or 1/32 maximum ARC size. (These figures may be adjusted
using dbuf_cache_max_bytes and dbuf_cache_max_shift.) The dbuf cache
is counted as metadata for the purposes of ARC size calculations.

On a 1GB box the ARC maximum size defaults to c_max 493M which gives a
dbuf cache default minimum size of 15.4M, and the ARC metadata defaults
to minimum 16M. I.e. the dbuf cache is a significant proportion of the
minimum metadata size. With other overheads involved this actually means
the ARC metadata doesn't get down to the minimum.

This patch dynamically scales the dbuf cache to the target ARC size
instead of statically scaling it to the maximum ARC size. (The scale is
still set by dbuf_cache_max_shift and the maximum size is still fixed by
dbuf_cache_max_bytes.) Using the target ARC size rather than the current
ARC size is done to help the ARC reach the target rather than simply
focusing on the current size.
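
The sizing rule can be sketched as a single function. The function and
parameter names are hypothetical; only the rule itself (the smaller of a
fixed cap and the ARC target shifted right) comes from the text above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Sketch: dbuf cache target = min(max_bytes, arc_target >> shift).
 * arc_target models the ARC target size, max_bytes models
 * dbuf_cache_max_bytes and shift models dbuf_cache_max_shift.
 */
static uint64_t
dbuf_cache_target_sketch(uint64_t arc_target, uint64_t max_bytes,
    unsigned int shift)
{
	uint64_t scaled;

	assert(shift < 64);
	scaled = arc_target >> shift;
	return (scaled < max_bytes ? scaled : max_bytes);
}
```

For a 493M target with a shift of 5 this gives 16154624 bytes, which is
the roughly 15.4M figure quoted above; a smaller target gives a
proportionally smaller cache.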

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Issue #6506 
Closes #6561
2017-09-29 15:49:19 -07:00
LOLi b59b22972d Add 'zfs diff' coverage to the ZFS Test Suite
This change adds four new tests to the ZTS:

 * zfs_diff_changes: verify type of changes displayed (-, +, R and M)
 * zfs_diff_cliargs: verify command line options and arguments
 * zfs_diff_timestamp: verify 'zfs diff -t'
 * zfs_diff_types: verify type of objects (files, dirs, pipes...)

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Wren Kennedy <john.kennedy@delphix.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6686
2017-09-28 13:04:14 -07:00
Simon Guest 269db7a4b3 vdev_id: extension for new scsi topology
On systems with SCSI rather than SAS disk topology, this change enables
the vdev_id script to match against the block device path, and therefore
create a vdev alias in /dev/disk/by-vdev.

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Simon Guest <simon.guest@tesujimath.org>
Closes #6592
2017-09-27 10:39:47 -07:00
Prakash Surya 0c484ab567 Run ztest for longer on "Coverage" builders
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Closes #6675
2017-09-26 12:29:32 -07:00
DeHackEd 7e98073379 Fix printk() calls missing log level
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes #6672
2017-09-25 10:38:27 -07:00
LOLi 3fd3e56cfd Fix some ZFS Test Suite issues
* Add 'zfs bookmark' coverage (zfs_bookmark_cliargs)

 * Add OpenZFS 8166 coverage (zpool_scrub_offline_device)

 * Fix "busy" zfs_mount_remount failures

 * Fix bootfs_003_pos, bootfs_004_neg, zdb_005_pos local cleanup

 * Update usage of $KEEP variable, add get_all_pools() function

 * Enable history_008_pos and rsend_019_pos (non-32bit builders)

 * Enable zfs_copies_005_neg, update local cleanup

 * Fix zfs_send_007_pos (large_dnode + OpenZFS 8199)

 * Fix rollback_003_pos (use dataset name, not mountpoint, to unmount)

 * Update default_raidz_setup() to work properly with more than 3 disks

 * Use $TEST_BASE_DIR instead of hardcoded (/var)/tmp for file VDEVs

 * Update usage of /dev/random to /dev/urandom

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Issue #6086 
Closes #5658 
Closes #6143 
Closes #6421 
Closes #6627 
Closes #6632
2017-09-25 10:32:34 -07:00
Richard Elling e8474f9ad3 Pool io stat shows wlentime instead of rlentime
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com>
Closes #652 
Closes #651
2017-09-25 10:02:24 -07:00
Olaf Faaland b33d668ddb Fix ZTS MMP tests and ztest -M behavior
Quote "$MMP_IMPORT_MSG" when it is passed as an argument, as it is a
multi-word string.  Some tests were passing when they should not have,
because the grep was only testing for the first word.

Correct the message expected when no hostid is set and the test attempts
to enable multihost.  It did not match the actual output in that
situation.

Disable ztest_reguid() when ztest is invoked with the -M option.  If
ztest performs a reguid, a concurrent import attempt may fail with the
error "one or more devices is currently unavailable" if the guid sum is
calculated on the original device guids but compared against the guid
sum ztest wrote based on the new device guids.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #6666
2017-09-23 09:28:18 -07:00
Brian Behlendorf 7a6acb31b7 Fix "--enable-code-coverage" debug build
When --enable-code-coverage is provided it should not result
in NDEBUG being defined.  This is controlled by --enable-debug.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6674
2017-09-22 22:16:18 -07:00
Brian Behlendorf bb2773b358 Update codecov.yml
Update the codecov.yml to make the following functional changes.

* Do not require the CI testing to pass before posting results.
* Set red-yellow-green coverage percent from 50%-100%
* Allow a 1% drop in coverage to still be considered a pass.
* Reduce the size of the comment posted to the issue.

Additionally, the top level README.markdown has been updated
to include the codecov.io badge and the project summary reworded.

Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6669
2017-09-22 18:54:34 -07:00
Prakash Surya acf044420b Add support for "--enable-code-coverage" option
This change adds support for a new option that can be passed to the
configure script: "--enable-code-coverage". Further, the "--enable-gcov"
option has been removed, as this new option provides the same
functionality (plus more).

When using this new option the following make targets are available:

 * check-code-coverage
 * code-coverage-capture
 * code-coverage-clean

Note: these make targets can only be run from the root of the project.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Closes #6670
2017-09-22 18:49:57 -07:00
Olaf Faaland d410c6d9fd Reimplement vdev_random_leaf and rename it
Rename it as mmp_random_leaf() since it is defined in mmp.c.

The earlier implementation could end up spinning forever if a pool had a
vdev marked writeable, none of whose children were writeable.  It also
did not guarantee that if a writeable leaf vdev existed, it would be
found.

Reimplement to recursively walk the device tree to select the leaf.  It
searches the entire tree, so that a return value of (NULL) indicates
there were no usable leaves in the pool; all were either not writeable
or had pending mmp writes.

It still chooses the starting child randomly at each level of the tree,
so if the pool's devices are healthy, the mmp writes go to random leaves
with an even distribution.  This was verified by testing using
zfs_multihost_history enabled.
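
The recursive selection described above can be sketched as follows. This is a minimal Python illustration, not the C code in mmp.c: the Vdev class, its field names, and random_leaf() are hypothetical stand-ins chosen for the example.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Vdev:
    """Hypothetical stand-in for a vdev; not the real vdev_t layout."""
    name: str
    writeable: bool = True
    pending_mmp_write: bool = False
    children: List["Vdev"] = field(default_factory=list)


def random_leaf(vd: Vdev, rng: random.Random) -> Optional[Vdev]:
    """Return a usable leaf under vd, or None if no usable leaf exists."""
    if not vd.writeable:
        return None
    if not vd.children:
        # A leaf is usable only if it has no mmp write outstanding.
        return None if vd.pending_mmp_write else vd
    # Start at a random child, then visit every sibling exactly once,
    # so the search terminates and covers the whole subtree.
    start = rng.randrange(len(vd.children))
    for i in range(len(vd.children)):
        child = vd.children[(start + i) % len(vd.children)]
        leaf = random_leaf(child, rng)
        if leaf is not None:
            return leaf
    return None
```

Because every child is visited at most once per level, a writeable interior vdev whose children are all unwriteable simply yields None instead of causing a spin.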

Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #6631 
Closes #6665
2017-09-22 14:29:26 -07:00
Don Brady 5df5d06a8d Cleanup zloop working directory after each pass
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed by: John Kennedy <jwk404@gmail.com>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Don Brady <don.brady@delphix.com>
Issue #6595 
Closes #6663
2017-09-21 10:17:56 -07:00
Brian Behlendorf 4ce3c45a5e Increase default arc_c_min
Increase the default arc_c_min value to whichever is larger,
either 32M or 1/32 of total system memory.  This is advantageous for
systems with more than 1G of memory where performance issues may
occur when the ARC is allowed to collapse below a minimum size.
At the same time we want to use the bare minimum value which is
still functional so the filesystem can be used in very low memory
environments.
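
As a rough illustration of the rule stated above, the default can be computed as below. The function name and the use of plain byte counts are assumptions made for the sketch; the real logic lives in the ARC initialization code.

```python
MIB = 1 << 20


def default_arc_c_min(total_memory_bytes: int) -> int:
    """Larger of 32 MiB and 1/32 of total system memory (illustrative only)."""
    return max(32 * MIB, total_memory_bytes // 32)
```

With 1 GiB of memory both terms equal 32 MiB, which is why the change only affects systems with more than 1G.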

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6659
2017-09-20 09:36:17 -07:00
Brian Behlendorf 848259c10f Export symbol dmu_tx_mark_netfree()
This symbol is needed by Lustre for the same reason it was needed
by the ZPL.  It should have been exported when the original patch
was merged.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6660
2017-09-20 09:30:24 -07:00
Feng Sun 18a2485fc8 misc: fix meaningless values
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Feng Sun <loyou85@gmail.com>
Closes #6658
2017-09-19 12:19:08 -07:00
Giuseppe Di Natale 34d00e7aba Correct cppcheck errors
ZFS buildbot STYLE builder was moved to Ubuntu 17.04
which has a newer version of cppcheck. Handle the
new cppcheck errors.

uu_* functions removed in this commit were unused
and effectively dead code. They are now retired.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6653
2017-09-19 12:17:29 -07:00
Brian Behlendorf 8e2dddab42 ZTS fix slog_replay_volume.ksh failure
The slog_replay_volume.ksh test case will fail when the pool is
layered on files in a filesystem which does not support discard.
Avoid this issue by creating the pool using DISKS, which will
be either loopback devices or real disks.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6654
2017-09-19 10:09:37 -07:00
David Quigley a9a2bf7152 Remove FRU and LIBTOPO Support
FRU and LIBTOPO support are illumos only features that will not be ported to
Linux and make the code more complicated than necessary. This commit
makes way for further cleanups of the zed/FMA code.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: David Quigley <david.quigley@intel.com>
Closes #6641
2017-09-18 17:06:40 -07:00
Giuseppe Di Natale ea49beba66 Correct shellcheck errors
The ZFS buildbot moved to using Ubuntu 17.04 for the
STYLE builder which has a newer version of shellcheck.
Correct the new issues it discovers.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6647
2017-09-18 14:23:09 -07:00
Brian Behlendorf a35b4cc8cc ZTS fix events_002_pos.sh failure
Fix spurious events_002_pos failures by waiting longer before
grabbing the log to check for the resilver_finish event.  It
would be better to rework this logic to wait only as long as
needed rather than a fixed timeout.
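
The suggested rework, waiting only as long as needed, usually takes the form of a bounded polling loop. The following Python sketch shows the idea; wait_for() and its parameters are hypothetical and are not part of the ZFS Test Suite, which is written in ksh.

```python
import time
from typing import Callable


def wait_for(predicate: Callable[[], bool],
             timeout_s: float,
             interval_s: float = 0.1,
             clock: Callable[[], float] = time.monotonic,
             sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll predicate until it is true or timeout_s has elapsed."""
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True          # condition met, stop waiting early
        if clock() >= deadline:
            return False         # give up after the upper bound
        sleep(interval_s)
```

A test would pass a predicate that checks the event log for resilver_finish, so the common case returns quickly while slow systems still get the full timeout.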

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6651
2017-09-16 19:36:44 -07:00
Giuseppe Di Natale 787acae0b5 Linux 3.14 compat: IO acct, global_page_state, etc
generic_start_io_acct/generic_end_io_acct in the master
branch of the linux kernel requires that the request_queue
be provided.

Move the logic from freemem in the spl to arc_free_memory
in arc.c. Do this so we can take advantage of global_page_state
interface checks in zfs.

Upstream kernel replaced struct block_device with
struct gendisk in struct bio. Determine if the
function bio_set_dev exists during configure
and have zfs use that if it exists.

bio_set_dev https://github.com/torvalds/linux/commit/74d4699
global_node_page_state https://github.com/torvalds/linux/commit/75ef718
io acct https://github.com/torvalds/linux/commit/d62e26b

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6635
2017-09-16 11:00:19 -07:00
LOLi 90cdf2833d Add mdoc style checker
Add a new make 'mancheck' target which uses mandoc -Tlint to verify
manpage files: currently only zfs(8), zpool(8), zdb(8) and zgenhostid(8)
are supported.

Additionally fix some outstanding manpage formatting issues.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6646
2017-09-16 10:51:24 -07:00
David Quigley 1f4e2c88fd ZTEST: Always enable asserts
The build for ztest always enables debug information but does not enable
asserts unless --enable-debug is used. This change always enables asserts
in the ztest code.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: David Quigley <david.quigley@intel.com>
Closes #6640
2017-09-15 13:26:05 -07:00
George Melikov 7c9abcf887 OpenZFS 8435 - zpool.1m and zfs.1m: minor cleanup
3796 listsnapshots not documented in zpool man page

Authored by: George Melikov <mail@gmelikov.ru>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Yuri Pankov <yuripv@gmx.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>

OpenZFS-issue: https://www.illumos.org/issues/8435
OpenZFS-commit: openzfs/openzfs@a058d1c

Porting notes: OpenZFS review applied,
some ZoL changes were reverted.
See https://github.com/openzfs/openzfs/pull/415
2017-09-15 13:13:52 -07:00
Prakash Surya 6384cf4132 Make "-fno-inline" compile option more accessible
When functions are inlined, it can make the system much more difficult
to instrument using tools such as ftrace, BPF, crash, etc. Thus, to aid
development and increase the system's observability, when the
"--enable-debuginfo" flag is specified, the "-fno-inline" compilation
option will be used for both userspace and kernel modules.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Closes #6605
2017-09-15 11:47:11 -07:00
Brian Behlendorf d9ec8b9b2a Add configure option to enable gcov analysis
* Add configure option to enable gcov analysis.
* Includes a few minor ctime fixes.
* Add codecov.yml configuration.

Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6642
2017-09-15 10:24:13 -07:00
Gaurav Kumar 0107f69898 Modifying XATTRs doesn't change the ctime
Changing any metadata should modify the ctime.

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: gaurkuma <gauravk.18@gmail.com>
Closes #3644 
Closes #6586
2017-09-13 12:20:07 -07:00
David Quigley b1490dd43e Fix bug in distclean which removes needed files
Running distclean removes the following files because of an error
in Makefile.am

deleted:    tests/zfs-tests/include/commands.cfg
deleted:    tests/zfs-tests/include/libtest.shlib
deleted:    tests/zfs-tests/include/math.shlib
deleted:    tests/zfs-tests/include/properties.shlib
deleted:    tests/zfs-tests/include/zpool_script.shlib

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: David Quigley <david.quigley@intel.com>
Closes #6636
2017-09-13 11:45:04 -07:00
LOLi ded8f06a3c Relax (ref)reservation constraints on ZVOLs
This change allows (ref)reservation to be set larger than the current
ZVOL size: this is safe as we normally set refreservation > volsize
at ZVOL creation time when we account for metadata.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #2468 
Closes #6610
2017-09-12 11:33:22 -07:00
Arkadiusz Bubała d9549cba96 Fix false config_cache_write events
On pool import, when the old cache file is removed,
the ereport.fs.zfs.config_cache_write event is generated.
Because zpool export always removes the cache file, this happens
on every export/import sequence.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Arkadiusz Bubała <arkadiusz.bubala@open-e.com>
Closes #6617
2017-09-11 10:25:01 -07:00
LOLi 835db58592 Add -vnP support to 'zfs send' for bookmarks
This leverages the functionality introduced in cf7684b to expose
verbose, dry-run and parsable 'zfs send' options for bookmarks.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #3666 
Closes #6601
2017-09-08 15:24:31 -07:00
Mike Swanson 57858fb5ca Recommend compression=on in zfs(8) dedup section
compression=lz4 depends on the lz4 feature being enabled, while
compression=on will let ZFS use either lzjb or lz4 where appropriate.
It also allows the documentation to not go out of date if/when ZFS
picks a new default in the future.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mike Swanson <mikeonthecomputer@gmail.com>
Closes #6614
2017-09-08 15:21:58 -07:00
Brian Behlendorf 5c214ae318 Fix volume WR_INDIRECT log replay
The portion of the zvol_replay_write() handler responsible for
replaying indirect log records for some reason never existed.
As a result indirect log records were not being correctly replayed.

This went largely unnoticed since the majority of zvol log records
were of the type WR_COPIED or WR_NEED_COPY prior to OpenZFS 7578.

This patch updates zvol_replay_write() to correctly handle these
log records and adds a new test case which verifies volume replay
to prevent any regression.  The existing test case which verified
replay on filesystem was renamed slog_replay_fs.ksh for clarity.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6603 
Closes #6615
2017-09-08 15:07:00 -07:00
Brian Behlendorf e0dd0a32a8 Revert "Handle new dnode size in incremental..."
This reverts commit 65dcb0f67a until
a comprehensive fix is finalized.  The stricter interior dnode
detection in 4c5b89f59e and the new
test case added by this patch revealed a issue with resizing
dnodes when receiving an incremental backup stream.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #6576
2017-09-07 10:00:54 -07:00
Olaf Faaland 4c5b89f59e Improved dnode allocation and dmu_hold_impl()
Refactor dmu_object_alloc_dnsize() and dnode_hold_impl() to simplify the
code, fix errors introduced by commit dbeb879 (PR #6117) interacting
badly with large dnodes, and improve performance.

* When allocating a new dnode in dmu_object_alloc_dnsize(), update the
percpu object ID for the core's metadnode chunk immediately.  This
eliminates most lock contention when taking the hold and creating the
dnode.

* Correct detection of the chunk boundary to work properly with large
dnodes.

* Separate the dmu_hold_impl() code for the FREE case from the code for
the ALLOCATED case to make it easier to read.

* Fully populate the dnode handle array immediately after reading a
block of the metadnode from disk.  Subsequently the dnode handle array
provides enough information to determine which dnode slots are in use
and which are free.

* Add several kstats to allow the behavior of the code to be examined.

* Verify dnode packing in large_dnode_008_pos.ksh.  Since the test
performs only creates, it should leave very few holes in the metadnode.

* Add test large_dnode_009_pos.ksh, which performs concurrent creates
and deletes, to complement existing test which does only creates.

With the above fixes, there is very little contention in a test of about
200,000 racing dnode allocations produced by tests 'large_dnode_008_pos'
and 'large_dnode_009_pos'.

name                            type data
dnode_hold_dbuf_hold            4    0
dnode_hold_dbuf_read            4    0
dnode_hold_alloc_hits           4    3804690
dnode_hold_alloc_misses         4    216
dnode_hold_alloc_interior       4    3
dnode_hold_alloc_lock_retry     4    0
dnode_hold_alloc_lock_misses    4    0
dnode_hold_alloc_type_none      4    0
dnode_hold_free_hits            4    203105
dnode_hold_free_misses          4    4
dnode_hold_free_lock_misses     4    0
dnode_hold_free_lock_retry      4    0
dnode_hold_free_overflow        4    0
dnode_hold_free_refcount        4    57
dnode_hold_free_txg             4    0
dnode_allocate                  4    203154
dnode_reallocate                4    0
dnode_buf_evict                 4    23918
dnode_alloc_next_chunk          4    4887
dnode_alloc_race                4    0
dnode_alloc_next_block          4    18

The performance is slightly improved for concurrent creates with
16+ threads, and unchanged for low thread counts.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #5396 
Closes #6522 
Closes #6414 
Closes #6564
2017-09-05 16:15:04 -07:00
Ned Bass 65dcb0f67a Handle new dnode size in incremental backup stream
When receiving an incremental backup stream, call
dmu_object_reclaim_dnsize() if an object's dnode size differs between
the incremental source and target. Otherwise it may appear that a
dnode which has shrunk is still occupying slots which are in fact
free. This will cause a failure to receive new objects that should
occupy the now-free slots.

Add a test case to verify that an incremental stream containing
objects with changed dnode sizes can be received without error. This
test case fails without this change.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #6366 
Closes #6576
2017-09-05 16:09:15 -07:00
Fabian-Gruenbichler c8811dec70 Add man page reference to systemd units
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Closes #6599
2017-09-05 13:50:35 -07:00
bunder2015 2917956841 zfs(8) manpage corrections
Corrected indent of the note located at the bottom of the options for
zfs send as well as remove an extra whitespace

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #6590
2017-09-05 13:45:18 -07:00
Brian Behlendorf e771de534f Trim new line from zfs_vdev_scheduler
Add a helper function to trim the trailing newline.  While we're
here use this new hook to immediately apply the new scheduler.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3356 
Closes #6573
2017-09-05 13:41:32 -07:00
LOLi cf7684bc8d Retire send space estimation via ZFS_IOC_SEND
Add a small wrapper around libzfs_core`lzc_send_space() to libzfs so
that every legacy ZFS_IOC_SEND consumer, along with their userland
counterpart estimate_ioctl(), can leverage ZFS_IOC_SEND_SPACE to
request send space estimation.

The legacy functionality in zfs_ioc_send() is left untouched for
compatibility purposes.

Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6029
2017-08-31 09:00:35 -07:00
Richard Lowe 1afc54f7f4 OpenZFS 2976 - remove useless offsetof() macros
Authored by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Andy Stormont <andyjstormont@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/2976
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5c5f137
Closes #6582
2017-08-30 15:53:38 -07:00
Gvozden Neskovic d22323e89f dmu_objset: release bonus buffer in failure path
Reported by kmemleak during testing of a new patch:

```
unreferenced object 0xffff9f1c12e38800 (size 1024):
  comm "z_upgrade", pid 17842, jiffies 4296870904 (age 8746.268s)
  backtrace:
    kmemleak_alloc+0x7a/0x100
    __kmalloc_node+0x26c/0x510
    range_tree_create+0x39/0xa0 [zfs]
    dmu_zfetch_init+0x73/0xe0 [zfs]
    dnode_create+0x12c/0x3b0 [zfs]
    dnode_hold_impl+0x1096/0x1130 [zfs]
    dnode_hold+0x23/0x30 [zfs]
    dmu_bonus_hold_impl+0x6b/0x370 [zfs]
    dmu_bonus_hold+0x1e/0x30 [zfs]
    dmu_objset_space_upgrade+0x114/0x310 [zfs]
    dmu_objset_userobjspace_upgrade_cb+0xd8/0x150 [zfs]
    dmu_objset_upgrade_task_cb+0x136/0x1e0 [zfs]    
    kthread+0x119/0x150
```

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes #6575
2017-08-30 12:09:18 -07:00
Eli Rosenthal 74ea6092d0 OpenZFS 7028 - avl_destroy_nodes supports emptying, not just destroying, an avl tree
Authored by: Eli Rosenthal <eli.rosenthal@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7028
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/86f617e
Closes #6583
2017-08-30 12:08:38 -07:00
Steve Dougherty de327eccbb OpenZFS 6447 - handful of nvpair cleanups
Authored by: Steve Dougherty <sdougherty@barracuda.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Don Brady <dev.fs.zfs@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/6447
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/759e89b
Closes #6581
2017-08-30 12:04:27 -07:00
Andriy Gapon ecaebdbcf6 OpenZFS 5778 - nvpair_type_is_array() does not recognize DATA_TYPE_INT8_ARRAY
Authored by: Andriy Gapon <avg@icyb.net.ua>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed-by: Don Brady <dev.fs.zfs@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/5778
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/bf4d553
Closes #6580
2017-08-30 12:00:58 -07:00
Matthew Ahrens 24ded86e8d OpenZFS 7261 - nvlist code should enforce name length limit
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed-by: Don Brady <dev.fs.zfs@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7261
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/48dd5e6
Closes #6579
2017-08-30 11:58:00 -07:00
Matthew Ahrens 006309e8d7 OpenZFS 8375 - Kernel memory leak in nvpair code
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Reviewed-by: Don Brady <dev.fs.zfs@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8375
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/843c211
Closes #6578
2017-08-30 11:50:12 -07:00
alaviss 1ea8942faa libtpool: don't clone affinity if not supported
pthread_attr_(get/set)affinity_np() is glibc-only. This commit
disables the code path that uses those functions on non-glibc
systems. Fixes the following when building with musl:

libzfs.so: undefined reference to `pthread_attr_setaffinity_np'
libzfs.so: undefined reference to `pthread_attr_getaffinity_np'

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Leorize <alaviss@users.noreply.github.com>
Closes #6571
2017-08-29 10:17:49 -07:00
Richard Yao 0d3980acbc Implement --enable-debuginfo to force debuginfo
Inspection of a Ubuntu 14.04 x64 system revealed that the config file
used to build the kernel image differs from the config file used to
build kernel modules by the presence of CONFIG_DEBUG_INFO=y:

This in itself is insufficient to show that the kernel is built with
debuginfo, but a cursory analysis of the debuginfo provided and the
size of the kernel strongly suggests that it was built with
CONFIG_DEBUG_INFO=y while the modules were not. Installing
linux-image-$(uname -r)-dbgsym had no obvious effect on the debuginfo
provided by either the modules or the kernel.

The consequence is that issue reports from distributions such as Ubuntu
and its derivatives, which build kernel modules without debuginfo, contain
nonsensical backtraces. It is therefore desirable to force generation
of debuginfo, so we implement --enable-debuginfo. Since the build system
can build both userspace components and kernel modules, the generic
--enable-debuginfo option will force debuginfo for both. However, it
also supports --enable-debuginfo=kernel and --enable-debuginfo=user for
finer grained control.

Enabling debuginfo for the kernel modules works by injecting
CONFIG_DEBUG_INFO=y into the make environment. This enables
generation of debuginfo by the kernel build systems on all Linux
kernels, but the build environment is slightly different in that
CONFIG_DEBUG_INFO is not defined in the CPP. Adding -DCONFIG_DEBUG_INFO
would fix that, but it would also cause build failures on kernels where
CONFIG_DEBUG_INFO=y is already set. That would complicate its use in
DKMS environments that support a range of kernels and is therefore
undesirable. We could write a compatibility shim to enable
CONFIG_DEBUG_INFO only when it is explicitly disabled, but we forgo
doing that because it is unnecessary. Nothing in ZoL or the kernel uses
CONFIG_DEBUG_INFO in the CPP at this time and that is unlikely to
change.

Enabling debuginfo for the userspace components is done by injecting -g
into CPPFLAGS. This is not necessary because the build system honors the
environment's CPPFLAGS by appending them to the actual CPPFLAGS used,
but it is supported for consistency.

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@clusterhq.com>
Closes #2734
2017-08-29 13:16:24 -04:00
Richard Yao 6f174823ce Make --enable-debug fail when given bogus args
Currently, bogus options to --enable-debug become --disable-debug. That
means that passing --enable-debug=true is equivalent to --disable-debug,
which is counterintuitive. We switch to AS_CASE to allow us to
fail when given a bogus option.

Also, we modify the text printed to clarify that --enable-debug enables
assertions.
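
The stricter behavior can be pictured with the small Python sketch below. It only mirrors the decision logic; the actual change is an AS_CASE in the autoconf macros, and parse_enable_debug() is a hypothetical name. It relies on the standard autoconf convention that --enable-debug yields "yes", --disable-debug yields "no", and --enable-debug=ARG yields ARG.

```python
def parse_enable_debug(enableval: str = "no") -> bool:
    """Map the configure argument to a boolean, rejecting anything else."""
    if enableval == "yes":
        return True
    if enableval == "no":
        return False
    # Previously a value such as "true" silently meant "disabled".
    raise ValueError(f"Unknown option {enableval!r} for --enable-debug")
```

Failing loudly on "true" avoids a build that the user believes has assertions enabled when it does not.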

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@clusterhq.com>
Closes #2734
2017-08-29 13:16:04 -04:00
Matthew Ahrens 1e0457e7f5 Enhance comments for large dnode project
Fix a few nits in the comments from large dnodes. Also import
some of the commit message as a comment in the code, making
it more accessible.

Reviewed-by: @rottegift 
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matt Ahrens <mahrens@delphix.com>
Closes #6551
2017-08-29 09:00:28 -07:00
dbavatar 2209e40981 Linux 4.8+ compatibility fix for vm stats
vm_node_stat must be used instead of vm_zone_stat. Unfortunately, the
old code still compiles, potentially leading to silent failure of
arc_evictable_memory().

AKAMAI: CR 3816601: Regression in zfs dropcache test

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com>
Closes #6528
2017-08-24 10:48:23 -07:00
George Melikov 076e9b946e Remove copyright duplicate in zpool man page
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #6553
2017-08-24 10:36:17 -07:00
chrisrd 2fb1a234ab dbuf_cons: deduplicate multilist_link_init()
Remove harmless duplicate multilist_link_init() introduced by
commit d3c2ae1.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes #6552
2017-08-24 10:31:59 -07:00
Giuseppe Di Natale d7323e79a6 OpenZFS 8547 - update mandoc to 1.14.3
Authored by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8547
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c66b804
Closes #6549
2017-08-24 10:30:42 -07:00
Alek P e4b6b2db12 OpenZFS 8414 - Implemented zpool scrub pause/resume
Authored by: Alek Pinchuk <apinchuk@datto.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Alek Pinchuk <apinchuk@datto.com>

OpenZFS-issue: https://www.illumos.org/issues/8414
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c29616076
Closes #6538
2017-08-24 10:27:20 -07:00
Tom Caputi 9b8407638d Send / Recv Fixes following b52563
This patch fixes several issues discovered after
the encryption patch was merged:

* Fixed a bug where encrypted datasets could attempt
  to receive embedded data records.

* Fixed a bug where dirty records created by the recv
  code weren't properly setting the dr_raw flag.

* Fixed a typo where a dmu_tx_commit() was changed to
  dmu_tx_abort()

* Fixed a few error handling bugs unrelated to the
  encryption patch in dmu_recv_stream()

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #6512 
Closes #6524 
Closes #6545
2017-08-23 16:54:24 -07:00
LOLi db4c1adaf8 Add support for DMU_OTN_* types in dbufstat.py
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6535
2017-08-22 11:53:40 -07:00
Chunwei Chen 05f85a6a64 Fix zfs_ioc_pool_sync should not use fnvlist
Using fnvlist on user input would allow a user to easily panic ZFS.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #6529
2017-08-21 13:11:11 -07:00
Gvozden Neskovic 551905dd47 vdev_mirror: kstat observables for preferred vdev
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes #6461
2017-08-21 10:05:54 -07:00
Gvozden Neskovic d6c6590c5d vdev_mirror: load balancing fixes
vdev_queue:
- Track the last position of each vdev, including the io size,
  in order to detect linear access of the following zio.
- Remove duplicate `vq_lastoffset`

vdev_mirror:
- Correctly calculate the zio offset (signedness issue)
- Deprecate `vdev_queue_register_lastoffset()`
- Add `VDEV_LABEL_START_SIZE` to zio offset of leaf vdevs

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes #6461
2017-08-21 10:05:16 -07:00
Brian Behlendorf 133a5c6598 zimport.sh: Allow custom pool create options
Allow custom options to be passed to 'zpool create' when creating
a new pool.

Normally zimport.sh is intended to prevent accidentally introduced
incompatibilities so we want the default behavior.  However, when
introducing a known incompatibility with a feature flag we need a
way to disable the feature.  By adding a line like the following
to the commit message the feature can be disabled, allowing the
pool to be compatible with older versions.

TEST_ZIMPORT_CREATE_OPTIONS="-o feature@encryption=disabled"

* Additionally fix /dev/nul -> /dev/null typo and minor white space
  formatting issues.

* Updated fail function to print a message and exit with 1 for use
  by the buildbot.

* Silence warnings when zlib_inflate / zlib_default modules don't
  exist.  This can happen when they're built into the kernel.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6520
2017-08-21 10:00:12 -07:00
LOLi 9000a9fac9 Disable mount(8) canonical paths in do_mount()
By default the mount(8) command, as invoked by 'zfs mount', will try
to resolve any path parameter to its canonical form: this could lead
to mount failures when the cwd contains a symlink having the same name
as the dataset being mounted.

Fix this by explicitly disabling mount(8) path canonicalization.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #1791 
Closes #6429 
Closes #6437
2017-08-21 09:31:54 -07:00
LOLi f763c3d1df Fix range locking in ZIL commit codepath
Since OpenZFS 7578 (1b7c1e5) if we have a ZVOL with logbias=throughput
we will force WR_INDIRECT itxs in zvol_log_write() setting itx->itx_lr
offset and length to the offset and length of the BIO from
zvol_write()->zvol_log_write(): these offset and length are later used
to take a range lock in zillog->zl_get_data function: zvol_get_data().

Now suppose we have a ZVOL with blocksize=8K and push 4K writes to
offset 0: we will only be range-locking 0-4096. This means the
ASSERTion we make in dbuf_unoverride() is no longer valid because now
dmu_sync() is called from zilog's get_data functions holding a partial
lock on the dbuf.

Fix this by taking a range lock on the whole block in zvol_get_data().
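
The widening of the lock can be sketched as follows. This is an
illustrative Python model only, not the zvol_get_data() code: the
helper name block_aligned_range is hypothetical, and a power-of-two
block size is assumed.

```python
def block_aligned_range(offset, length, blocksize):
    """Return (start, size) of the whole-block range covering a write.

    Hypothetical helper: widens [offset, offset + length) so that it
    starts and ends on block boundaries.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    if blocksize <= 0 or blocksize & (blocksize - 1):
        raise ValueError("blocksize must be a power of two")
    # Round the start down and the end up to a multiple of blocksize.
    start = offset & ~(blocksize - 1)
    end = (offset + length + blocksize - 1) & ~(blocksize - 1)
    return start, end - start
```

With blocksize=8K and a 4K write at offset 0, the sketch yields the
range 0-8192 instead of 0-4096, so the lock covers the whole block
that dmu_sync() operates on.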

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6238 
Closes #6315 
Closes #6356 
Closes #6477
2017-08-21 08:59:48 -07:00
LOLi 08de8c16f5 Fix remounting snapshots read-write
It's not enough to preserve/restore MS_RDONLY on the superblock flags
to avoid remounting a snapshot read-write: be explicit about our
intentions to the VFS layer so the readonly bit is updated correctly
in do_remount_sb().

Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6510 
Closes #6515
2017-08-17 14:28:17 -07:00
BtbN a1f3a1c05f Use /sbin/openrc-run for openrc init scripts
Using /sbin/runscript is deprecated and throws a QA warning
when still used in init scripts.

Reviewed-by: bunder2015 <omfgbunder@gmail.com>
Signed-off-by: BtbN <btbn@btbn.de>
Closes #6519
2017-08-16 15:51:51 -07:00
Brian Behlendorf c8f9061fc7 Retire legacy test infrastructure
* Removed zpios kmod, utility, headers and man page.

* Removed unused scripts zpios-profile/*, zpios-test/*,
  zpool-config/*, smb.sh, zpios-sanity.sh, zpios-survey.sh,
  zpios.sh, and zpool-create.sh.

* Removed zfs-script-config.sh.in.  When building, 'make' generates
  a common.sh with in-tree path information from the common.sh.in
  template.  This file is sourced by the test scripts and used
  for in-tree testing; it is not included in the packages.  When
  building packages 'make install' uses the same template to
  create a new common.sh which is appropriate for the packaging.

* Removed unused functions/variables from scripts/common.sh.in.
  Only minimal path information and configuration environment
  variables remain.

* Removed unused scripts from scripts/ directory.

* Remaining shell scripts in the scripts directory updated to
  cleanly pass shellcheck and added to checked scripts.

* Renamed tests/test-runner/cmd/ to tests/test-runner/bin/ to
  match install location name.

* Removed last traces of the --enable-debug-dmu-tx configure
  option, which was retired some time ago.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6509
2017-08-15 17:26:38 -07:00
Brian Behlendorf 70322be8dc Fix ZTS grow_pool/setup
The addition of the large_dnode_008_pos test case, which runs
right before this one, exposed some racy behavior in grow_pool
setup.sh on the Ubuntu kmemleak builder.  Before creating
partitions on a device, destroy any existing ones.

  ERROR: set_partition 1  100mb loop0 exited 1

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6499 
Closes #6516
2017-08-15 16:40:04 -07:00
sckobras d49d9c2bdc vdev_id: implement slot numbering by port id
With HPE hardware and hpsa-driven SAS adapters, only a single phy is
reported, but no individual per-port phys (i.e. no phy* entry below
port_dir), which breaks topology detection in the current sas_handler
code. Instead, slot information can be derived directly from the port
number. This change implements a new slot keyword "port" similar to
"id" and "lun", and assumes a default phy/port of 0 if no individual
phy entry can be found. This allows the "sas_direct" topology to be
used with current HPE Dxxxx and Apollo 45xx JBODs.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Daniel Kobras <d.kobras@science-computing.de>
Closes #6484
2017-08-14 15:18:26 -07:00
Don Brady d977122da9 Add corruption failure option to zinject(8)
Added a 'corrupt' error option that will flip a bit in the data
after a read operation.  This is useful for generating checksum
errors at the device layer (in a mirror config for example). It
is also used to validate the diagnosis of checksum errors from
the zfs diagnosis engine.
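
As a rough illustration of why a single flipped bit is enough to
produce a checksum error, consider the sketch below. It is not the
zinject implementation: flip_bit is a hypothetical helper, and CRC32
is used only as a stand-in for the checksums ZFS actually uses.

```python
import zlib


def flip_bit(data, bit_index):
    """Return a copy of data with exactly one bit inverted."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


original = b"zfs block payload"
corrupted = flip_bit(original, 3)

# The stored checksum was computed over the original data, so the
# data returned by the "read" no longer matches it.
checksum_mismatch = zlib.crc32(original) != zlib.crc32(corrupted)
```

Flipping the same bit a second time restores the original buffer,
which makes this kind of corruption easy to reason about in tests.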

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@intel.com>
Closes #6345
2017-08-14 15:17:15 -07:00
Fabian-Gruenbichler 42a76fc8d7 dracut: make module-setup.sh shebang explicit
These files are sourced by dracut (which is a bash script), so
the practical difference is small, but it is more correct:

/bin/sh is not bash on all systems (e.g. Debian and its
derivatives use /bin/dash as /bin/sh by default).

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Closes #6491
2017-08-14 10:56:04 -07:00
Tom Caputi b525630342 Native Encryption for ZFS on Linux
This change incorporates three major pieces:

The first change is a keystore that manages wrapping
and encryption keys for encrypted datasets. These
commands mostly involve manipulating the new
DSL Crypto Key ZAP Objects that live in the MOS. Each
encrypted dataset has its own DSL Crypto Key that is
protected with a user's key. This level of indirection
allows users to change their keys without re-encrypting
their entire datasets. The change implements the new
subcommands "zfs load-key", "zfs unload-key" and
"zfs change-key" which allow the user to manage their
encryption keys and settings. In addition, several new
flags and properties have been added to allow dataset
creation and to make mounting and unmounting more
convenient.

The second piece of this patch provides the ability to
encrypt, decrypt, and authenticate protected datasets.
Each object set maintains a Merkle tree of Message
Authentication Codes that protect the lower layers,
similarly to how checksums are maintained. This part
impacts the zio layer, which handles the actual
encryption and generation of MACs, as well as the ARC
and DMU, which need to be able to handle encrypted
buffers and protected data.

The last addition is the ability to do raw, encrypted
sends and receives. The idea here is to send raw
encrypted and compressed data and receive it exactly
as is on a backup system. This means that the dataset
on the receiving system is protected using the same
user key that is in use on the sending side. By doing
so, datasets can be efficiently backed up to an
untrusted system without fear of data being
compromised.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #494 
Closes #5769
2017-08-14 10:36:48 -07:00
Chunwei Chen 376994828f Fix NULL pointer when O_SYNC read in snapshot
When reading a file opened with O_SYNC, zil_commit will be triggered.
However, a snapshot has no ZIL, so we shouldn't be doing that.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #6478 
Closes #6494
2017-08-11 08:57:54 -07:00
gaurkuma 761b8ec6bf Allow longer SPA names in stats
The pool name can be 256 chars long. Today, in /proc/spl/kstat/zfs/
the name is limited to < 32 characters. This change allows
longer pool names.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: gaurkuma <gauravk.18@gmail.com>
Closes #6481
2017-08-11 08:56:24 -07:00
gaurkuma 9df9692637 Allow longer SPA names in stats
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: gaurkuma <gauravk.18@gmail.com>
Closes #641
2017-08-11 08:53:35 -07:00
Brian Behlendorf c25b8f99f8 Simplify threads, mutexs, cvs and rwlocks
* Simplify threads, mutexs, cvs and rwlocks

* Update the zk_thread_create() function to use the same trick
  as Illumos.  Specifically, cast the new pthread_t to a void
  pointer and return that as the kthread_t *.  This avoids the
  issues associated with managing a wrapper structure and is
  safe as long as the callers never attempt to dereference it.

* Update all function prototypes passed to pthread_create() to
  match the expected prototype.  We were getting away with this
  before since the functions were explicitly cast.

* Replaced direct zk_thread_create() calls with thread_create()
  for code consistency.  All consumers of libzpool now use the
  proper wrappers.

* The mutex_held() calls were converted to MUTEX_HELD().

* Removed all mutex_owner() calls and retired the interface.
  Instead use MUTEX_HELD() which provides the same information
  and allows the implementation details to be hidden.  In this
  case, that detail is the use of the pthread_equal() function.

* The kthread_t, kmutex_t, krwlock_t, and kcondvar_t types had
  any non-essential fields removed.  In the case of kthread_t
  and kcondvar_t they could be directly typedef'd to pthread_t
  and pthread_cond_t respectively.

* Removed all extra ASSERTS from the thread, mutex, rwlock, and
  cv wrapper functions.  In practice, pthreads already provides
  the vast majority of checks as long as we check the return
  code.  Removing this code from our wrappers helps readability.

* Added TS_JOINABLE state flag to request a joinable rather than
  detached thread.  This isn't a standard thread_create() state
  but it's the least invasive way to pass this information and is
  only used by ztest.

TEST_ZTEST_TIMEOUT=3600

Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4547 
Closes #5503 
Closes #5523 
Closes #6377 
Closes #6495
2017-08-11 08:51:44 -07:00
sanjeevbagewadi 21df134f4c zio_dva_throttle_done() should allow zinjected ZIO
If fault injection is enabled, the ZIO_FLAG_IO_RETRY could be set by
zio_handle_device_injection() to generate the FMA events and update
stats. Hence, ignore the flag and process such zios.

A better fix would be to add another flag in the zio_t to indicate that
the zio has failed because of a zinject rule. However, considering the
fact that we do this in debug bits, we can make do with the crude check
using the global flag zio_injection_enabled, which is set to 1 when
zinject records are added.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com>
Closes #6383 
Closes #6384
2017-08-10 15:53:40 -07:00
Fabian-Gruenbichler b58237e769 Man page fixes
* ztest.1 man page: fix typo
* zfs-module-parameters.5 man page: fix grammar

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Closes #6492
2017-08-10 15:45:25 -07:00
Fabian-Gruenbichler bbefaeba29 make module/spl/spl-kmem.c non-executable (again)
This was probably accidentally committed in

aeb9baa618
Fix: handle NULL case in spl_kmem_free_track()

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Closes #644
2017-08-10 15:23:43 -07:00
Fabian-Gruenbichler 945b7f1c63 spl-module-parameters.5 manpage: fix macro
There is no '.sh' macro in troff/groff/man, only '.SH' for section
headers. I assume .sp for a line break was intended here like
in the rest of the man page.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #643
2017-08-10 15:22:31 -07:00
Fabian-Gruenbichler a02fa347b7 splat.1 manpage: fix spelling of 'hexadecimal'
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #642
2017-08-10 15:21:54 -07:00
Giuseppe Di Natale 4334df5353 Disable rsend_024_pos
The test case frequently hangs on buildbot
TEST builders. Disable it for now.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6487
2017-08-10 07:53:10 -07:00
Brian Behlendorf 46364cb2f3 Add libtpool (thread pools)
OpenZFS provides a library called tpool which implements thread
pools for user space applications.  Porting this library means
the zpool utility no longer needs to borrow the kernel mutex and
taskq interfaces from libzpool.  This code was updated to use
the tpool library which behaves in a very similar fashion.

Porting libtpool was relatively straightforward and minimal
modifications were needed.  The core changes were:

* Fully convert the library to use pthreads.
* Updated signal handling.
* lmalloc/lfree converted to calloc/free
* Implemented portable pthread_attr_clone() function.

Finally, update the build system such that libzpool.so is no
longer linked in to zfs(8), zpool(8), etc.  All that is required
is libzfs, to which the zcommon sources were added (which is the way
it always should have been).  Removing the libzpool dependency
resulted in several build issues which needed to be resolved.

* Moved zfeature support to module/zcommon/zfeature_common.c
* Moved ratelimiting to module/zfs/zfs_ratelimit.c
* Moved get_system_hostid() to lib/libspl/gethostid.c
* Removed use of cmn_err() in zcommon source
* Removed dprintf_setup() call from zpool_main.c and zfs_main.c
* Removed highbit() and lowbit()
* Removed unnecessary library dependencies from Makefiles
* Removed fletcher-4 kstat in user space
* Added sha2 support explicitly to libzfs
* Added highbit64() and lowbit64() to zpool_util.c

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6442
2017-08-09 15:31:08 -07:00
Boris Protopopov 5146d802b4 zv_suspend_lock in zvol_open()/zvol_release()
Acquire zv_suspend_lock on first open and last close only.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Closes #6342
2017-08-09 11:10:47 -07:00
gaurkuma 520faf5ddc Crash in dbuf_evict_one with DTRACE_PROBE
Update the dbuf__evict__one() tracepoint so that it can safely
handle a NULL dmu_buf_impl_t pointer.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>    
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: gaurkuma <gauravk.18@gmail.com>
Closes #6463
2017-08-09 11:04:41 -07:00
Ned Bass 6a8ee4f71d Add debug log entries for failed receive records
Log contents of a receive record if an error occurs while writing
it out to the pool. This may help determine the cause when backup
streams are rejected as invalid.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #6465
2017-08-08 08:41:31 -07:00
Brian Behlendorf 9631681b75 Fix dnode allocation race
When performing concurrent object allocations using the new
multi-threaded allocator and large dnodes it's possible to
allocate overlapping large dnodes.

This case should have been handled by detecting an error
returned by dnode_hold_impl().  But that logic only checked
the returned dnp was not-NULL, and the dnp variable was not
reset to NULL when retrying.  Resolve this issue by properly
checking the return value of dnode_hold_impl().

Additionally, it was possible that dnode_hold_impl() would
misreport a dnode as free when it was in fact in use.  This
could occur for two reasons:

* The per-slot zrl_lock must be held over the entire critical
  section which includes the alloc/free until the new dnode
  is assigned to children_dnodes.  Additionally, all of the
  zrl_lock's in the range must be held to protect moving
  dnodes.

* The dn->dn_ot_type cannot be solely relied upon to check
  the type.  When allocating a new dnode its type will be
  DMU_OT_NONE after dnode_create().  Only later, when
  dnode_allocate() is called, will it transition to the new
  type.  This means there's a window when allocating where
  it can be mistaken for a free dnode.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6414 
Closes #6439
2017-08-08 08:38:53 -07:00
Boris Protopopov 9243b0fb47 Add assert under lock to detect cases of dispatch of a preallocated
taskq work item to more than one queue concurrently. Also, please
see discussion in zfsonlinux/zfs#3840.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Closes #609
2017-08-08 08:31:52 -07:00
Karsten Kretschmer d19a6d5c80 dracut: Install commands required for vdev_id
The vdev_id script requires awk, grep, and head.  Use dracut_install to
ensure that these commands are available in the initrd environment.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Karsten Kretschmer <kkretschmer@gmail.com>
Closes #6443
Closes #6452
2017-08-04 11:14:48 -07:00
Chunwei Chen cce83ba0ec Fix use-after-free in taskq_seq_show_impl
taskq_seq_show_impl walks the tq_active_list to show the tqent_func and
tqent_arg. However for taskq_dispatch_ent, it's very likely that the
task entry will be freed during the function call, causing a
use-after-free bug.

To fix this, we duplicate the task entry to an on-stack struct, and
assign it instead to tqt_task. This way, the tq_lock alone will
guarantee its safety.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #638 
Closes #640
2017-08-04 09:57:58 -07:00
Chunwei Chen 6ecfd2b553 Add __divmoddi4 and __udivmoddi4 for 32-bit arch
gcc-7 seems to use __udivmoddi4 for 64-bit division on 32-bit arch. This
patch implements them so we don't get an undefined reference error.
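
For readers unfamiliar with these helpers, the arithmetic they
provide (quotient and remainder from a single 64-bit division) can
be modelled with the classic shift-and-subtract method. The Python
below is only a sketch of that arithmetic under the assumption of
C-style truncating division; it is not the code added by this patch,
and the function names are reused here purely for illustration.

```python
MASK64 = (1 << 64) - 1


def udivmoddi4(n, d):
    """Unsigned 64-bit division returning (quotient, remainder)."""
    if not (0 <= n <= MASK64 and 0 < d <= MASK64):
        raise ValueError("operands must be unsigned 64-bit, divisor non-zero")
    q = 0
    r = 0
    # Bring down one bit of the dividend at a time, most significant first.
    for bit in range(63, -1, -1):
        r = (r << 1) | ((n >> bit) & 1)
        if r >= d:
            r -= d
            q |= 1 << bit
    return q, r


def divmoddi4(n, d):
    """Signed variant: quotient truncates toward zero and the
    remainder takes the sign of the dividend, as in C."""
    q, r = udivmoddi4(abs(n), abs(d))
    if (n < 0) != (d < 0):
        q = -q
    if n < 0:
        r = -r
    return q, r
```

For example, udivmoddi4(10, 3) gives (3, 1) and divmoddi4(-7, 2)
gives (-3, -1).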

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes zfsonlinux/zfs#6417 
Closes #636
2017-08-03 10:41:42 -07:00
Sen Haerens 1e1c398033 Fix zpool events scripted mode tab separator
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Sen Haerens <sen@senhaerens.be>
Closes #6444 
Closes #6445
2017-08-03 09:56:15 -07:00
LOLi b0bd8ffecd Fix parsable 'zfs get' for compressratios
This is consistent with the change introduced in bc2d809 where
'zpool get -p dedupratio' does not add a trailing "x" to the output.
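
The intended output difference can be sketched as below. This is a
hypothetical formatter, not the libzfs code; it assumes the ratio is
held as an integer number of hundredths.

```python
def format_ratio(ratio_hundredths, parsable=False):
    """Render a ratio given in hundredths (assumed representation).

    Human-readable output carries a trailing "x"; parsable output is
    the bare number so scripts can consume it directly.
    """
    text = "%d.%02d" % (ratio_hundredths // 100, ratio_hundredths % 100)
    return text if parsable else text + "x"
```

Here format_ratio(150) returns "1.50x" while
format_ratio(150, parsable=True) returns "1.50".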

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6436 
Closes #6449
2017-08-03 09:43:17 -07:00
Giuseppe Di Natale e3bdcb8ad8 Retry zfs destroy when busy in rsend tests
rsend tests in the test suite frequently create and
destroy datasets. It is possible for zfs destroy to
return an error code indicating the dataset is busy.
Simply use a log_must_busy in these cases to retry
destroying those datasets. Other fixes to rsend test
cases to avoid unmounting and remounting filesystems
and some cleanup.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6418
2017-08-03 08:57:43 -07:00
Ned Bass ecb2b7dc7f Use SET_ERROR for constant non-zero return codes
Update many return and assignment statements to follow the convention
of using the SET_ERROR macro when returning a hard-coded non-zero
value from a function. This aids debugging by recording the error
codes in the debug log.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #6441
2017-08-02 21:16:12 -07:00
Oleg Drokin 98cdcb8286 Remove misguided HAVE_MUTEX_OWNER check, take 2
It is just plain unsafe to peek inside in-kernel
mutex structure and make assumptions about what kernel
does with those internal fields like owner.

Kernel is all too happy to stop doing the expected things
like tracing lock owner once you load a tainted module
like spl/zfs that is not GPL.

As such you will get instant assertion failures like this:

  VERIFY3(((*(volatile typeof((&((&zo->zo_lock)->m_mutex))->owner) *)&
      ((&((&zo->zo_lock)->m_mutex))->owner))) ==
     ((void *)0)) failed (ffff88030be28500 == (null))
  PANIC at zfs_onexit.c:104:zfs_onexit_destroy()
  Showing stack for process 3626
  CPU: 0 PID: 3626 Comm: mkfs.lustre Tainted: P OE ------------ 3.10.0-debug #1
  Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
  Call Trace:
  dump_stack+0x19/0x1b
  spl_dumpstack+0x44/0x50 [spl]
  spl_panic+0xbf/0xf0 [spl]
  zfs_onexit_destroy+0x17c/0x280 [zfs]
  zfsdev_release+0x48/0xd0 [zfs]

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Closes #639
Closes #632
2017-08-02 20:50:27 -07:00
Gvozden Neskovic 261a3151e1 spl-mutex: fix race in mutex_exit
Prevent race on accessing kmutex_t when the mutex is
embedded in a ref counted structure.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes zfsonlinux/zfs#6401
Closes #637
2017-08-02 20:42:58 -07:00
Brian Behlendorf 549423c0d4 Revert "Remove misguided HAVE_MUTEX_OWNER check"
This reverts commit d89616fda8 which
introduced some build failures which need to be resolved before
this can be merged.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #633
2017-08-02 15:08:02 -04:00
Oleg Drokin d89616fda8 Remove misguided HAVE_MUTEX_OWNER check
It is just plain unsafe to peek inside in-kernel
mutex structure and make assumptions about what kernel
does with those internal fields like owner.

Kernel is all too happy to stop doing the expected things
like tracing lock owner once you load a tainted module
like spl/zfs that is not GPL.

As such you will get instant assertion failures like this:

  VERIFY3(((*(volatile typeof((&((&zo->zo_lock)->m_mutex))->owner) *)&
      ((&((&zo->zo_lock)->m_mutex))->owner))) == 
     ((void *)0)) failed (ffff88030be28500 == (null))
  PANIC at zfs_onexit.c:104:zfs_onexit_destroy()
  Showing stack for process 3626
  CPU: 0 PID: 3626 Comm: mkfs.lustre Tainted: P OE ------------ 3.10.0-debug #1
  Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
  Call Trace:
  dump_stack+0x19/0x1b
  spl_dumpstack+0x44/0x50 [spl]
  spl_panic+0xbf/0xf0 [spl]
  zfs_onexit_destroy+0x17c/0x280 [zfs]
  zfsdev_release+0x48/0xd0 [zfs]

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Closes #632 
Closes #633
2017-08-02 11:45:16 -07:00
Tony Hutter 6710381680 Only record zio->io_delay on reads and writes
While investigating https://github.com/zfsonlinux/zfs/issues/6425 I
noticed that ioctl ZIOs were not setting zio->io_delay correctly.  They
would set the start time in zio_vdev_io_start(), but never set the end
time in zio_vdev_io_done(), since ioctls skip it and go straight to
zio_done().  This was causing spurious "delayed IO" events to appear,
which would eventually get rate-limited and displayed as
"Missed events" messages in zed.

To get around the problem, this patch only sets zio->io_delay for read
and write ZIOs, since that's all we care about anyway.
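
A simplified model of the new behavior is shown below. The class and
function names are hypothetical stand-ins for zio_t,
zio_vdev_io_start() and zio_vdev_io_done(), and timestamps are passed
in explicitly to keep the sketch deterministic.

```python
READ, WRITE, IOCTL = "read", "write", "ioctl"


class Zio:
    def __init__(self, io_type):
        self.io_type = io_type
        self.io_delay = 0


def vdev_io_start(zio, now):
    # Only reads and writes record a start time.
    if zio.io_type in (READ, WRITE):
        zio.io_delay = now


def vdev_io_done(zio, now):
    # Convert the stored start time into an elapsed time.
    if zio.io_type in (READ, WRITE):
        zio.io_delay = now - zio.io_delay
```

In this model an ioctl that skips vdev_io_done() keeps io_delay at 0,
rather than being left holding a raw start timestamp that would look
like an enormous delay.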

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #6425 
Closes #6440
2017-08-02 09:08:38 -07:00
Giuseppe Di Natale af0f842883 mmp_on_uberblocks: Use kstat for uberblock counts
Use kstat to get a more accurate count of uberblock updates.
Using a loop with zdb can potentially miss some uberblocks.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6407 
Closes #6419
2017-07-31 16:54:34 -07:00
LOLi c7a7601c08 Fix volmode=none property behavior at import time
At import time spa_import() calls zvol_create_minors() directly: with
the current implementation we have no way to avoid device node
creation when volmode=none.

Fix this by enforcing volmode=none directly in zvol_alloc().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #6426
2017-07-31 11:07:05 -07:00
Brian Behlendorf 1e0565d10a Fix aarch64 build
Add aarch64 to the list of architectures which do not sanitize the
LDFLAGS from the environment.  See fb963d33 for details.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6424
2017-07-29 13:25:53 -07:00
Brian Behlendorf eed143dfa6 Fix aarch64 build
Add aarch64 to the list of architectures which do not sanitize the
LDFLAGS from the environment.  See e0aacd9b for details.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #635
2017-07-29 13:24:39 -07:00
Giuseppe Di Natale c1dd2f783a Disable zfs_send_007_pos
Test case zfs_send_007_pos is regularly killed
by test-runner during zfs-tests on buildbot. Disable
it for now until further investigation can be done.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Closes #6422
2017-07-28 22:37:27 -07:00
LOLi 650258d7c7 zfs promote|rename .../%recv should be an error
If we are in the middle of an incremental 'zfs receive', the child
.../%recv will exist. If we run 'zfs promote' .../%recv, it will "work",
but then zfs gets confused about the status of the new dataset.
Attempting to do this promote should be an error.

Similarly renaming .../%recv datasets should not be allowed.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #4843 
Closes #6339
2017-07-28 14:12:34 -07:00
Andriy Gapon f06f53fa3f OpenZFS 7915 - checks in l2arc_evict could use some cleaning up
Authored by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/7915
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/836a00c
Closes #6375
2017-07-28 14:09:49 -07:00
Andriy Gapon e98b611725 OpenZFS 8373 - TXG_WAIT in ZIL commit path
Authored by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov>

OpenZFS-issue: https://www.illumos.org/issues/8373
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7f04961
Closes #6403
2017-07-28 14:08:20 -07:00
bunder2015 0f69f42b43 Correct man page generation
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes #6409
Closes #6410
2017-07-27 19:06:34 -07:00
Brian Behlendorf 1f2671b9c9 Tag spl-0.7.0
META file and changelog updated.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2017-07-26 10:12:04 -07:00
Oleg Drokin 410f7ab594 Module parameter to enable spl_panic() to panic the kernel
In unattended operations it's often more useful to have a node
panic and reboot when it encounters problems, as opposed to
sitting there indefinitely waiting for somebody to discover it.

This implements an spl_panic_crash module parameter; set it
to nonzero to cause spl_panic() to call panic().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Oleg Drokin <green@linuxhacker.ru>
Closes #634
2017-07-25 23:03:12 -07:00
LOLi cd47801828 Avoid WARN() from procfs on kstat collision
When we load a ZFS pool whose spa_name equals the name of an existing
kstat we would have to create a duplicate entry, which procfs doesn't like.

For instance a ZFS pool named "zil" would have its kstat "txgs"
(module "zfs/zil") installed under "/proc/spl/kstat/zfs/zil":
unfortunately we already have a kstat named "zil" (module "zfs")
installed in the same procfs location.

Avoid this issue by skipping the duplicate entry creation in procfs.
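
The skip-on-collision idea can be modelled with a small registry
keyed by procfs path. This is a sketch only; ProcDir and install()
are hypothetical names and do not correspond to SPL functions.

```python
class ProcDir:
    """Toy model of a procfs directory keyed by entry path."""

    def __init__(self):
        self.entries = {}

    def install(self, path, owner):
        # Skip creation when the path is already taken instead of
        # attempting a duplicate registration.
        if path in self.entries:
            return False
        self.entries[path] = owner
        return True


proc = ProcDir()
first = proc.install("/proc/spl/kstat/zfs/zil", "kstat zil (module zfs)")
second = proc.install("/proc/spl/kstat/zfs/zil", "pool zil (module zfs/zil)")
```

In the sketch the first install succeeds and the second is skipped,
leaving the existing entry untouched.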

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #628
2017-07-24 10:52:53 -07:00
Brian Behlendorf 944117514d Linux 4.13 compat: wait queues
Commit torvalds/linux@ac6424b9
- Renamed struct wait_queue -> struct wait_queue_entry.

Commit torvalds/linux@2055da97
- Renamed wait_queue_head::task_list -> wait_queue_head::head
- Renamed wait_queue_entry::task_list -> wait_queue_entry::entry

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #629
2017-07-23 19:32:14 -07:00
Brian Behlendorf ae42190b79 Tag 0.7.0-rc5
Fifth release candidate.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2017-07-13 12:07:59 -07:00
Brian Behlendorf c93d9dff36 Don't cache the system hostid
Historically the SPL cached the system hostid the first time it
was accessed.  This was done to speed up subsequent accesses.
But in practice the system hostid is rarely accessed and it's
inconvenient that it doesn't promptly detect /etc/hostid
configuration changes.  Therefore, zone_get_hostid() has been
updated to always refresh the system hostid reported.

Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #626
2017-07-13 13:22:28 -04:00
Prakash Surya dfbd813ec7 Add ASSERT3B/VERIFY3B/USEC2NSEC/NSEC2USEC macros
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <prakash.surya@delphix.com>
Closes #627
2017-07-13 13:19:15 -04:00
Chunwei Chen 7a35f2b495 Fix RWSEM_SPINLOCK_IS_RAW check failed
Initialize dummy_lock to fix the build error in gcc 7.1.1 with:
  error: ‘dummy_lock’ is used uninitialized in this function

Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #622
2017-06-28 14:44:43 -07:00
Chunwei Chen ac48361c0c config: allow --with-linux without --with-linux-obj
Don't use `uname -r` to determine kernel build directory when the user
specified kernel source with --with-linux. Otherwise, the user is forced
to use --with-linux-obj even if they are the same directory, which is
very counterintuitive.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2017-05-25 10:12:50 -07:00
Chunwei Chen 3bda331ba8 Improve gitignore
Exclude Makefile.in in module/ and fix the gitignore in cmd/
Also, ignore *.patch and *.orig files

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2017-05-25 10:12:50 -07:00
Brian Behlendorf 2ded1c7eff Fix cv_timedwait timeout
Perform the already past expiration time check before updating
cvp->cv_mutex with the provided mutex.  This check only depends
on local state.  Doing it first ensures that cvp->cv_mutex will not
be updated in the timeout case or if it's ever called with an
expire_time <= now.
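
The ordering described above can be sketched in userspace C. This is a
simplified model, not the SPL code: the model_* types and function are
hypothetical stand-ins, and the blocking wait itself is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the SPL types. */
typedef struct model_mutex { int unused; } model_mutex_t;
typedef struct model_cv { model_mutex_t *cv_mutex; } model_cv_t;

/*
 * Returns -1 on timeout and 1 when a wait would have been started.
 * The actual sleep is omitted from this model.
 */
static int
model_cv_timedwait(model_cv_t *cvp, model_mutex_t *mp,
    long expire_time, long now)
{
	/* This check depends only on local state, so do it first. */
	if (expire_time <= now)
		return (-1);

	/* Only a caller that will really wait records its mutex. */
	if (cvp->cv_mutex == NULL)
		cvp->cv_mutex = mp;
	assert(cvp->cv_mutex == mp);

	return (1);
}
```

With an expire_time in the past, the call returns before cv_mutex is
touched, which is the property the message describes.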

Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #616
2017-05-25 10:01:44 -07:00
Chunwei Chen 8f87971e1f Linux 4.12 compat: PF_FSTRANS was removed
Change SPL_FSTRANS to optionally contain PF_FSTRANS. Also, add
__spl_pf_fstrans_check for the checks specifically for PF_FSTRANS.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #614
2017-05-09 10:36:54 -07:00
Brian Behlendorf 3665037f30 Tag 0.7.0-rc4
Fourth release candidate.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2017-05-05 09:23:03 -07:00
Olaf Faaland 481762f6a9 glibc 2.25 compat: remove assert(X=Y)
The assert() related definitions in glibc 2.25 were altered to warn
about assert(X=Y) when -Wparentheses is used.  See
https://abi-laboratory.pro/tracker/changelog/glibc/2.25/log.html

lib/list.c used this construct to set the value of a magic field which
is defined only when debugging.

Replaced the assert()s with #ifndef/#endifs.
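
A minimal sketch of the pattern, assuming the debug-only field is guarded
by NDEBUG (the message does not name the guard). The model_* names and the
magic value are hypothetical and are not taken from lib/list.c.

```c
#include <assert.h>

#define	MODEL_LIST_MAGIC	0xDEADBEEFu	/* hypothetical value */

struct model_list {
	int count;
#ifndef NDEBUG
	unsigned int magic;	/* exists only in debug builds */
#endif
};

static void
model_list_init(struct model_list *l)
{
	l->count = 0;
	/*
	 * Old form, which glibc 2.25 flags under -Wparentheses:
	 *	assert(l->magic = MODEL_LIST_MAGIC);
	 */
#ifndef NDEBUG
	l->magic = MODEL_LIST_MAGIC;
#endif
}

static int
model_list_count(const struct model_list *l)
{
	/* A comparison inside assert() is still fine. */
	assert(l->magic == MODEL_LIST_MAGIC);
	return (l->count);
}
```

Both forms assign the field only in debug builds. The explicit guard just
makes that visible instead of relying on assert() being compiled out.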

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #610
2017-04-03 13:33:48 -07:00
Olaf Faaland bf8abea4da Linux 4.11 compat: remove stub for __put_task_struct
Before kernel 2.6.29 credentials were embedded in task_structs, and zfs had
cases where one thread would need to refer to the credential of another thread,
forcing it to take a hold on the foreign thread's task_struct to ensure it was
not freed.

Since 2.6.29, the credential has been moved out of the task_struct into a
cred_t.

In addition, the mainline kernel originally did not export __put_task_struct()
but the RHEL5 kernel did, according to zfsonlinux/spl@e811949a57.  As of
2.6.39 the mainline kernel exports it.

There is no longer zfs code that takes or releases holds on a task_struct, and
so there is no longer any reference to __put_task_struct().

This affects the linux 4.11 kernel because the prototype for
__put_task_struct() is in a new include file (linux/sched/task.h) and so the
config check failed to detect the exported symbol.

Removing the unnecessary stub and corresponding config check.  This works on
kernels since the oldest one currently supported, 2.6.32 as shipped with
CentOS/RHEL.

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #608
2017-03-20 17:43:45 -07:00
Olaf Faaland 9a054d54fb Linux 4.11 compat: add linux/sched/signal.h
In Linux 4.11, torvalds/linux@2a1f062, signal handling related functions
were moved from sched.h into sched/signal.h.

Add configure checks to detect this and include the new file where
needed.

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #608
2017-03-20 17:43:43 -07:00
Olaf Faaland 94b1ab2ae0 Linux 4.11 compat: vfs_getattr() takes 4 args
There are changes to vfs_getattr() in torvalds/linux@a528d35.  The new
interface is:

int vfs_getattr(const struct path *path, struct kstat *stat,
               u32 request_mask, unsigned int query_flags)

The request_mask argument indicates which field(s) the caller intends to
use.  Fields the caller does not specify via request_mask may be set in
the returned struct anyway, but their values may be approximate.

The query_flags argument indicates whether the filesystem must update
the attributes from the backing store.

This patch uses the query_flags value which results in vfs_getattr behaving
the same as it did with the 2-argument version which the kernel provided
before Linux 4.11.

Members blksize and blocks are now always the same size regardless of
arch.  They match the size of the equivalent members in vnode_t.

The configure checks are modified to ensure that the appropriate
vfs_getattr() interface is used.

A more complete fix, removing the ZFS dependency on vfs_getattr()
entirely, is deferred as it is a much larger project.

Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #608
2017-03-20 17:43:39 -07:00
Brian Behlendorf e0aacd9b97 Fix powerpc build
Unlike other architectures, which sanitize the LDFLAGS from the
environment in arch/<arch>/Makefile, the powerpc Makefile
allows LDFLAGS to be passed through, resulting in the following
build failure.

  /usr/bin/ld: unrecognized option '-Wl,-z,relro'

LDFLAGS is set in /usr/lib/rpm/redhat/macros by default.  Clear
the environment variable when building kmods for powerpc.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #607
2017-03-06 09:16:22 -08:00
Olaf Faaland 8d5feecacf Linux 4.11 compat: set_task_state() removed
Replace uses of set_task_state(current, STATE) with
set_current_state(STATE).

In Linux 4.11, torvalds/linux@642fa44, set_task_state() is removed.

All spl uses are of the form set_task_state(current, STATE).
set_current_state(STATE) is equivalent and has been available since
Linux 2.2.26.

Furthermore, set_current_state(STATE) is already used in about 15
locations within spl.  This change should have no impact other than
removing an unnecessary dependency.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Closes #603
2017-02-23 09:52:08 -08:00
Chunwei Chen 97048200f8 Use kernel slab for vn_cache and vn_file_cache
Resolve a false positive in the kmemleak checker by shifting to the
kernel slab.  It shows up because vn_file_cache is using KMC_KMEM
which is directly allocated using __get_free_pages, which is not
automatically tracked by kmemleak.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #599
2017-01-31 13:44:01 -08:00
David Quigley 43b857fddb Add a PAGESHIFT definition
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: David Quigley <david.quigley@intel.com>
Closes #598
2017-01-31 10:36:18 -08:00
Brian Behlendorf f5c5286daa Tag 0.7.0-rc3
Third release candidate.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2017-01-20 10:17:40 -08:00
clefru 2d4d81c485 Reimplement rt_mutex_owner to fix build with DEBUG & PREEMPT_RT_FULL
rt_mutex_owner is internal to kernel/locking/rtmutex_common.h and
inaccessible for SPL via the public kernel headers. The way of
accessing the owner has been stable since at least 3.13 ([1], [2]),
which is masking the lowest bit in the owner pointer in rt_mutex. We
do the same.

[1] http://lxr.free-electrons.com/source/kernel/locking/rtmutex_common.h?v=3.13#L99
[2] http://lxr.free-electrons.com/source/kernel/locking/rtmutex_common.h?v=4.9#L78
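
The masking step can be illustrated in userspace C. The struct layout and
the MODEL_FLAG_BIT name below are hypothetical; the only detail taken from
the message is that the lowest bit of the owner pointer is masked off.

```c
#include <assert.h>
#include <stdint.h>

struct model_task { int pid; };

/* Hypothetical stand-in for the rt_mutex field that stores the owner. */
struct model_rt_mutex { uintptr_t owner; };

/* The lowest bit carries a flag, not part of the address. */
#define	MODEL_FLAG_BIT	((uintptr_t)1)

static struct model_task *
model_rt_mutex_owner(const struct model_rt_mutex *lock)
{
	return ((struct model_task *)(lock->owner & ~MODEL_FLAG_BIT));
}
```

This works because task structures are aligned to more than one byte, so
the lowest bit of a valid pointer is always zero and is free to carry a
flag.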

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Clemens Fruhwirth <clemens@endorphin.org>
Closes #593
2017-01-19 14:41:38 -08:00
George Melikov 5cb44271b4 Remove identical if statements in module/spl/spl-vnode.c
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Melikov <mail@gmelikov.ru>
Closes #594
2017-01-19 14:32:45 -08:00
Kevin Tanguy 0194e4a03c Add support for recent kmem_cache_create_usercopy
SLAB_USERCOPY flag was used to indicate PAX
not to kill copies from kernel to userland.

With recent grsecurity patchset and
CONFIG_GRKERNSEC_HIDESYM that enables
CONFIG_PAX_USERCOPY zfs would panic.

Handle newer API while keeping old one functional.

Tested-by: RageLtMan <rageltman@sempervictus>
Reviewed-by: spendergrsec <spender@grsecurity.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kevin Tanguy <kevin.tanguy@ovh.net>
Closes #595
2017-01-17 12:05:14 -08:00
RageLtMan 120faefed9 Update struct member initializers to C89
When building SPL within the kernel tree, C99 initializers cause
build failures and need to be converted to C89 as kernel CFLAGS
specify -std=gnu89.

This fix was provided by @behlendorf in #595 discussion notes and
manually implemented in the current master revision.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: RageLtMan <rageltman@sempervictus>
Closes #597
2017-01-13 14:12:42 -08:00
Clemens Fruhwirth 8e99d66b05 Add support for rw semaphore under PREEMPT_RT_FULL
The main complication from the RT patch set is that the RW semaphore
locks change such that read locks on an rwsem can be taken only by
a single thread.  All other threads are locked out. This single
thread can take a read lock multiple times though. The underlying
implementation changes to a mutex with an additional read_depth
count.

The implementation can be best understood by inspecting the RT
patch.  rwsem_rt.h and rt.c give the best insight into how RT
rwsem works. My implementation for rwsem_tryupgrade is basically
an inversion of rt_downgrade_write found in rt.c. Please see the
comments in the code.

Unfortunately, I have to drop SPLAT rwlock test4 completely as this
test tries to take multiple locks from different threads, which RT
rwsems do not support.  Otherwise SPLAT, zconfig.sh, zpios-sanity.sh
and zfs-tests.sh pass on my Debian-testing VM with the kernel
linux-image-4.8.0-1-rt-amd64.

Tested-by: kernelOfTruth <kerneloftruth@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Clemens Fruhwirth <clemens@endorphin.org>
Closes zfsonlinux/zfs#5491
Closes #589
Closes #308
2016-12-19 12:45:24 -08:00
Clemens Fruhwirth 6d064f7a07 Remove stale comment from rw_tryupgrade()
Commit f58040c0fc should have removed
this comment which is no longer relevant.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Clemens Fruhwirth <clemens@endorphin.org>
Issue #589
2016-12-19 11:27:27 -08:00
Chunwei Chen 9c9ad845ef Refactor some splat macro to function
Refactor the code by making splat_test_{init,fini} and
splat_subsystem_{init,fini} into functions. They have no reason to be macros,
and inlining every call would bloat the code.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2016-12-15 11:30:11 -08:00
Chunwei Chen 71a3c9c45d Fix splat memleak
SPLAT_TEST_FINI did not call kfree, causing a memory leak.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
2016-12-15 11:30:11 -08:00
Chunwei Chen f200b83673 Add system_delay_taskq for long delay
Add a dedicated system_delay_taskq for long delays such as spa_deadman and
zpl_posix_acl_free. This will allow us to use system_taskq by dispatching
multiple tasks and then calling taskq_wait_outstanding.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #588
2016-12-08 14:00:20 -07:00
Chunwei Chen 493492559e Limit number of tasks shown in taskq proc
To prevent holding tq_lock for too long.

Before zfsonlinux/zfs@8e71ab9, hogging delay tasks and cat /proc/spl/taskq
would easily cause a lockup. While that bug has been fixed, it's probably
still a good idea to do this just in case task lists grow too large.

Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #586
2016-12-01 11:06:27 -07:00
Ubuntu cbba714667 Add TASKQID_INVALID and TASKQID_INITIAL macros
Add the TASKQID_INVALID and TASKQID_INITIAL macros and update the
taskq implementation and test cases to use them.  This is solely
for the purposes of readability and introduces no functional change.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-11-02 10:34:19 -07:00
Ubuntu 1b457bcbe5 Fix vmem_size()
Add a minimal implementation of vmem_size() which accounts for the
virtual memory usage of the SPL's kmem cache.  This functionality
is only useful on 32-bit systems with a small virtual address space.

The following assumptions are made:

  1) The major SPL consumer of virtual memory is the kmem cache.
  2) Memory allocated with vmem_alloc() is short lived and can be ignored.
  3) Allow a 4MB floor as a generous pad given normal consumption.
  4) The spl_kmem_cache_sem only contends with cache create/destroy.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-11-02 10:34:19 -07:00
Brian Behlendorf 7b25c48e6e Tag 0.7.0-rc2
Second release candidate.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-10-25 13:13:49 -07:00
Chunwei Chen ae7eda1dde Linux 4.9 compat: group_info changes
In Linux 4.9, torvalds/linux@81243ea, group_info changed from a 2D array
accessed via ->blocks to a 1D array accessed via ->gid. We change the spl
cred functions accordingly.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #581
2016-10-20 09:33:28 -07:00
Chunwei Chen 87063d7dc3 Fix splat-cred.c cred usage
No need to crhold current_cred(), fix possible leak in splat_cred_test2

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #556
2016-10-20 09:33:22 -07:00
Chunwei Chen 9ba3c01923 Fix crgetgroups out-of-bound and misc cred fix
init_groups has 0 nblocks, therefore calling the current crgetgroups with
init_groups would result in an out-of-bounds access. We fix this by returning
NULL when nblocks is 0.

Cap crgetngroups to NGROUPS_PER_BLOCK, since crgetgroups will only return
blocks[0].

Also, remove all get_group_info calls. The cred already holds a reference on
the group_info, and the cred is not mutable, so there's no reason to hold an
extra reference while we hold the cred.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #556
2016-10-20 09:33:01 -07:00
tuxoko 0d26756665 Fix out-of-bound in per_cpu in spl_random_init
When iterating per_cpu values, we need to use for_each_possible_cpu. While
NR_CPUS indicates the number of CPUs supported by the kernel, the kernel might
not initialize all of them if it decides they cannot be used.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #578
2016-10-07 20:59:46 -07:00
tuxoko 2529b3a80e Linux 4.8 compat: Fix RW_READ_HELD
Linux 4.8, starting from torvalds/linux@19c5d690e, will set owner to 1 when
read held instead of leaving it NULL. So we change the condition to
`rw_owner(rwp) <= 1` in RW_READ_HELD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes zfsonlinux/zfs#5233 
Closes #577
2016-10-07 20:53:58 -07:00
Brian Behlendorf 341dfdb3fd Fix p0 initializer
Due to changes in the task_struct the following warning occurs
when initializing the global p0.  Since this structure only exists
for its address to be taken, initialize it in a manner which isn't
sensitive to internal changes to the structure.

  module/spl/spl-generic.c:58:1: error: missing braces around
  initializer [-Werror=missing-braces]

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #576
2016-10-04 17:26:36 -07:00
Brian Behlendorf 6c2a66bfa8 Fix aarch64 type warning
Explicitly cast type in splat-rwlock.c test case to silence
the following warning.

  warning: format ‘%ld’ expects argument of type ‘long int’,
  but argument N has type ‘int’

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #574
2016-10-01 18:33:01 -07:00
Brian Behlendorf 8acfb2bcc1 Fix automatically generated release number
When building from the head of a branch a release number is
automatically generated with `git describe` using the last tag
on that branch as the base.  For this to work the last tag on the
branch needs to be predictable given the current META file.

This logic was accidentally broken when an -rcX tag was added to
the branch.  Update it to search for a VERSION or VERSION-RELEASE
tag.

Reviewed-by: Chris Siebenmann <cks.git01@cs.toronto.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#5105
Closes #572
2016-09-21 13:44:32 -07:00
Brian Behlendorf cb81c0c588 Increase spl_kmem_alloc_warn limit
In order to support ABD with large blocks the spl_kmem_alloc_warn
limit needs to be increased to 64K.

A 16M block requires that pointers be stored for 4096 4K-pages
on an x86_64 system.  Each of these pointers is 8 bytes requiring
an allocation of 8*4096=32,768 bytes.  The addition of a small
header to this structure pushes the allocation over the default
32K warning threshold.

In addition, fix a small bug where MAX was used instead of MIN
when setting the default.  This ensures a reasonable limit is
still set on systems with page sizes larger than 4K.
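
The arithmetic above can be checked with a short userspace sketch. The
function name and the header size used in the example are hypothetical;
only the block size, page size and pointer size come from the message.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Bytes needed to store one pointer per page of a block, plus a
 * small header in front of the pointer array.
 */
static size_t
model_page_array_bytes(size_t block_size, size_t page_size,
    size_t ptr_size, size_t header_size)
{
	size_t npages = block_size / page_size;

	return (header_size + npages * ptr_size);
}
```

For a 16M block with 4K pages and 8-byte pointers the array alone is
exactly 32,768 bytes, so any non-empty header moves the request past a
32K threshold while leaving it well under 64K.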

Reviewed-by: David Quigley <david.quigley@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #571
2016-09-16 17:10:36 -07:00
legend-hua 49fbac3ace Fix spl check.sh script
Update splat_cmd to reference the correct location of the splat utility.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Liu Hua <liu.hua130@zte.com.cn>
Closes #570
2016-09-14 17:17:00 -07:00
tuxoko 4329bd5b73 Cleanup in cred.h
Remove the code that doesn't make any sense.

Reviewed-by: Brian Behlendorf <behlendorf@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes #569
2016-09-14 16:59:31 -07:00
Brian Behlendorf 4fd75d35af Tag 0.7.0-rc1
First release candidate.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-09-07 10:33:21 -07:00
GeLiXin aeb9baa618 Fix: handle NULL case in spl_kmem_free_track()
When DEBUG_KMEM_TRACKING is enabled in SPL, we keep tracking all
the buffers allocated by kmem_alloc() and kmem_zalloc().  If a NULL
pointer, which has no tracking info in SPL, is passed to
spl_kmem_free_track(), we just ignore it.

Signed-off-by: GeLiXin <ge.lixin@zte.com.cn>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#4967
Closes #567
2016-08-19 09:14:24 -07:00
Tim Chase 576865be20 Fix HAVE_MUTEX_OWNER test for kernels prior to 4.6
Recent 4.X kernels prior to 4.6 require #include of spinlock.h in
order to get the definition of __ARCH_SPIN_LOCK_UNLOCKED which is
used by DEFINE_MUTEX().

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #566
2016-08-01 12:45:08 -07:00
Nikolay Borisov 4b9dddf430 Add handling for kernel 4.7's CONFIG_TRIM_UNUSED_KSYMS
Kernel 4.7 added the option to trim the unused exported symbols. In
my testing this proved to be problematic since the PDE_DATA function
was considered unused and as such was trimmed. This in turn caused the
respective test during spl's configure stage to falsely detect that
PDE_DATA is not defined, which in turn caused build failures later.

Handle this situation by detecting whether CONFIG_TRIM_UNUSED_KSYMS
is enabled and refusing to build against a kernel which has it enabled.

Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #565
2016-08-01 12:43:01 -07:00
Nikolay Borisov fb83388387 Add gitignore entry for spl-*.o.d files
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #565
2016-08-01 12:42:55 -07:00
Brian Behlendorf b7c7008ba2 Linux 4.8 compat: rw_semaphore atomic_long_t count
For non-rwsem-spinlocks the "count" member was changed from a
"long" to "atomic_long_t" type.  A configure check has been
added to detect this change along with new versions of the
_rwsem_tryupgrade() function and RWSEM_COUNT() macro.  See
https://github.com/torvalds/linux/commit/8ee62b18 for complete
details.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #563
2016-07-29 14:17:53 -07:00
Tom Caputi d2f97b2a26 Added highbit() and lowbit() macros
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #562
2016-07-20 10:28:46 -07:00
Tony Hutter 5ad98ad097 Add _ALIGNMENT_REQUIRED to isa_defs.h for checksums
_ALIGNMENT_REQUIRED needs to be #defined in isa_defs.h in order to
port the Illumos checksum code to ZoL:

4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R
OpenZFS-issue: https://www.illumos.org/issues/4185
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45818ee

Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #561
2016-06-21 13:37:04 -07:00
Jinshan Xiong 16fc1ec3ba Improve spl slab cache alloc
The policy is to try to allocate with KM_NOSLEEP, which will lead to
memory allocation with GFP_ATOMIC, and if it fails, it will launch
a taskq to expand slab space.

This way it should be able to get better NUMA memory locality and
reduce the overhead of context switch.

Signed-off-by: Jinshan Xiong <jinshan.xiong@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #551
2016-06-01 10:26:42 -07:00
Chunwei Chen ea5f1a200b Fix use-after-free in splat_taskq_test7
This splat_vprint is using tq_arg->name after tq_arg is freed.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #557
2016-05-31 11:58:42 -07:00
Chunwei Chen f58040c0fc Implement a proper rw_tryupgrade
The current rw_tryupgrade does rw_exit and then rw_tryenter(RW_WRITER), and
then does rw_enter(RW_READER) if that fails. This violates the assumption that
rw_tryupgrade is atomic and could cause extra contention or even lock
inversion.

This patch implements a proper rw_tryupgrade. For rwsem-spinlock, we take
the spinlock to check rwsem->count and rwsem->wait_list. For normal rwsem, we
use cmpxchg on rwsem->count to change the value from single reader to single
writer.
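
The compare-and-exchange idea for the normal rwsem case can be sketched
with C11 atomics. This is a userspace model under stated assumptions: the
count encodings below are hypothetical, and the kernel's real values vary
by version and architecture.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical encodings of the two states of interest. */
#define	MODEL_SINGLE_READER	1L
#define	MODEL_SINGLE_WRITER	(-1L)

struct model_rwsem { atomic_long count; };

/*
 * Succeeds only if the count says "exactly one reader".  The swap is
 * a single atomic step, so the lock is never observed as unheld.
 */
static bool
model_tryupgrade(struct model_rwsem *sem)
{
	long expected = MODEL_SINGLE_READER;

	return (atomic_compare_exchange_strong(&sem->count, &expected,
	    MODEL_SINGLE_WRITER));
}
```

If another reader holds the lock, or a waiter has changed the count, the
exchange fails and the count is left untouched, so the caller still holds
its read lock.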

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes zfsonlinux/zfs#4692
Closes #554
2016-05-31 11:44:15 -07:00
YunQiang Su c60a51b640 Add isa_defs for MIPS
GCC for MIPS defines _LP64 only when targeting 64-bit, and does not
define _ILP32 when targeting 32-bit.

Signed-off-by: YunQiang Su <syq@debian.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #558
2016-05-31 09:05:56 -07:00
Chunwei Chen b3a22a0a00 Fix taskq_wait_outstanding re-evaluate tq_next_id
wait_event is a macro, so the current implementation re-evaluates
tq_next_id every time it wakes up. This makes
taskq_wait_outstanding(tq, 0) equivalent to taskq_wait(tq).
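
The effect of macro re-evaluation can be reproduced in a small userspace
model. Everything named model_* or MODEL_* is hypothetical; the loop only
imitates a wait macro whose condition text is evaluated again on each
wakeup, while tasks keep completing and being dispatched.

```c
#include <assert.h>

static long model_next_id;	/* next id to hand out */
static long model_completed;	/* highest id whose task has finished */

static void
model_reset(void)
{
	model_next_id = 10;
	model_completed = 0;
}

/* One "wakeup": a task finishes and a new one is dispatched. */
static void
model_tick(void)
{
	model_completed++;
	model_next_id++;
}

/* The condition is pasted in as text and re-read on every pass. */
#define	MODEL_WAIT(cond, limit, wakeups) do {		\
	(wakeups) = 0;					\
	while (!(cond) && (wakeups) < (limit)) {	\
		model_tick();				\
		(wakeups)++;				\
	}						\
} while (0)

/* Re-reads model_next_id each pass, so the target keeps moving. */
static int
model_wait_moving_target(int limit)
{
	int wakeups;

	MODEL_WAIT(model_completed >= model_next_id - 1, limit, wakeups);
	return (wakeups);
}

/* Snapshots the target once before waiting. */
static int
model_wait_snapshot(int limit)
{
	int wakeups;
	long target = model_next_id - 1;

	MODEL_WAIT(model_completed >= target, limit, wakeups);
	return (wakeups);
}
```

In this model the snapshot version returns after the nine tasks that were
outstanding at call time finish, while the moving-target version only
stops when it hits the wakeup limit.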

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #553
2016-05-24 13:02:10 -07:00
Chunwei Chen 5ce028b0d4 Fix race between taskq_destroy and dynamic spawning thread
Although taskq_destroy waits for dynamic_taskq to finish its tasks, that
does not imply the thread being spawned is up and running. This can cause
the taskq to be freed before the thread can exit.

We fix this by using tq_nspawn to track how many threads are being spawned
before they are inserted into the thread list, and by having taskq_destroy
wait for it to drop to zero.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #553
Closes #550
2016-05-24 13:00:17 -07:00
Chunwei Chen 872e0cc9c7 Restore CALLOUT_FLAG_ABSOLUTE in cv_timedwait_hires
In 39cd90e, I mistakenly disabled the ability to use an absolute expire time
in cv_timedwait_hires. I'm not quite sure why I did that, so let's restore it.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #553
2016-05-24 12:58:49 -07:00
Chunwei Chen fdbc1ba99d Linux 4.7 compat: inode_lock() and friends
Linux 4.7 changes i_mutex to i_rwsem, and we should use inode_lock and
inode_lock_shared to take exclusive and shared locks respectively.

We use spl_inode_lock{,_shared}() to hide the difference. Note that on older
kernels you'll always take an exclusive lock.

We also add all the other inode_lock friends. Nested users should now
explicitly call spl_inode_lock_nested with the correct subclass.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#4665
Closes #549
2016-05-20 11:00:14 -07:00
Chunwei Chen 39cd90ef08 Add cv_timedwait_sig_hires to allow interruptible sleep
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #548
2016-05-12 14:54:15 -07:00
David Quigley 5e39e4f0b2 Add a macro to convert seconds to nanoseconds and vice-versa
Required infrastructure for zfsonlinux/zfs#4600.
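
A minimal sketch of such conversion macros. The MODEL_* names are
hypothetical and are not taken from the SPL headers.

```c
#include <assert.h>

#define	MODEL_NANOSEC		1000000000LL

/* Widen before multiplying so a 32-bit argument cannot overflow. */
#define	MODEL_SEC2NSEC(s)	((long long)(s) * MODEL_NANOSEC)

/* Integer division, so partial seconds are truncated. */
#define	MODEL_NSEC2SEC(n)	((long long)(n) / MODEL_NANOSEC)
```

Note that the two directions are not exact inverses: converting
nanoseconds to seconds discards the sub-second remainder.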

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #546
2016-05-05 16:10:46 -07:00
Tim Chase ea2633ad26 Clear PF_FSTRANS over spl_filp_fallocate()
The problem described in 2a5d574 also applies to XFS's file or inode
fallocate method.  Both paths may trigger writeback and expose this
issue, see the full stack below.

When layered on XFS a warning will be emitted under CentOS7 when entering
either the file or inode fallocate method with PF_FSTRANS already set.
To avoid triggering this error PF_FSTRANS is cleared and then reset
in vn_space().
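
The clear-and-reset pattern can be modeled in userspace C. The flag value,
the global flags word and the model_* functions are hypothetical stand-ins
for current->flags and the real fallocate call.

```c
#include <assert.h>

#define	MODEL_PF_FSTRANS	0x1u	/* hypothetical bit value */

static unsigned int model_task_flags;	/* stands in for current->flags */

/* The lower filesystem complains if the flag is set on entry. */
static int
model_fallocate(void)
{
	return ((model_task_flags & MODEL_PF_FSTRANS) ? -1 : 0);
}

static int
model_vn_space(void)
{
	unsigned int saved = model_task_flags & MODEL_PF_FSTRANS;
	int error;

	model_task_flags &= ~MODEL_PF_FSTRANS;	/* clear around the call */
	error = model_fallocate();
	model_task_flags |= saved;	/* put it back only if it was set */

	return (error);
}
```

Saving the original bit, rather than setting it unconditionally afterward,
keeps callers that never had the flag set unaffected.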

WARNING: at fs/xfs/xfs_aops.c:982 xfs_vm_writepage+0x58b/0x5d0

Call Trace:
 [<ffffffff810a1ed5>] warn_slowpath_common+0x95/0xe0
 [<ffffffff810a1f3a>] warn_slowpath_null+0x1a/0x20
 [<ffffffffa0231fdb>] xfs_vm_writepage+0x58b/0x5d0 [xfs]
 [<ffffffff81173ed7>] __writepage+0x17/0x40
 [<ffffffff81176f81>] write_cache_pages+0x251/0x530
 [<ffffffff811772b1>] generic_writepages+0x51/0x80
 [<ffffffffa0230cb0>] xfs_vm_writepages+0x60/0x80 [xfs]
 [<ffffffff81177300>] do_writepages+0x20/0x30
 [<ffffffff8116a5f5>] __filemap_fdatawrite_range+0xb5/0x100
 [<ffffffff8116a6cb>] filemap_write_and_wait_range+0x8b/0xd0
 [<ffffffffa0235bb4>] xfs_free_file_space+0xf4/0x520 [xfs]
 [<ffffffffa023cbce>] xfs_file_fallocate+0x19e/0x2c0 [xfs]
 [<ffffffffa036c6fc>] vn_space+0x3c/0x40 [spl]
 [<ffffffffa0434817>] vdev_file_io_start+0x207/0x260 [zfs]
 [<ffffffffa047170d>] zio_vdev_io_start+0xad/0x2d0 [zfs]
 [<ffffffffa0474942>] zio_execute+0x82/0xe0 [zfs]
 [<ffffffffa036ba7d>] taskq_thread+0x28d/0x5a0 [spl]
 [<ffffffff810c1777>] kthread+0xd7/0xf0
 [<ffffffff8167de2f>] ret_from_fork+0x3f/0x70

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Closes zfsonlinux/zfs#4529
2016-04-26 11:22:43 -07:00
Tim Chase 3bf657b90c Use vmem_free() in dfl_free() and add dfl_alloc()
This change was lost, somehow, in e5f9a9a.  Since the arrays can be
rather large, they need to be allocated with vmem_zalloc() via dfl_alloc()
and freed with vmem_free() via dfl_free().

The new dfl_alloc() function should be used to allocate objects of type
dkioc_free_list_t so that they're allocated from vmem.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Nikolay Borisov <kernel@kyup.com>
Closes #543
2016-04-26 11:20:14 -07:00
Chunwei Chen cdd39dd245 Use kernel provided mutex owner
To reduce mutex footprint, we detect the existence of owner in kernel mutex,
and rely on it if it exists.

Note that before Linux 3.0, mutex owner is of type thread_info. Also note
that, in Linux 3.18, the condition for owner is changed from
CONFIG_DEBUG_MUTEXES || CONFIG_SMP to
CONFIG_DEBUG_MUTEXES || CONFIG_MUTEX_SPIN_ON_OWNER

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #540
2016-04-25 17:04:07 -07:00
Dimitri John Ledkov 224817e2a8 Add support for s390[x].
Signed-off-by: Dimitri John Ledkov <xnox@ubuntu.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #537
2016-03-17 09:54:49 -07:00
Tim Chase 7bb5d92de8 Allow spawning a new thread for TQ_NOQUEUE dispatch with dynamic taskq
When a TQ_NOQUEUE dispatch is done on a dynamic taskq, allow another
thread to be spawned.  This will cause TQ_NOQUEUE to behave similarly
to how it does with non-dynamic taskqs.

Add support for TQ_NOQUEUE to taskq_dispatch_ent().

Signed-off-by: Tim Chase <tim@onlight.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #530
2016-03-17 09:52:35 -07:00
Brian Behlendorf a6ae97caed Add rw_tryupgrade()
This implementation of rw_tryupgrade() behaves slightly differently
from its counterparts on other platforms.  It drops the RW_READER lock
and then acquires the RW_WRITER lock leaving a small window where no
lock is held.  On other platforms the lock is never released during
the upgrade process.  This is necessary under Linux because the kernel
does not provide an upgrade function.

There are currently no callers in the ZFS code where this change in
behavior is a problem.  In fact, in most cases the code is already
written such that if the upgrade fails the RW_READER lock is dropped
and the caller blocks waiting to acquire the lock as RW_WRITER.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Matthew Thode <prometheanfire@gentoo.org>
Closes zfsonlinux/zfs#4388
Closes #534
2016-03-10 13:05:25 -08:00
Brian Behlendorf 47f9824781 Remove RPM package restriction
ZFS on Linux is regularly tested on arm, ppc, ppc64, i686 and x86_64
architectures.  Given this the artificial architecture restriction in
the packaging has been removed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2016-03-10 09:19:08 -08:00
Tom Caputi 18d2f56176 Changes to support zfs encryption
Unused modlinkage struct removed and ntohll functions added.

Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #533
2016-02-25 11:42:46 -08:00
Richard Yao 0b43696e66 random_get_pseudo_bytes() need not provide cryptographic strength entropy
Perf profiling of dd on a zvol revealed that my system spent 3.16% of
its time in random_get_pseudo_bytes(). No SPL consumers need
cryptographic strength entropy, so we can reduce our overhead by
changing the implementation to utilize a fast PRNG.

The Linux kernel did not export a suitable PRNG function until it
exported get_random_int() in Linux 3.10. While we could implement an
autotools check so that we use it when it is available or even try to
access the symbol on older kernels where it is not exported using the
fact that it is exported on newer ones as justification, we can instead
implement our own pseudo-random data generator. For this purpose, I have
written one based on a 128-bit pseudo-random number generator proposed
in a paper by Sebastiano Vigna that itself was based on work by the late
George Marsaglia.

http://vigna.di.unimi.it/ftp/papers/xorshiftplus.pdf
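
For reference, a userspace sketch of a xorshift128+ style generator. The
shift constants (23, 18, 5) are one published parameter set and are an
assumption here, since this message does not state which constants the
patch uses. The model_* names and the byte-filling helper are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* 128 bits of state; must not be all zero. */
struct model_xs128p { uint64_t s[2]; };

static uint64_t
model_xs128p_next(struct model_xs128p *st)
{
	uint64_t s1 = st->s[0];
	const uint64_t s0 = st->s[1];
	const uint64_t result = s0 + s1;

	st->s[0] = s0;
	s1 ^= s1 << 23;
	st->s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);

	return (result);
}

/* Fill a buffer 8 bytes at a time, with a short copy at the tail. */
static void
model_fill_bytes(struct model_xs128p *st, uint8_t *buf, size_t len)
{
	while (len > 0) {
		uint64_t r = model_xs128p_next(st);
		size_t n = (len < sizeof (r)) ? len : sizeof (r);

		memcpy(buf, &r, n);
		buf += n;
		len -= n;
	}
}
```

Each step is a handful of shifts, XORs and one addition on two 64-bit
words, which is why this family of generators is cheap compared with a
cryptographic source.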

Profiling the same benchmark with an earlier variant of this patch that
used a slightly different generator (roughly same number of
instructions) by the same author showed that time spent in
random_get_pseudo_bytes() dropped to 0.06%. That is a factor of 50
improvement. This particular generator algorithm is also well known to
be fast:

http://xorshift.di.unimi.it/#speed

The benchmark numbers there state that it runs at 1.12ns/64-bits or 7.14
GBps of throughput on an Intel Core i7-4770 in what is presumably a
single-threaded context. Using it in `random_get_pseudo_bytes()` in the
manner I have will probably not reach that level of performance, but it
should be fairly high and many times higher than the Linux
`get_random_bytes()` function that we use now, which runs at 16.3 MB/s
on my Intel Xeon E3-1276v3 processor when measured by using dd on
/dev/urandom.

Also, putting this generator's seed into per-CPU variables allows us to
eliminate overhead from both spin locks and CPU memory barriers, which
is NUMA friendly.

We could have alternatively modified consumers to use something like
`gethrtime() % 3` as suggested by both Matthew Ahrens and Tim Chase, but
that has a few potential problems that this approach avoids:

1. Switching to `gethrtime() % 3` in hot code paths today requires
diverging from illumos-gate and does nothing about potential future
patches from illumos-gate that call our slow `random_get_pseudo_bytes()`
in different hot code paths. Reimplementing `random_get_pseudo_bytes()`
with a per-CPU PRNG avoids both of those things entirely, which means
less work for us in the future.

2.  Looking at the code that implements `gethrtime()`, I think it is
unlikely to be faster than this per-CPU PRNG implementation of
`random_get_pseudo_bytes()`. It would be best to go with something fast
now so that there is no point in revisiting this from a performance
perspective.

3. `gethrtime() % 3` can vary in behavior from system to system based on
kernel version, architecture and clock source. In comparison, this
per-CPU PRNG is about 40 lines of code in `random_get_pseudo_bytes()`
that should behave consistently across all systems regardless of kernel
version, system architecture or machine clock source. It is unlikely
that we would ever need to revisit this per-CPU PRNG while the same
cannot be said for `gethrtime() % 3`.

4. `gethrtime()` uses CPU memory barriers and maybe atomic instructions
depending on the clock source, so replacing `random_get_pseudo_bytes()`
with `gethrtime()` in hot code paths could still require a future person
working on NUMA scalability to reimplement it anyway while this per-CPU
PRNG would not by virtue of using neither CPU memory barriers nor atomic
instructions. Note that I did not check various clock sources for the
presence of atomic instructions. There is simply too much code to read
and given the drawbacks versus this per-cpu PRNG, there is no point in
being certain.

5. I have heard of instances where poor quality pseudo-random numbers
caused problems for HPC code in ways that took more than a year to
identify and were remedied by switching to a higher quality source of
pseudo-random numbers. While filesystems are different than HPC code, I
do not think it is impossible for us to have instances where poor
quality pseudo-random numbers can cause problems. Opting for a well
studied PRNG algorithm that passes tests for statistical randomness over
changing callers to use `gethrtime() % 3` bypasses the need to think
about both whether poor quality pseudo-random numbers can cause problems
and the statistical quality of numbers from `gethrtime() % 3`.

6. `gethrtime()` calls `getrawmonotonic()`, which uses seqlocks. This is
probably not a huge issue, but anyone using kgdb would never be able to
step through a seqlock critical section, which is not a problem either
now or with the per-CPU PRNG:

https://en.wikipedia.org/wiki/Seqlock

The only downside that I can see is that this code's memory requirement
is O(N) where N is NR_CPUS, versus the current code and `gethrtime() %
3`, which are O(1), but that should not be a problem. The seeds will use
64KB of memory at the high end (i.e. `NR_CPU == 4096`) and 16 bytes of
memory at the low end (i.e. `NR_CPU == 1`).  In either case, we should
only use a few hundred bytes of code for text, especially since
`spl_rand_jump()` should be inlined into `spl_random_init()`, which
should be removed during early boot as part of "Freeing unused kernel
memory". In either case, the memory requirements are minuscule.
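
The two figures above follow from 16 bytes of generator state per CPU. A small sketch of that arithmetic, assuming two 64-bit words per CPU (`seed_bytes()` is a hypothetical helper, not part of the patch):

```c
#include <stddef.h>
#include <stdint.h>

/* Two 64-bit words of generator state per CPU, as in a 128-bit PRNG. */
size_t
seed_bytes(size_t ncpus)
{
	return (ncpus * 2 * sizeof (uint64_t));
}
```

With these assumptions, 4096 CPUs need 65536 bytes (64KB) and a single CPU needs 16 bytes.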

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #372
2016-02-17 09:49:09 -08:00
Chunwei Chen 8f3b403a73 Allow kicking a taskq to spawn more threads
This patch adds a module parameter, spl_taskq_kick. Writing a non-zero value
to it scans all taskqs; any taskq containing a task that has been pending for
more than 5 seconds is forced to spawn a new thread. This is intended as an
emergency recovery from deadlock, not a general solution.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #529
2016-02-05 14:08:31 -08:00
Chip Parker d112232f5e Ensure spl/ only occurs once in core-y
Update copy-builtin so it may be run multiple times against
the kernel source tree.  This change makes sed more discriminating
to ensure spl/ only occurs once in core-y.

Signed-off-by: Chip Parker <aparker@enthought.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #526
2016-01-26 11:54:24 -08:00
Brian Behlendorf 6b38e7510f Remove RLIM64_INFINITY assert in vn_rdwr()
Previous commit be29e6a updated kobj_read_file() so it no longer
unconditionally passes RLIM64_INFINITY.  The vn_rdwr() function
needs to be updated accordingly.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #513
2016-01-23 11:16:23 -08:00
Richard Yao be29e6a6e6 kobj_read_file: Return -1 on vn_rdwr() error
I noticed that the SPL implementation of kobj_read_file is not correct
after comparing it with the userland implementation of kobj_read_file()
in zfsonlinux/zfs#4104.

Note that we no longer pass RLIM64_INFINITY with this, but our vn_rdwr
implementation did not support it anyway, so there is no difference.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #513
2016-01-23 10:10:44 -08:00
Olaf Faaland 7323da1b2f Create spl-kmod-debuginfo rpm with redhat spec file
Correct the redhat specfile so that working debuginfo rpms are created
for the kernel modules.  The generic specfile already does the right
thing.

Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#4224
2016-01-21 11:22:02 -08:00
Chunwei Chen 16522ac290 Use tsd to store tq for taskq_member
taskq_member holds tq_lock while doing a linear search, which causes
contention. To prevent this, we store the taskq pointer to which the thread
belongs in tsd. This way taskq_member does not need to touch tq_lock, and tsd
has a per-slot spinlock, so contention should be greatly reduced.
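
A minimal userspace sketch of the idea, using C11 thread-local storage as a stand-in for the SPL tsd facility. All names here (`taskq_sketch`, `worker_attach()`, `is_member()`) are hypothetical and the real tsd API differs.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily reduced task queue. */
struct taskq_sketch {
	const char *name;
};

/* Thread-local slot: each worker records the queue it belongs to. */
static _Thread_local struct taskq_sketch *current_tq = NULL;

/* Called once by a worker thread when it starts serving tq. */
void
worker_attach(struct taskq_sketch *tq)
{
	current_tq = tq;
}

/* Membership test: one pointer comparison, no queue lock, no list walk. */
bool
is_member(const struct taskq_sketch *tq)
{
	return (current_tq == tq);
}
```

The lookup only reads data private to the calling thread, so it never contends with dispatchers that hold the queue lock.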

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #500
Closes #504
Closes #505
2016-01-20 13:07:45 -08:00
Brian Behlendorf de77e24590 Linux 4.5 compat: pfn_t typedef
The pfn_t typedef was inherited from Illumos but never directly
used by any SPL consumers.  This didn't cause any issues until
the Linux 4.5 kernel introduced a typedef of the same name.
See torvalds/linux/commit/34c0fd54.  This patch removes the
unused Illumos version to prevent a conflict.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes #524
2016-01-20 11:39:18 -08:00
Chunwei Chen d348f23a6a Turn on both PF_FSTRANS and PF_MEMALLOC_NOIO in spl_fstrans_mark
In b4ad50a, we abandoned memalloc_noio_save in favor of spl_fstrans_mark
because earlier kernels that provide it do not turn off __GFP_FS. However, on
newer kernels we would prefer PF_MEMALLOC_NOIO because it also covers
allocations made inside the kernel that we cannot otherwise control. So in this
patch, we turn on both PF_FSTRANS and PF_MEMALLOC_NOIO in spl_fstrans_mark.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #523
2016-01-20 11:38:31 -08:00
Chunwei Chen e843553d03 Don't hold mutex until release cv in cv_wait
If a thread is holding the mutex when doing cv_destroy, it might end up
waiting for a thread in cv_wait. The waiter would wake up trying to acquire
the same mutex and cause a deadlock.

We solve this by moving the mutex_enter to the bottom of cv_wait, so that
the waiter releases the cv first, allowing cv_destroy to succeed and have
a chance to free the mutex.

This creates a race condition on cv_mutex. We use xchg to set and check
it to ensure we are not harmed by the race. As a result, the cv_mutex
debugging becomes best-effort.

Also, the change reveals a race, which was unlikely before, where we call
mutex_destroy while test threads are still holding the mutex. We use
kthread_stop to make sure the threads have exited before mutex_destroy.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue zfsonlinux/zfs#4166
Issue zfsonlinux/zfs#4106
2016-01-12 15:18:44 -08:00
Brian Behlendorf d297a5a3a1 Add spl_kmem_cache_kmem_threads man page entry
The spl_kmem_cache_kmem_threads module option was accidentally
omitted from the documentation.  Add it.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #512
2016-01-12 15:04:37 -08:00
Alex McWhirter 466bcf3be5 _ILP32 is always defined on SPARC
Signed-off-by: Alex McWhirter <alexmcwhirter@triadic.us>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #520
2016-01-08 11:59:38 -08:00
Brian Behlendorf 2a552736b7 Fix do_div() types in condvar:timeout
The do_div() macro expects unsigned types and this is detected in
the powerpc implementation of do_div().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #516
2015-12-22 10:32:35 -08:00
Chunwei Chen b4ad50ac5f Use spl_fstrans_mark instead of memalloc_noio_save
In earlier kernel versions that provide memalloc_noio_save, it only turns
off __GFP_IO but leaves __GFP_FS untouched during direct reclaim. This
would cause threads to direct reclaim into ZFS and deadlock.

Instead, we should stick to using spl_fstrans_mark. Since we would
explicitly turn off both __GFP_IO and __GFP_FS before allocation, it
will work on every version of the kernel.

This impacts kernel versions 3.9-3.17, see upstream kernel commit
torvalds/linux@934f307 for reference.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #515
Issue zfsonlinux/zfs#4111
2015-12-18 13:24:52 -08:00
Tim Chase 200366f23f Provide kstat for taskqs
This patch provides 2 new kstats to display task queues:

  /proc/spl/taskqs-all - Display all task queues
  /proc/spl/taskqs - Display only "active" task queues

A task queue is considered to be "active" if it currently has active
(running) threads or if any of its pending, priority, delay or waitq
lists are not empty.
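
The "active" rule above can be sketched as a simple predicate. The struct and its field names are hypothetical counters chosen for illustration; the real taskq keeps linked lists rather than counts.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-queue counters standing in for the real lists. */
struct taskq_counts {
	size_t nactive;		/* threads currently running a task */
	size_t npending;	/* entries on the pending list */
	size_t nprio;		/* entries on the priority list */
	size_t ndelay;		/* entries on the delay list */
	size_t nwaiters;	/* tasks on the wait queue */
};

/* A queue is "active" if it has running threads or any non-empty list. */
bool
taskq_is_active(const struct taskq_counts *c)
{
	return (c->nactive > 0 || c->npending > 0 || c->nprio > 0 ||
	    c->ndelay > 0 || c->nwaiters > 0);
}
```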

If the task queue has running threads, displays each thread function's
address (symbolically, if possible) and its argument.

If the task queue has a non-empty list of pending, priority or delayed
task queue entries (taskq_ent_t), displays each entry's thread function
address and argument.

If the task queue has any waiters, displays each waiting task's pid.

Note: This patch also updates some comments in taskq.h which referred to
"taskq_t" when they should have referred to "taskq_ent_t".

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #491
2015-12-16 09:35:22 -08:00
Kamil Domanski e0ed96fa43 Skip GPL-only symbols test when cross-compiling
Signed-off-by: Kamil Domański <kamil@domanski.co>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/spl#507
Closes zfsonlinux/zfs#4075
2015-12-14 10:40:33 -08:00
Brian Behlendorf cb877e0ff2 Revert "Skip GPL-only symbols test when cross-compiling"
This reverts commit 61bbbd9a77 because
older versions of autoconf (2.63) do not support the cross-compile
argument to AC_RUN_IFELSE.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #507
2015-12-11 17:07:00 -08:00
Brian Behlendorf 2c4332cf79 Fix cstyle issues in spl-taskq.c and taskq.h
This patch only addresses the issues identified by the style checker.
It contains no functional changes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-12-11 16:20:22 -08:00
Chunwei Chen 066b89e685 Don't use tq->tq_lock_flags
The flags argument in spin_lock_irqsave is modified outside of spin_lock
context. We cannot use a shared variable like tq->tq_lock_flags for them. This
patch removes it and uses a local variable for the flags.

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #506
2015-12-11 16:20:03 -08:00
Olaf Faaland 326172d854 Subclass tq_lock to eliminate a lockdep warning
When taskq_dispatch() calls taskq_thread_spawn() to create a new thread
for a taskq, linux lockdep warns of possible recursive locking.  This is
a false positive.

One such call chain is as follows, when a taskq needs more threads:
	taskq_dispatch->taskq_thread_spawn->taskq_dispatch

The initial taskq_dispatch() holds tq_lock on the taskq that needed more
worker threads.  The later call into taskq_dispatch() takes
dynamic_taskq->tq_lock.  Without subclassing, lockdep believes these
could potentially be the same lock and complains.  A similar case occurs
when taskq_dispatch() then calls task_alloc().

This patch uses spin_lock_irqsave_nested() when taking tq_lock, with one
of two new lock subclasses:

subclass              taskq
TQ_LOCK_DYNAMIC       dynamic_taskq
TQ_LOCK_GENERAL       any other

Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #480
2015-12-11 16:19:56 -08:00
Olaf Faaland 628fc52137 Fix lockdep warning in spl_inode_{lock,unlock}
spl_inode_{lock,unlock} are triggering possible recursive locking
warnings from lockdep.  The warning is a false positive.

The lock is used to protect a parent directory during delete/add
operations, used in zfs when writing/removing the cache file.  The inode
lock is taken on both the parent inode and the file inode.

VFS provides an enum to subclass the lock.  This patch changes the
spin_lock call to the _nested version and uses the provided enum.

Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #480
2015-12-11 16:19:47 -08:00
Olaf Faaland 692ae8d398 Add new lock types MUTEX_NOLOCKDEP, and RW_NOLOCKDEP
When running a kernel with CONFIG_LOCKDEP=y, lockdep reports possible
recursive locking in some cases and possible circular locking dependency
in others, within the SPL and ZFS modules.

When lockdep detects these conditions, it disables further lock analysis
for all locks.  This causes /proc/lock_stats not to reflect full
information about lock contention, even in locks without dependency
issues.

This commit creates a new type of mutex, MUTEX_NOLOCKDEP.  This mutex
type causes subsequent attempts to take or release those locks to be
wrapped in lockdep_off() and lockdep_on().

This commit also creates an RW_NOLOCKDEP type analogous to
MUTEX_NOLOCKDEP.

MUTEX_NOLOCKDEP and RW_NOLOCKDEP are also defined in zfs, in a commit to
that repo, for userspace builds.

Signed-off-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #480
2015-12-11 16:18:54 -08:00
Kamil Domański 61bbbd9a77 Skip GPL-only symbols test when cross-compiling
This test depends on being able to execute the resulting binary
which will be impossible when cross-compiling.  Instead make a
worst case assumption which allows the build to continue as
recommended by the autoconf manual.

https://www.gnu.org/savannah-checkouts/gnu/autoconf/manual/autoconf-2.69/html_node/Runtime.html

Signed-off-by: Kamil Domański <kamil@domanski.co>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: tuxoko <tuxoko@gmail.com>
Closes zfsonlinux/spl#507
Closes zfsonlinux/zfs#4075
2015-12-11 15:27:56 -08:00
zgock 0da84d1574 Fix build issue on some configured kernels
The SPL fails to build with some "Configured" kernels (e.g. the openSUSE
Xen kernel).  With C compiler optimization, this change should produce
the same binaries.

Signed-off-by: zgock <zgock@nuc.base.zgock-lab.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #510
2015-12-11 15:27:53 -08:00
Brian Behlendorf 225c110675 Either _ILP32 or _LP64 must be defined
For some arm, powerpc, and sparc platforms it was possible that
neither _ILP32 nor _LP64 would be defined.  Update the isa_defs.h
header to explicitly set these macros and generate a compile error
in the case neither is defined.
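
A sketch of such a compile-time guard. Deriving the data model from `UINTPTR_MAX` is an assumption made for this example (the actual header keys off architecture macros), and `data_model_pointer_bits()` is a hypothetical helper.

```c
#include <stdint.h>

/* If the compiler did not predefine a data model, derive one from pointer width. */
#if !defined(_ILP32) && !defined(_LP64)
#if UINTPTR_MAX == 0xffffffffUL
#define	_ILP32
#elif UINTPTR_MAX == 0xffffffffffffffffULL
#define	_LP64
#endif
#endif

/* Refuse to build unless exactly one of the two macros is defined. */
#if defined(_ILP32) && defined(_LP64)
#error "Both _ILP32 and _LP64 are defined"
#elif !defined(_ILP32) && !defined(_LP64)
#error "Neither _ILP32 nor _LP64 is defined"
#endif

/* Report the pointer width implied by the selected data model. */
int
data_model_pointer_bits(void)
{
#if defined(_LP64)
	return (64);
#else
	return (32);
#endif
}
```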

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: tuxoko <tuxoko@gmail.com>
Issue zfsonlinux/zfs#4048
2015-12-10 11:53:29 -08:00
Brian Behlendorf c5a8b1e163 Revert "Make taskq_member() use ->journal_info"
This reverts commit a430c11f0b.  Using
journal_info like this can cause a BUG at kernel fs/jbd2/transaction.c:425!

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #500
2015-12-08 17:12:36 -08:00
Richard Yao a430c11f0b Make taskq_member() use ->journal_info
The ->journal_info pointer in the task_struct is reserved for use by
filesystems and because the kernel can have multiple file systems on the
same stack due to direct reclaim, each filesystem that touches
->journal_info in a callback function will save the value at the start
of its frame and restore it at the end of its frame.  This allows us to
safely use ->journal_info to store a pointer to the taskq's struct in
taskq threads so that ZFS code paths can detect the presence of a taskq.
This could break if the ZFS code were to use taskq_member from the
context of direct reclaim. However, there are no such uses of it in that
manner, so this is safe.

This eliminates an O(N) list traversal under a spinlock with an O(1)
unlocked pointer comparison.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: tuxoko <tuxoko@gmail.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #500
2015-12-08 13:24:47 -08:00
Richard Yao 1683e75edc Fix race between getf() and areleasef()
If a vnode is released asynchronously through areleasef(), it is
possible for the user process to reuse the file descriptor before
areleasef is called. When this happens, getf() will return a stale
reference, any operations in the kernel on that file descriptor will
fail (as it is closed) and the operations meant for that fd will
never occur from userspace's perspective.

We correct this by detecting this condition in getf(), doing a putf
on the old file handle, updating the file descriptor and proceeding
as if everything was fine. When the areleasef() is done, it will
harmlessly decrement the reference counter on the Illumos file handle.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #492
2015-12-03 15:44:47 -08:00
tuxoko d28c5c4f04 Prevent rm modules.* when make install
This was originally fixed in e80cd06b8e, but was later changed and no longer
works, which causes the following error:

modprobe: ERROR: ../libkmod/libkmod.c:506 lookup_builtin_file() could not open builtin file '/lib/modules/4.2.0-18-generic/modules.builtin.bin'

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #501
2015-12-02 14:38:20 -08:00
Boris Protopopov 5578f58bdc Add a script to display SPL slab cache statistics
Useful when looking for the info on ZFS/SPL related memory consumption.

Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #460
2015-12-02 14:08:08 -08:00
Tim Chase e5f9a9afd2 Additional dkio support for TRIM/Discard
Replace DKIOCTRIM with DKIOCFREE and add additional support required
for Nexenta's TRIM support.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #469
2015-12-02 13:44:35 -08:00
Dimitri John Ledkov 9f456111ab spl-kmem-cache: include linux/prefetch.h for prefetchw()
This is needed for architectures that do not have a builtin prefetchw()

Signed-off-by: Dimitri John Ledkov <xnox@ubuntu.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #502
2015-12-02 12:45:06 -08:00
Brian Behlendorf 4e6f996cdd Fix --enable-linux-builtin
Adding VPATH support, commit 37d7cd9, required that a `src`
and `obj` line be added to the top of the Makefiles.  They
must be removed from the Makefiles when builtin.

The code which adds the `spl/` directory to the top level
Makefile was failing due to the addition of the `certs/` path.
The search pattern has been adjusted to be more tolerant.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #481
Issue #498
2015-12-02 07:52:51 -08:00
Brian Behlendorf e7b75d9b46 Limit maximum object size in kmem tests
Limit the maximum object size to 1/128 of total system memory for
the kmem cache tests.  Large values can result in out of memory errors
for systems with less than 512M of memory.  Additionally, use the
known number of objects per-slab for calculating the number of
objects to use for a test.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-11-16 15:02:24 -08:00
loli10K 31f24932a4 Remove superfluous newline character
Remove superfluous `newline` character from spl_kmem_cache_magazine_size
module parameter description.

Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #499
2015-11-13 15:27:45 -08:00
Jason Zaman 8fc851b7b5 sysmacros: Make P2ROUNDUP not trigger int overflow
The original P2ROUNDUP and P2ROUNDUP_TYPED macros contain -x which
triggers PaX's integer overflow detection for unsigned integers.
Replace the macros with an equivalent version that does not trigger
the overflow.

Axioms:
A. (-(x)) === (~((x) - 1)) === (~(x) + 1) under two's complement.
B. ~(x & y) === ((~(x)) | (~(y))) under De Morgan's law.
C. ~(~x) === x under the law of double negation.

Proof:
0. (-(-(x) & -(align))) original
1. (~(-(x) & -(align)) + 1) by A
2. (((~(-(x))) | (~(-(align)))) + 1) by B
3. (((~(~((x) - 1))) | (~(~((align) - 1)))) + 1) by A
4. (((((x) - 1)) | (((align) - 1))) + 1) by C
Q.E.D.
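
The equivalence can also be checked empirically. A small sketch that compares the original form (step 0) with the rewritten form (step 4) over a range of values, assuming unsigned 64-bit operands and power-of-two alignments; the `_OLD`/`_NEW` macro names and `p2roundup_forms_agree()` are hypothetical.

```c
#include <stdint.h>

/* Original form: relies on negating an unsigned value. */
#define	P2ROUNDUP_OLD(x, align)	(-(-(x) & -(align)))
/* Rewritten form from step 4 of the proof: no negation. */
#define	P2ROUNDUP_NEW(x, align)	((((x) - 1) | ((align) - 1)) + 1)

/*
 * Returns 1 if both forms agree for every x in [0, limit) and every
 * power-of-two alignment from 1 to 2^16, otherwise 0.
 */
int
p2roundup_forms_agree(uint64_t limit)
{
	for (uint64_t align = 1; align <= (1ULL << 16); align <<= 1) {
		for (uint64_t x = 0; x < limit; x++) {
			if (P2ROUNDUP_OLD(x, align) != P2ROUNDUP_NEW(x, align))
				return (0);
		}
	}
	return (1);
}
```

For example, with an alignment of 8 the rewritten form maps 5 to 8, 8 to 8 and 9 to 16.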

Signed-off-by: Jason Zaman <jason@perfinion.com>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#2505
Closes #488
2015-11-13 15:21:52 -08:00
tuxoko f5f2b87df0 Fix taskq dynamic spawning
Currently taskq_dispatch() will spawn a new thread only on the condition that
the caller is also a member of the taskq. However, with this condition, a
deadlock can still occur when a task on tq1 is waiting for another thread that
is trying to dispatch a task on tq1. So this patch removes the check.

For example when you do:
zfs send pp/fs0@001 | zfs recv pp/fs0_copy

This will easily deadlock before this patch.

Also, move the seq_task check from taskq_thread_spawn() to taskq_thread()
because it's not used by the caller from taskq_dispatch().

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #496
2015-11-13 15:02:55 -08:00
Chunwei Chen 3e7e6f34d0 Don't call kmem_cache_shrink from shrinker
The Linux slab allocator automatically frees empty slabs when the number of
partial slabs exceeds min_partial, so we don't need to explicitly shrink it. In
fact, calling kmem_cache_shrink from the shrinker causes heavy contention on
kmem_cache_node->list_lock, to the point that it might cause __slab_free to
livelock (see zfsonlinux/zfs#3936).

Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#3936
Closes #487
2015-11-11 13:48:31 -08:00
Brian Behlendorf 9b13f65d28 Fix CPU hotplug
Allocate a kmem cache magazine for every possible CPU which might
be added to the system.  This ensures that when one of these CPUs
is enabled it can be safely used immediately.

For many systems the number of online CPUs is identical to the
number of present CPUs so this does not imply an increased memory
footprint.  In fact, dynamically allocating the array of magazine
pointers instead of using the worst case NR_CPUS can end up
decreasing our memory footprint.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #482
2015-10-13 09:50:40 -07:00
Chunwei Chen 374303a3c9 Use tab indent in rwlock.h
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #473
2015-10-02 11:21:35 -07:00
Chunwei Chen a00b3eb58f rwsem use kernel provided owner when possible
If CONFIG_RWSEM_SPIN_ON_OWNER is defined, rw_semaphore will have an owner
field, so we don't need our own.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #473
2015-10-02 11:21:32 -07:00
Chunwei Chen 4f8e643afe Don't take spin lock on rwlock owner
The spin lock around rw_owner is completely unnecessary. The reason is that it
is only modified in the down_write context. If you race against another thread
modifying it, that means that you aren't holding the rwlock, so taking the
spin lock doesn't eliminate the race.

Also, we only check rw_owner in RW_WRITE_HELD because spl_rwsem_is_locked
is unnecessary and might need to take the spin lock.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #473
2015-10-02 11:20:55 -07:00
Brian Behlendorf 3e1e4c735c Fix spl-dkms uninstall/update
Modern versions of dkms clean up the build directory after installing.
This resulted in 'dkms uninstall' never running because the check
added by commit 4cdcdbf, which verifies the existence of the
spl.release build product, would never be true.

This patch resolves the issue by updating the conditional to check
in the explicitly installed spl_config.h file for the version.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #478
2015-10-02 11:17:22 -07:00
Brian Behlendorf 2ebe396046 Fix PAX Patch/Grsec SLAB_USERCOPY panic
Support grsecurity/PaX kernel configurations where
CONFIG_PAX_USERCOPY_SLABS is enabled.  When this kernel option
is enabled, slabs which are used to copy between user and kernel
space must be created with SLAB_USERCOPY.

Stock Linux kernels do not have a SLAB_USERCOPY definition so
this causes no change in behavior for non-PAX-enabled kernels.

Verified-by: Wuffleton <null@wuffleton.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2977
Issue #3796
2015-09-28 09:18:29 -07:00
Brian Behlendorf f17d005bcc Tag spl-0.6.5
META file and release log updated.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-09-10 12:33:51 -07:00
Richard Yao d4bf6d8429 Disable direct reclaim in taskq worker threads on Linux 3.9+
Illumos does not have direct reclaim and code run inside taskq worker
threads is not designed to deal with it. Allowing direct reclaim inside
a worker thread can therefore deadlock. We set PF_MEMALLOC_NOIO through
memalloc_noio_save() to indicate to the kernel's reclaim code that we
are inside a context where memory allocations cannot be allowed to block
on filesystem activity.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#1274
Issue zfsonlinux/zfs#2390
Closes #474
2015-09-09 14:30:47 -07:00
Brian Behlendorf 4fa4cab972 Linux 4.2 compat: misc_deregister()
The misc_deregister() function was changed to a void return type.
Rather than add compatibility code to detect this change simply
ignore the return code on all kernels.  It was only used to log
an informational error message of no real value.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-09-01 09:20:45 -07:00
Tim Chase a64e55752f Create a new thread during recursive taskq dispatch if necessary
When dynamic taskq is enabled and all threads for a taskq are occupied,
a recursive dispatch can cause a deadlock if the calling thread depends on
the recursively-dispatched thread for its return condition.

This patch attempts to create a new thread for recursive dispatch when
none are available.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #472
2015-09-01 08:46:41 -07:00
Brian Behlendorf 801b56090b Revert "Create a new thread during recursive taskq dispatch if necessary"
This reverts commit 076821e due to a locking issue uncovered in
subsequent testing.  An ASSERT is hit due to tq->tq_nspawn being
updated outside the lock.  The patch will need to be reworked.

VERIFY3(0 == tq->tq_nspawn) failed (0 == -1)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #472
2015-08-31 17:03:01 -07:00
Tim Chase 076821eaff Create a new thread during recursive taskq dispatch if necessary
When dynamic taskq is enabled and all threads for a taskq are occupied,
a recursive dispatch can cause a deadlock if the calling thread depends on
the recursively-dispatched thread for its return condition.

This patch attempts to create a new thread for recursive dispatch when
none are available.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #472
2015-08-31 15:52:06 -07:00
Chunwei Chen ae89cf0f34 Restructure uio to accommodate bio_vec
Starting with Linux 4.1, a bio_vec may be passed into the filesystem via
iter_read/iter_write, so we add a bio_vec field to uio_t to hold it, and use
UIO_BVEC in segflg to indicate which "vec" is in use.

Also, to be consistent with newer kernels, we make iovec and bio_vec immutable,
and make uio act as an iterator, with the new uio_skip field indicating the
number of bytes to skip in the first segment.
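
A userspace sketch of the iterator idea: the segment array is never modified, and a separate skip counter records how far into the first remaining segment the cursor has advanced. The types and names (`seg`, `uio_sketch`, `uio_sketch_copyout()`) are hypothetical simplifications, not the real uio_t layout.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for struct iovec / struct bio_vec: treated as read-only. */
struct seg {
	const char *base;
	size_t len;
};

/* Iterator over immutable segments. */
struct uio_sketch {
	const struct seg *segs;	/* first segment not yet fully consumed */
	size_t nsegs;		/* segments remaining */
	size_t skip;		/* bytes already consumed in segs[0] */
	size_t resid;		/* total bytes remaining */
};

/* Copy up to n bytes out of the iterator, advancing only the cursor. */
size_t
uio_sketch_copyout(struct uio_sketch *u, char *dst, size_t n)
{
	size_t done = 0;

	while (done < n && u->nsegs > 0 && u->resid > 0) {
		const struct seg *s = u->segs;
		size_t avail = s->len - u->skip;
		size_t want = n - done;
		size_t cnt = avail < want ? avail : want;

		memcpy(dst + done, s->base + u->skip, cnt);
		done += cnt;
		u->skip += cnt;
		u->resid -= cnt;
		if (u->skip == s->len) {
			/* Segment exhausted: step to the next, reset skip. */
			u->segs++;
			u->nsegs--;
			u->skip = 0;
		}
	}
	return (done);
}
```

Because the segments themselves are untouched, the same array can be handed to another consumer or replayed after a partial transfer.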

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#3511
Issue zfsonlinux/zfs#3640
Closes #468
2015-08-24 10:10:21 -07:00
Brian Behlendorf ebc2c9374b Linux 4.2 compat: vfs_rename()
Attempting to perform a vfs_rename() on Linux 4.2 and newer kernels
results in an EACCES error.  Rather than attempting to add and
maintain more ugly compatibility code it's best to just retire
this interface.  As a first step the SPLAT test is disabled for
Linux 4.2 and newer kernels.

  vn_rename: Failed vn_rename /tmp/vn.tmp.1 -> /tmp/vn.tmp.2 (13)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#3653
2015-08-19 16:03:29 -07:00
Tim Chase 851a549305 Include other sources of freeable memory in the freemem calculation
Prevents ARC collapse when non-ZFS filesystems, the block layer or other
memory consumers use a lot of reclaimable memory in the page cache.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes zfsonlinux/zfs#3680
Closes #471
2015-08-19 09:25:30 -07:00
Brian Behlendorf 8ac6ffecaf Remove needfree, desfree, lotsfree #defines
This patch reverts 77ab5dd.  This is now possible because upstream has
refactored the ARC in such a way that these values are only used in a
few key places.  Those places have subsequently been updated to use
the equivalent Linux functionality.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#3637
2015-07-30 11:45:24 -07:00
Brian Behlendorf 9dc5ffbec8 Invert minclsyspri and maxclsyspri
On Linux the meaning of a processes priority is inverted with respect
to illumos.  High values on Linux indicate a _low_ priority while high
value on illumos indicate a _high_ priority.

In order to preserve the logical meaning of the minclsyspri and
maxclsyspri macros when they are used by the illumos wrapper functions
their values have been inverted.  This way when changes are merged
from upstream illumos we won't need to remember to invert the macro.
Having to invert it by hand could also lead to confusion.
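
As a rough illustration of the inversion, the sketch below reproduces the
Linux priority constants in user space and shows how the three macros could
map onto them.  The constant values and the prio_to_nice() helper are
assumptions made for this example and are not copied from the SPL headers.

```c
/* Assumed values of the Linux scheduler constants (lower number = higher priority). */
#define	MAX_RT_PRIO	100
#define	MAX_PRIO	140
#define	DEFAULT_PRIO	120	/* corresponds to nice 0 */

/* Logical meaning preserved: "max" is the most favorable priority on Linux. */
#define	minclsyspri	(MAX_PRIO - 1)
#define	maxclsyspri	(MAX_RT_PRIO)
#define	defclsyspri	(DEFAULT_PRIO)

/* Hypothetical helper: convert a kernel priority to a nice value. */
static int
prio_to_nice(int prio)
{
	return (prio - DEFAULT_PRIO);
}
```

With these assumed values maxclsyspri is numerically smaller than
minclsyspri, which is exactly the inversion described above.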

Note this change also reverts some of the priorities changes in prior
commit 62aa81a.  The rationale is as follows:

spl_kmem_cache    - High priority may result in blocked memory allocs
spl_system_taskq  - May perform I/O for file backed VDEVs
spl_dynamic_taskq - New taskq threads should be spawned promptly

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue zfsonlinux/zfs#3607
2015-07-28 13:59:03 -07:00
Brian Behlendorf 4699d76d19 Remove skc_ref from alloc/free paths
As described in spl_kmem_cache_destroy() the ->skc_ref count was
added to address the case of a cache reap or grow racing with a
destroy.  It is not strictly needed in the alloc/free paths
because consumers of the cache are responsible for not using it
while it's being destroyed.

Removing this code is desirable because there is some evidence that
contention on this atomic negatively impacts performance on large-scale
NUMA systems.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #463
2015-07-24 11:11:45 -07:00
Brian Behlendorf 62aa81a577 Add defclsyspri macro
Add a new defclsyspri macro which can be used to request the default
Linux scheduler priority.  Neither minclsyspri nor maxclsyspri maps
to the default Linux kernel thread priority.  This makes it awkward to
create taskqs which run with the same priority as the rest of the kernel
threads on the system which can lead to performance issues.

All SPL callers which previously used minclsyspri or maxclsyspri have
been changed to use defclsyspri.  The vast majority of callers were
part of the test suite which won't have an external impact.  The few
places where it could impact performance the change was from maxclsyspri
to defclsyspri.  This makes it more likely the process will be scheduled
which may help performance.

To facilitate further performance analysis the spl_taskq_thread_priority
module option has been added.  When disabled (0) all newly created kernel
threads will use the default kernel thread priority.  When enabled (1)
the specified taskq priority will be used.  By default this value is
enabled (1).

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-07-23 13:25:49 -07:00
Brian Behlendorf 9eb361aaa5 Default to --disable-debug-kmem
The default kmem debugging (--enable-debug-kmem) can severely impact
performance on large-scale NUMA systems due to the atomic operations
used in the memory accounting. A 32-thread fio test running on a
40-core 80-thread system and performing 100% cached reads gives the
following results with kmem debugging enabled and disabled:

Enabled:
READ: io=177071MB, aggrb=2951.2MB/s, minb=2951.2MB/s, maxb=2951.2MB/s,

Disabled:
READ: io=271454MB, aggrb=4524.4MB/s, minb=4524.4MB/s, maxb=4524.4MB/s,

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #463
2015-07-21 11:47:10 -07:00
Turbo Fredriksson 37d7cd94f3 Support parallel build trees (VPATH builds)
Build products from an out of tree build should be written
relative to the build directory.  Sources should be referred
to by their locations in the source directory.

This is accomplished by adding the 'src' and 'obj' variables
for the module Makefile.am, using relative paths to reference
source files, and by setting VPATH when source files are not
co-located with the Makefile.  This enables the following:

  $ mkdir build
  $ cd build
  $ ../configure
  $ make -s

This change also has the advantage of resolving the following
warning which is generated by modern versions of automake.

  Makefile.am:00: warning: source file 'xxx' is in a subdirectory,
  Makefile.am:00: but option 'subdir-objects' is disabled

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#1082
2015-07-17 12:53:11 -07:00
Brian Behlendorf 77ab5dd33a Add memory compatibility wrappers
The function vmem_qcache_reap() and global variables 'needfree',
'desfree', and 'lotsfree' are all used in the upstream.  While
these variables have no meaning under Linux they're being defined
as 0's to avoid needing to make additional changes to the ARC code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-06-29 09:26:29 -07:00
Brian Behlendorf 3c82160ff2 Set TASKQ_DYNAMIC for kmem and system taskqs
Add the TASKQ_DYNAMIC flag to the kmem_cache and system taskqs
to reduce the number of idle threads on the system.  Additional
threads will be created on demand up to the previous maximum
thread counts.  This should have minimal, if any, impact on
performance.

This makes the system taskq consistent with illumos which is
always created as a dynamic taskq with up to 64 threads.

The task limits for the kmem_cache have been increased to avoid
any unnecessary throttling and to keep a larger reserve of
task_t structures on the free list.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #458
2015-06-24 15:14:25 -07:00
Brian Behlendorf f7a973d99b Add TASKQ_DYNAMIC feature
Setting the TASKQ_DYNAMIC flag will create a taskq with dynamic
semantics.  Initially only a single worker thread will be created
to service tasks dispatched to the queue.  As additional threads
are needed they will be dynamically spawned up to the max number
specified by 'nthreads'.  When the threads are no longer needed,
because the taskq is empty, they will automatically terminate.

Due to the low cost of creating and destroying threads under Linux
by default new threads are spawned and terminated aggressively.
There are two module options which can be tuned to adjust this
behavior if needed.

* spl_taskq_thread_sequential - The number of sequential tasks,
without interruption, which need to be handled by a worker
thread before a new worker thread is spawned.  Default 4.

* spl_taskq_thread_dynamic - Provides the ability to completely
disable the use of dynamic taskqs on the system.  This is provided
for the purposes of debugging and troubleshooting.  Default 1
(enabled).

This behavior is fundamentally consistent with the dynamic taskq
implementation found in both illumos and FreeBSD.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #458
2015-06-24 15:14:18 -07:00
Brian Behlendorf 5acb2307b2 Add IMPLY() and EQUIV() macros
Added for upstream compatibility, they are of the form:

* IMPLY(a, b) - if (a) then (b)
* EQUIV(a, b) - if (a) then (b) *AND* if (b) then (a)
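
A minimal user-space sketch of the two forms follows.  It uses assert() as a
stand-in for the kernel assertion machinery, which is an assumption made for
this example; imply_holds() and equiv_holds() are hypothetical helpers that
expose the same conditions as plain booleans.

```c
#include <assert.h>
#include <stdbool.h>

/* if (a) then (b): only fails when a is true and b is false */
#define	IMPLY(a, b)	assert(!(a) || (b))

/* if (a) then (b) AND if (b) then (a): both truthy or both falsy */
#define	EQUIV(a, b)	assert(!!(a) == !!(b))

/* Hypothetical helpers returning the condition instead of asserting. */
static bool
imply_holds(bool a, bool b)
{
	return (!a || b);
}

static bool
equiv_holds(bool a, bool b)
{
	return (a == b);
}
```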

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-06-24 14:44:47 -07:00
Brian Behlendorf 2345368646 Rename cv_wait_interruptible() to cv_wait_sig()
Commit f752b46e added the cv_wait_interruptible() function to allow
condition variables to be woken by signals.  This function and its
timed wait counterpart should have been named cv_wait_sig() to match
the illumos interface which provides the same functionality.

This patch renames the symbol but leaves a #define compatibility
wrapper in place until the ZFS code can be moved to the correct
name.

This patch also makes a small number of cosmetic changes to make
the condvar source and header cstyle clean.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #456
2015-06-10 16:36:12 -07:00
Brian Behlendorf 86c16c59fe Retire rwsem_is_locked() compat
Stock Linux 2.6.32 and earlier kernels contained a broken version of
rwsem_is_locked() which could return an incorrect value.  Because of
this, compatibility code was added to detect the broken implementation
and replace it with our own if needed.

The fix for this issue was merged in to the mainline Linux kernel as
of 2.6.33 and the major enterprise distributions based on 2.6.32 have
all backported the fix.  Therefore there is no longer a need to carry
this code and it can be removed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #454
2015-06-10 16:35:48 -07:00
Chris Dunlop a876b0305e Make taskq_wait() block until the queue is empty
Under Illumos taskq_wait() returns when there are no more tasks
in the queue.  This behavior differs from ZoL and FreeBSD where
taskq_wait() returns when all the tasks in the queue at the
beginning of the taskq_wait() call are complete.  New tasks
added whilst taskq_wait() is running will be ignored.

This difference in semantics makes it possible that new subtle
issues could be introduced when porting changes from Illumos.
To avoid that possibility the taskq_wait() function is being
updated such that it blocks until the queue is empty.

The previous behavior remains available through the
taskq_wait_outstanding() interface.  Note that this function
was previously called taskq_wait_all() but has been renamed
to avoid confusion.
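
The difference between the two completion conditions can be sketched with a
simplified model of the queue state.  The structure and both predicates
below are hypothetical and exist only to illustrate the semantics; they do
not reflect the actual taskq implementation or its locking.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a task queue's state. */
struct sketch_taskq {
	size_t		pending;	/* tasks queued or still running */
	unsigned long	lowest_pending;	/* smallest task id still pending */
};

/* taskq_wait() semantics: done only when nothing at all is pending. */
static bool
sketch_wait_done(const struct sketch_taskq *tq)
{
	return (tq->pending == 0);
}

/*
 * taskq_wait_outstanding() semantics: done once every task with an id
 * up to snapshot_id (the newest id at the time of the call) has finished,
 * even if newer tasks are still pending.
 */
static bool
sketch_wait_outstanding_done(const struct sketch_taskq *tq,
    unsigned long snapshot_id)
{
	return (tq->pending == 0 || tq->lowest_pending > snapshot_id);
}
```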

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #455
2015-06-09 12:20:12 -07:00
Brian Behlendorf dc5e8b7041 Add boot_ncpus macro
For compatibility define boot_ncpus as num_online_cpus().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-05-21 09:58:01 -07:00
Brian Behlendorf 62e2eb2329 Fix cstyle issues in spl-tsd.c
This patch only addresses the issues identified by the style checker
in spl-tsd.c.  It contains no functional changes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-04-24 14:23:07 -07:00
Chunwei Chen 3d39d0afab Make tsd_set(key, NULL) remove the tsd entry for current thread
To prevent leaking tsd entries, we make tsd_set(key, NULL) remove the tsd
entry for the current thread. This is alright since tsd_get() returns NULL
when the entry doesn't exist.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #443
2015-04-24 14:15:22 -07:00
Richard Yao d3c677bcd3 Implement areleasef()
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #449
2015-04-24 13:02:37 -07:00
Richard Yao 313b1ea622 vn_getf/vn_releasef should not accept negative file descriptors
C type coercion rules require that negative numbers be converted into
positive numbers via wraparound, so that -1 becomes a large positive
value (UINT_MAX for an unsigned int). This causes vn_getf to return a
file handle when it should
return NULL whenever a positive file descriptor existed with the same
value. We should check for a negative file descriptor and return NULL
instead.
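
The check can be sketched in user space as follows.  The table and the
lookup_fd() function are hypothetical and only illustrate rejecting a
negative descriptor before any signed-to-unsigned conversion takes place;
they are not the vn_getf() implementation.

```c
#include <limits.h>
#include <stddef.h>

#define	SKETCH_TABLE_SIZE	4

/* Hypothetical descriptor table; slot 3 is intentionally empty. */
static const char *sketch_fd_table[SKETCH_TABLE_SIZE] = {
	"stdin", "stdout", "stderr", NULL
};

static const char *
lookup_fd(int fd)
{
	/* Reject negative values first; (unsigned int)-1 would be UINT_MAX. */
	if (fd < 0)
		return (NULL);

	if ((unsigned int)fd >= SKETCH_TABLE_SIZE)
		return (NULL);

	return (sketch_fd_table[fd]);
}
```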

This was caught by ClusterHQ's unit testing.

Reference:
http://stackoverflow.com/questions/50605/signed-to-unsigned-conversion-in-c-is-it-always-safe

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Andriy Gapon <avg@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #450
2015-04-24 13:02:00 -07:00
Brian Behlendorf cd69f020e4 Tag spl-0.6.4
META file and release log updated.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-04-08 14:03:42 -07:00
Brian Behlendorf 2a5d574eca Clear PF_FSTRANS over vfs_sync()
When layered on XFS the following warning will be emitted under CentOS7
when entering vfs_fsync() with PF_FSTRANS already set.  This is not an
issue for other stock Linux file systems and the warning was removed
for newer kernels.  However, to avoid triggering this error PF_FSTRANS
is cleared and then reset in vn_fsync().

WARNING: at fs/xfs/xfs_aops.c:968 xfs_vm_writepage+0x5ab/0x5c0

Call Trace:
 [<ffffffff8105dee1>] warn_slowpath_common+0x61/0x80
 [<ffffffffa01706fb>] xfs_vm_writepage+0x5ab/0x5c0 [xfs]
 [<ffffffff8114b833>] __writepage+0x13/0x50
 [<ffffffff8114c341>] write_cache_pages+0x251/0x4d0
 [<ffffffff8114c60d>] generic_writepages+0x4d/0x80
 [<ffffffffa016fc93>] xfs_vm_writepages+0x43/0x50 [xfs]
 [<ffffffff8114d68e>] do_writepages+0x1e/0x40
 [<ffffffff81142bd5>] __filemap_fdatawrite_range+0x65/0x80
 [<ffffffff81142cea>] filemap_write_and_wait_range+0x2a/0x70
 [<ffffffffa017a5b6>] xfs_file_fsync+0x66/0x1f0 [xfs]
 [<ffffffff811df54b>] vfs_fsync+0x2b/0x40
 [<ffffffffa03a88bd>] vn_fsync+0x2d/0x90 [spl]
 [<ffffffffa0520c33>] spa_config_sync+0x503/0x680 [zfs]
 [<ffffffffa0520ee4>] spa_config_update+0x134/0x170 [zfs]
 [<ffffffffa0520eba>] spa_config_update+0x10a/0x170 [zfs]
 [<ffffffffa051c54f>] spa_import+0x5bf/0x7b0 [zfs]
 [<ffffffffa055c754>] zfs_ioc_pool_import+0x104/0x150 [zfs]
 [<ffffffffa056294f>] zfsdev_ioctl+0x4cf/0x5c0 [zfs]
 [<ffffffffa0562480>] ? pool_status_check+0xf0/0xf0 [zfs]
 [<ffffffff811c2c85>] do_vfs_ioctl+0x2e5/0x4c0
 [<ffffffff811c2f01>] SyS_ioctl+0xa1/0xc0
 [<ffffffff815f3219>] system_call_fastpath+0x16/0x1b

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-04-07 15:03:47 -07:00
Tim Chase ae26dd0039 Don't allow shrinking a PF_FSTRANS context
Avoid deadlocks when entering the shrinker from a PF_FSTRANS context.

This patch also reverts commit d0d5dd7 which added MUTEX_FSTRANS.  Its
use has been deprecated within ZFS as it was an ineffective mechanism
to eliminate deadlocks.  Among other things, it introduced the need for
strict ordering of mutex locking and unlocking in order that the
PF_FSTRANS flag wouldn't be set incorrectly.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #446
2015-04-03 11:32:31 -07:00
Chris Dunlop c089961110 Add crgetzoneid() stub
Illumos 3897 introduces a dependency on crgetzoneid(). Stub it out until
such time as zones are implemented.

References:
  https://www.illumos.org/issues/3897
  https://github.com/illumos/illumos-gate/commit/fb7001f

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #444
2015-04-02 09:49:55 -07:00
Brian Behlendorf fade6b00b6 Add RHEL style kmod packages
Provide a Red Hat specific spl-kmod.spec file which uses the old style
kmods (not kmods2) packaging.  By using the provided kmodtool script
packages can be built which support weak modules.  This allows for the
kernel to be updated without having to rebuild the SPL kernel modules.

Packages for RHEL/CentOS/SL/TOSS which use this spec file can be built
as follows:

$ ./configure --with-spec=redhat
$ make rpms

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-03-27 14:42:04 -07:00
Brian Behlendorf 72998c2c9d Remove rpm/fedora directory
Originally it was thought that custom spec files might be required
for Fedora.  Happily that has turned out not to be the case.  Since
this directory just contains symlinks to the generic spec files it
can be removed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-03-27 14:22:38 -07:00
Hajo Möller a4f54cf036 Fix warning about AM_INIT_AUTOMAKE arguments
As of automake 1.14.2, currently shipped with Ubuntu 14.04, automake
warns about AM_INIT_AUTOMAKE having more than one argument:

configure.ac:41: warning: AM_INIT_AUTOMAKE: two- and three-arguments forms are deprecated.  For more info, see:
configure.ac:41: http://www.gnu.org/software/automake/manual/automake.html#Modernize-AM_005fINIT_005fAUTOMAKE-invocation

This commit fixes the warnings by following above link's advice, so
AM_INIT gets called with the package's name and version. As both are
defined in the META file we're parsing it with `grep`, `cut` and `tr`.

NOTE: Because autoconf < 1.14 does not support m4_esyscmd_s, m4_esyscmd
was used and `tr` was modified to strip newlines, too.

Signed-off-by: Hajo Möller <dasjoe@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #438
2015-03-25 11:16:08 -07:00
Tim Chase abb642b9a9 Set HAVE_FS_STRUCT_SPINLOCK correctly when CONFIG_FRAME_WARN==1024
If kernel lock debugging is enabled, the fs_struct structure exceeds the
typical 1024 byte limit of CONFIG_FRAME_WARN and HAVE_FS_STRUCT_SPINLOCK
isn't enabled when it otherwise should be.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #440
2015-03-24 13:25:25 -07:00
Tim Chase 79a0056e13 Add mutex_enter_nested() which maps to mutex_lock_nested()
Also add support for the "name" parameter in mutex_init().  The name
allows for better diagnostics, namely in /proc/lock_stat when
lock debugging is enabled.  Nested mutexes are necessary to support
CONFIG_PROVE_LOCKING. ZoL can use mutex_enter_nested()'s "class" argument
to convey the locking hierarchy.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #439
2015-03-20 13:53:31 -07:00
Brian Behlendorf 6ab08667a4 Reduce splat_taskq_test2_impl() stack frame size
Slightly increasing the size of a kmutex_t has caused us to exceed
the stack frame warning size in splat_taskq_test2_impl().  To address
this the tq_args have been moved to the heap.

  cc1: warnings being treated as errors
  spl-0.6.3/module/splat/splat-taskq.c:358:
  error: the frame size of 1040 bytes is larger than 1024 bytes

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #435
2015-03-03 10:18:31 -08:00
Brian Behlendorf d0d5dd7144 Add MUTEX_FSTRANS mutex type
There are regions in the ZFS code where it is desirable to be able
to set PF_FSTRANS while a specific mutex is held.  The ZFS code
could be updated to set/clear this flag in all the correct places,
but this is undesirable for a few reasons.

1) It would require changes to a significant amount of the ZFS
   code.  This would complicate applying patches from upstream.

2) It would be easy to accidentally miss a critical region in
   the initial patch or to have a future change introduce a
   new one.

Both of these concerns can be addressed by adding a new mutex type
which is responsible for managing PF_FSTRANS, support for which was
added to the SPL in commit 9099312 - Merge branch 'kmem-rework'.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #435
2015-03-03 10:18:24 -08:00
Brian Behlendorf 5f920fbee1 Retire MUTEX_OWNER checks
To minimize the size of a kmutex_t a MUTEX_OWNER check was added.
It allowed the kmutex_t wrapper to leverage the mutex owner which was
already stored in the mutex for certain kernel configurations.

The upside to this was that it reduced the size of the kmutex_t wrapper
structure by the size of a task_struct pointer (4/8 bytes).  The
downside was that two mutex implementations needed to be maintained.
Depending on your exact kernel configuration the correct one would
be selected.

Over the years this solution worked but it could be fragile since it
depended heavily on assumed kernel mutex implementation details.  For
example the SPL_AC_MUTEX_OWNER_TASK_STRUCT configure check needed to
be added when the kernel changed how the owner was stored.  It also
made the code more complicated than it needed to be.

Therefore, in the name of simplicity and portability this optimization
is being retired.  It will slightly increase the memory requirements
for a kmutex_t but only very slightly.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #435
2015-03-03 10:13:33 -08:00
Brian Behlendorf a900e28e71 Fix cstyle issue in mutex.h
This patch only addresses the issues identified by the style checker
in mutex.h.  It contains no functional changes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Issue #435
2015-03-03 10:13:25 -08:00
Brian Behlendorf c1bc8e610b Retire spl_module_init()/spl_module_fini()
In the original implementation of the SPL, wrappers were provided
for module initialization and cleanup.  This was done to abstract
away any compatibility code which might be needed for the SPL.

As it turned out the only significant compatibility issue was that
the default pwd during module load differed under Illumos and Linux.
Since this is such a minor thing and the wrappers complicate the
code they are being retired.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#2985
2015-02-27 13:43:39 -08:00
Chunwei Chen 086476f920 Fix spl_hostid module parameter
Currently, spl_hostid module parameter doesn't do anything, because it will
always be overwritten when calling into hostid_read().
Instead, we should only call into hostid_read() when spl_hostid is zero,
just as the comment describes.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #427
2015-02-04 16:42:25 -08:00
Brian Behlendorf c7db36a3c4 Optimize vmem_alloc() retry path
For performance reasons the reworked kmem code maps vmem_alloc() to
kmalloc_node() for allocations less than spl_kmem_alloc_max.  This
allows for more concurrency in the system and less contention of
the virtual address space.  Generally, this is a good thing.

However, in the case when the kmalloc_node() fails it makes little
sense to retry it using kmalloc_node() again.  It will likely fail
in exactly the same way.  A smarter strategy is to abandon this
optimization and retry using spl_vmalloc() which is very likely
to succeed.
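
The retry strategy can be sketched in user space as shown below.  The
sketch_fast_alloc() and sketch_slow_alloc() functions are hypothetical
stand-ins for kmalloc_node() and spl_vmalloc(), both backed by malloc(), and
the failure flag exists only to simulate an allocation failure.  The real
code path is more involved than this.

```c
#include <stddef.h>
#include <stdlib.h>

static int sketch_fast_fail;	/* non-zero simulates the fast path failing */
static int sketch_fast_calls;
static int sketch_slow_calls;

/* Stand-in for the kmalloc_node() fast path. */
static void *
sketch_fast_alloc(size_t size)
{
	sketch_fast_calls++;
	return (sketch_fast_fail ? NULL : malloc(size));
}

/* Stand-in for the spl_vmalloc() fallback. */
static void *
sketch_slow_alloc(size_t size)
{
	sketch_slow_calls++;
	return (malloc(size));
}

static void *
sketch_vmem_alloc(size_t size, size_t alloc_max)
{
	void *ptr = NULL;

	/* Try the fast path once for small allocations. */
	if (size <= alloc_max)
		ptr = sketch_fast_alloc(size);

	/* On failure do not repeat the fast path; switch allocators. */
	if (ptr == NULL)
		ptr = sketch_slow_alloc(size);

	return (ptr);
}
```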

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #428
2015-02-02 10:57:56 -08:00
Brian Behlendorf 54cccfc2e3 Fix GFP_KERNEL allocations flags
The kmem_vasprintf(), kmem_vsprintf(), kobj_open_file(), and vn_openat()
functions should all use the kmem_flags_convert() function to generate
the GFP_* flags.  This ensures that they can be safely called in any
context and the correct flags will be used.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #426
2015-01-21 15:25:19 -08:00
Brian Behlendorf 9099312977 Merge branch 'kmem-rework'
The core motivation behind these changes is to minimize the
memory management differences between ZFS on Linux and other
platforms.  This simplifies the process of porting changes to
Linux from other platforms.  This is good for code quality
and is expected to reduce the number of defects accidentally
introduced due to porting.

The key reason this is now possible is due to the addition of
Linux features such as the thread-specific PF_FSTRANS bit which
was introduced for XFS.

This patch stack also performs some refactoring and cleanup
designed to make the code more maintainable and understandable.
Finally, in the context of making and testing these changes
several bugs were identified and resolved resulting in a
more robust implementation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #414
2015-01-16 14:14:59 -08:00
Brian Behlendorf ee33517452 Use __get_free_pages() for emergency objects
The __get_free_pages() function must be used in place of kmalloc()
to ensure the __GFP_COMP flag is strictly honored.  This is due to
kmalloc() being layered on the generic Linux slab caches.  It
wasn't until recently that all caches were created using __GFP_COMP.
This means that it is possible for a kmalloc() which passed the
__GFP_COMP flag to be returned a non-compound allocation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 13:58:11 -08:00
Brian Behlendorf 436ad60faa Fix kmem cache deadlock logic
The kmem cache implementation always adds new slabs by dispatching a
task to the spl_kmem_cache taskq to perform the allocation.  This is
done because large slabs must be allocated using vmalloc().  It is
possible these allocations will block on IO because the GFP_NOIO flag
is not honored.  This can result in a deadlock.

Therefore, a deadlock detection strategy was implemented to deal with
this case.  When it is determined, by timeout, that the spl_kmem_cache
thread has deadlocked attempting to add a new slab, all callers
attempting to allocate from the cache fall back to using kmalloc()
which does honor all passed flags.

This logic was correct but an optimization in the code allowed for a
deadlock.  Because only slabs backed by vmalloc() can deadlock in the
way described above, an optimization was made to only invoke this
deadlock detection code for vmalloc() backed caches.  This had the
advantage of making it easy to distinguish these objects when they
were freed.

But this isn't strictly safe.  If all the spl_kmem_cache threads end
up deadlocked then we can't grow any of the other caches either.  This
can once again result in a deadlock if memory needs to be allocated
from one of these other caches to ensure forward progress.

The fix here is to remove the optimization which limits this fall back
allocation strategy to vmalloc() backed caches.  Doing this means we
may need to take the cache lock in the spl_kmem_cache_free() call path.
But this small cost can be mitigated by ignoring objects with virtual
addresses.

For good measure the default number of spl_kmem_cache threads has been
increased from 1 to 4, and made tunable.  This alone wouldn't resolve
the original issue since it's still possible for all the threads to be
deadlocked.  However, it does help responsiveness by ensuring that a
single deadlocked spl_kmem_cache thread doesn't block allocations from
other caches until the timeout is reached.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 13:55:09 -08:00
Brian Behlendorf 3018bffa9b Refine slab cache sizing
This change is designed to improve the memory utilization of
slabs by more carefully setting their size.  The way the code
currently works is problematic for slabs which contain large
objects (>1MB).  This is due to slabs being unconditionally
rounded up to a power of two which may result in unused space
at the end of the slab.

The reason the existing code rounds up every slab is because it
assumes it will be backed by the buddy allocator.  Since the buddy
allocator can only perform power of two allocations this is
desirable because it avoids wasting any space.  However, this
logic breaks down if the slab is backed by vmalloc() which operates
at a page level granularity.  In this case, the optimal thing to
do is calculate the minimum required slab size given certain
constraints (object size, alignment, objects/slab, etc).

Therefore, this patch reworks the spl_slab_size() function so
that it sizes KMC_KMEM slabs differently than KMC_VMEM slabs.
KMC_KMEM slabs are rounded up to the nearest power of two, and
KMC_VMEM slabs are allowed to be the minimum required size.
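
A rough sketch of the two sizing rules is shown below.  All names, the 4096
byte page size, and the flat "overhead" term are assumptions for this
example; rounding the vmalloc() case up to a whole page is inferred from the
page level granularity mentioned above and is not taken from
spl_slab_size() itself.

```c
#include <stddef.h>

#define	SKETCH_PAGE_SIZE	4096u

enum sketch_backing { SKETCH_KMEM, SKETCH_VMEM };

/* Smallest power of two >= n (n is assumed small enough not to overflow). */
static size_t
sketch_round_up_pow2(size_t n)
{
	size_t p = 1;

	while (p < n)
		p <<= 1;
	return (p);
}

/* Smallest multiple of the page size >= n. */
static size_t
sketch_round_up_page(size_t n)
{
	return ((n + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE *
	    SKETCH_PAGE_SIZE);
}

static size_t
sketch_slab_size(enum sketch_backing backing, size_t obj_size,
    size_t objs_per_slab, size_t overhead)
{
	size_t need = overhead + obj_size * objs_per_slab;

	/* Buddy allocator: power of two.  vmalloc(): whole pages only. */
	if (backing == SKETCH_KMEM)
		return (sketch_round_up_pow2(need));
	return (sketch_round_up_page(need));
}
```

For eight 1.25MB objects plus one page of overhead, the power of two rule
yields a 16MB slab while the page granular rule needs just over 10MB.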

This change also reduces the default number of objects per slab.
This reduces how much memory a single cache object can pin, which
can result in significant memory saving for highly fragmented
caches.  But depending on the workload it may result in slabs
being allocated and freed more frequently.  In practice, this
has been shown to be a better default for most workloads.

Also the maximum slab size has been reduced to 4MB on 32-bit
systems.  Due to the limited virtual address space it's critical
that we be as frugal as possible.  A limit of 4MB still lets us
reasonably comfortably allocate a limited number of 1MB objects.

Finally, the kmem:slab_small and kmem:slab_large SPLAT tests
were extended to provide better test coverage of various object
sizes and alignments.  Caches are created with random parameters
and their basic functionality is verified by allocating several
slabs worth of objects.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 13:55:09 -08:00
Brian Behlendorf e50e6cc958 Reduce kmem cache deadlock threshold
Reduce the threshold for detecting a kmem cache deadlock by 10x
from HZ to HZ/10.  The reduced value is still several orders of
magnitude large enough to avoid being triggered incorrectly.  By
reducing it we allow the system to resolve the issue more quickly.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 13:55:09 -08:00
Brian Behlendorf b1c3ae48a7 Update spl-module-parameters(5) man page
The spl-module-parameters(5) was not kept up to date.  Refresh
the man page so that it lists all the possible module options,
describes what they do, and justifies why the default values are
set the way they are.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 13:55:09 -08:00
Brian Behlendorf 1a20496834 Make slab reclaim more aggressive
Many people have noticed that the kmem cache implementation is slow
to release its memory.  This patch makes the reclaim behavior more
aggressive by immediately freeing a slab once it is empty.  Unused
objects which are cached in the magazines will still prevent a slab
from being freed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 13:55:09 -08:00
Richard Yao a988a35a93 Enforce architecture-specific barriers around clear_bit()
The comment above the Linux 3.16 kernel's clear_bit() states:

/**
 * clear_bit - Clears a bit in memory
 * @nr: Bit to clear
 * @addr: Address to start counting from
 *
 * clear_bit() is atomic and may not be reordered.  However, it does
 * not contain a memory barrier, so if it is used for locking purposes,
 * you should call smp_mb__before_atomic() and/or smp_mb__after_atomic()
 * in order to ensure changes are visible on other processors.
 */

This comment does not make sense in the context of x86 because x86 maps the
operations to barrier(), which is a compiler barrier. However, it does make
sense to me when I consider architectures that reorder around atomic
instructions. In such situations, a processor is allowed to execute the
wake_up_bit() before clear_bit() and we have a race. There are a few
architectures that suffer from this issue.

In such situations, the other processor would wake-up, see the bit is still
taken and go to sleep, while the one responsible for waking it up will
assume that it did its job and continue.

This patch implements a wrapper that maps smp_mb__{before,after}_atomic() to
smp_mb__{before,after}_clear_bit() on older kernels and changes our code to
leverage it in a manner consistent with the mainline kernel.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 13:55:09 -08:00
Richard Yao c2fa09454e Add hooks for disabling direct reclaim
The port of XFS to Linux introduced a thread-specific PF_FSTRANS bit
that is used to mark contexts which are processing transactions.  When
set, allocations in this context can dip into kernel memory reserves
to avoid deadlocks during writeback.  Linux 3.9 provided the additional
PF_MEMALLOC_NOIO for disabling __GFP_IO in page allocations, which XFS
began using in 3.15.

This patch implements hooks for marking transactions via PF_FSTRANS.
When an allocation is performed in the context of PF_FSTRANS, any
KM_SLEEP allocation is transparently converted to a GFP_NOIO allocation.
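
The conversion can be sketched as follows.  The flag values are invented for
this example and do not match the real KM_* or GFP_* definitions, and the
in_fstrans argument stands in for testing PF_FSTRANS on the current thread.

```c
/* Hypothetical bit values chosen only for this sketch. */
#define	SK_KM_SLEEP	0x0
#define	SK_KM_NOSLEEP	0x1

#define	SK_GFP_WAIT	0x10
#define	SK_GFP_IO	0x20
#define	SK_GFP_FS	0x40
#define	SK_GFP_ATOMIC	0x80
#define	SK_GFP_KERNEL	(SK_GFP_WAIT | SK_GFP_IO | SK_GFP_FS)
#define	SK_GFP_NOIO	(SK_GFP_WAIT)

static unsigned int
sketch_flags_convert(int km_flags, int in_fstrans)
{
	/* Non-sleeping allocations never enter reclaim that could block. */
	if (km_flags & SK_KM_NOSLEEP)
		return (SK_GFP_ATOMIC);

	/* Inside a transaction, strip the IO and FS bits from GFP_KERNEL. */
	if (in_fstrans)
		return (SK_GFP_KERNEL & ~(SK_GFP_IO | SK_GFP_FS));

	return (SK_GFP_KERNEL);
}
```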

Additionally, when using a Linux 3.9 or newer kernel, it will set
PF_MEMALLOC_NOIO to prevent direct reclaim from entering pageout() on
on any KM_PUSHPAGE or KM_NOSLEEP allocation.  This effectively allows
the spl_vmalloc() helper function to be used safely in a thread which
is responsible for IO.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 13:55:09 -08:00
Brian Behlendorf c3eabc75b1 Refactor generic memory allocation interfaces
This patch achieves the following goals:

1. It replaces the preprocessor kmem flag to gfp flag mapping with
   proper translation logic. This eliminates the potential for
   surprises that were previously possible where kmem flags were
   mapped to gfp flags.

2. It maps vmem_alloc() allocations to kmem_alloc() for allocations
   sized less than or equal to the newly-added spl_kmem_alloc_max
   parameter.  This ensures that small allocations will not contend
   on a single global lock, large allocations can still be handled,
   and potentially limited virtual address space will not be squandered.
   This behavior is entirely different than under Illumos due to
   different memory management strategies employed by the respective
   kernels.  However, this functionally provides the semantics required.

3. The --disable-debug-kmem, --enable-debug-kmem (default), and
   --enable-debug-kmem-tracking allocators have been unified into
   a single spl_kmem_alloc_impl() allocation function.  This was
   done to simplify the code and make it more maintainable.

4. Improve portability by exposing an implementation of the memory
   allocation functions that can be safely used in the same way
   they are used on Illumos.   Specifically, callers may safely
   use KM_SLEEP in contexts which perform filesystem IO.  This
   allows us to eliminate an entire class of Linux specific changes
   which were previously required to avoid deadlocking the system.
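
The size-based routing from goal 2 can be sketched as below.  The demo_*
names are hypothetical, and the threshold value is an assumption chosen for
illustration since the default of spl_kmem_alloc_max is not stated here.

```c
#include <stddef.h>

/* Assumed threshold standing in for the spl_kmem_alloc_max parameter. */
static size_t demo_kmem_alloc_max = 32 * 1024;

enum demo_backend { DEMO_BACKEND_KMEM, DEMO_BACKEND_VMEM };

/* Pick the backend a vmem_alloc() style request would be routed to. */
static enum demo_backend demo_vmem_backend(size_t size)
{
	if (size <= demo_kmem_alloc_max)
		return DEMO_BACKEND_KMEM;  /* small: avoid the global vmem lock */
	return DEMO_BACKEND_VMEM;          /* large: use virtual address space */
}
```

The comparison is inclusive, matching the "less than or equal to" wording
above.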

This change will be largely transparent to existing callers but there
are a few caveats:

1. Because the headers were refactored and extraneous includes removed
   callers may find they need to explicitly add additional #includes.
   In particular, kmem_cache.h must now be explicitly included to
   access the SPL's kmem cache implementation.  This behavior is
   different from Illumos but it was done to avoid always masking
   the Linux slab functions when kmem.h is included.

2. Callers, like Lustre, which made assumptions about the definitions
   of KM_SLEEP, KM_NOSLEEP, and KM_PUSHPAGE will need to be updated.
   Other callers such as ZFS which did not will not require changes.

3. KM_PUSHPAGE is no longer overloaded to imply GFP_NOIO.  It retains
   its original meaning of allowing allocations to access reserved
   memory.  KM_PUSHPAGE callers can be converted back to KM_SLEEP.

4. The KM_NODEBUG flag has been retired and the default warning
   threshold increased to 32k.

5. The kmem_virt() function has been removed.  For callers which
   need to distinguish between a physical and virtual address use
   is_vmalloc_addr().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 13:55:09 -08:00
Brian Behlendorf b34b95635a Fix kmem cstyle issues
Address all cstyle issues in the kmem, vmem, and kmem_cache source
and headers.  This was done to make it easier to review subsequent
changes which will rework the kmem/vmem implementation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 13:55:09 -08:00
Brian Behlendorf e5b9b344c7 Refactor existing code
This change introduces no functional changes to the memory management
interfaces.  It only restructures the existing code by separating the
kmem, vmem, and kmem cache implementations into separate source and
header files.

Splitting this functionality into separate files required the addition
of spl_vmem_{init,fini}() and spl_kmem_cache_{init,fini}() functions.

Additionally, several minor changes to the #include's were required to
accommodate the removal of extraneous headers from kmem.h.

But again, while large this patch introduces no functional changes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 13:55:08 -08:00
Richard Yao 6ecf6d7228 Revert "Add PF_NOFS debugging flag"
This reverts commit eb0f407a2b in
preparation for updating the kmem/vmem infrastructure to use the
PF_FSTRANS flag.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 13:55:08 -08:00
Tim Chase 47af4b76ff Use current_kernel_time() in the time compatibility wrappers
Since the Linux kernel's utimens family of functions uses
current_kernel_time(), we need to do the same in the context of ZFS
or else there can be discrepancies in timestamps (they go backward)
if userland code does:

	fd = creat(FNAME, 0600);
	(void) futimens(fd, NULL);

The getnstimeofday() function generally returns a slightly lower time
value.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#3006
2015-01-16 13:54:35 -08:00
Brian Behlendorf 03a783534a Fix debug object on stack warning
When running the SPLAT tests on a kernel with CONFIG_DEBUG_OBJECTS=y
enabled the following warning is generated.

  ODEBUG: object is on stack, but not annotated
  WARNING: at lib/debugobjects.c:300 __debug_object_init+0x221/0x480()

This is caused by the test cases placing a debug object on the stack
rather than the heap.  This isn't harmful since they are small objects
but to make CONFIG_DEBUG_OBJECTS=y happy the objects have been relocated
to the heap.  This impacted taskq tests 1, 3, and 7.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #424
2015-01-07 13:52:20 -08:00
Chunwei Chen a3c1eb7772 mutex: force serialization on mutex_exit() to fix races
It is known that mutexes in Linux are not safe when using them to
synchronize the freeing of the object in which the mutex is embedded:

http://lwn.net/Articles/575477/

The known places in ZFS which are suspected to suffer from the race
condition are zio->io_lock and dbuf->db_mtx.

* zio uses zio->io_lock and zio->io_cv to synchronize freeing
  between zio_wait() and zio_done().
* dbuf uses dbuf->db_mtx to protect reference counting.

This patch fixes this kind of race by forcing serialization on
mutex_exit() with a spin lock, making the mutex safe by sacrificing
a bit of performance and memory overhead.
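
A userspace model of the idea, assuming a pthread mutex stands in for the
kernel mutex and a C11 atomic_flag stands in for the spin lock.  The demo_*
names are hypothetical, and pthread mutexes do not themselves have this
race; the sketch only shows the shape of the serialized exit path.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Hypothetical wrapper: a sleeping mutex plus an exit-path spin lock. */
typedef struct demo_kmutex {
	pthread_mutex_t m_mutex;  /* the underlying sleeping mutex */
	atomic_flag     m_exit;   /* spin lock serializing unlockers */
} demo_kmutex_t;

static void demo_mutex_init(demo_kmutex_t *mp)
{
	pthread_mutex_init(&mp->m_mutex, NULL);
	atomic_flag_clear(&mp->m_exit);
}

static void demo_mutex_enter(demo_kmutex_t *mp)
{
	pthread_mutex_lock(&mp->m_mutex);
}

static void demo_mutex_exit(demo_kmutex_t *mp)
{
	/* Spin until no other thread is still inside the unlock path. */
	while (atomic_flag_test_and_set_explicit(&mp->m_exit,
	    memory_order_acquire))
		;
	pthread_mutex_unlock(&mp->m_mutex);
	atomic_flag_clear_explicit(&mp->m_exit, memory_order_release);
}
```

A thread that is about to free the enclosing object must first pass through
its own demo_mutex_exit(), which cannot finish until the previous holder has
fully left the unlock path.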

This issue most commonly manifests itself as a deadlock in the zio
pipeline caused by a process spinning on the damaged mutex.  Similar
deadlocks have been reported for the dbuf->db_mtx mutex.  And it can
also cause a NULL dereference or bad paging request under the right
circumstances.

This issue and many like it are linked off the zfsonlinux/zfs#2523
issue.  Specifically this fix resolves at least the following
outstanding issues:

zfsonlinux/zfs#401
zfsonlinux/zfs#2523
zfsonlinux/zfs#2679
zfsonlinux/zfs#2684
zfsonlinux/zfs#2704
zfsonlinux/zfs#2708
zfsonlinux/zfs#2517
zfsonlinux/zfs#2827
zfsonlinux/zfs#2850
zfsonlinux/zfs#2891
zfsonlinux/zfs#2897
zfsonlinux/zfs#2247
zfsonlinux/zfs#2939

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #421
2014-12-19 10:18:47 -08:00
Ned Bass 52479ecf58 Remove compat includes from sys/types.h
Don't include the compatibility code in linux/*_compat.h in the public
header sys/types.h. This causes problems when an external code base
includes the ZFS headers and has its own conflicting compatibility code.
Lustre, in particular, defined SHRINK_STOP for compatibility with
pre-3.12 kernels in a way that conflicted with the SPL's definition.
Because Lustre ZFS OSD includes ZFS headers it fails to build due to a
'"SHRINK_STOP" redefined' compiler warning.  To avoid such conflicts
only include the compat headers from .c files or private headers.

Also, for consistency, include sys/*.h before linux/*.h then sort by
header name.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #411
2014-11-19 10:35:12 -08:00
Brian Behlendorf 8d9a23e82c Retire legacy debugging infrastructure
When the SPL was originally written Linux tracepoints were still
in their infancy.  Therefore, an entire debugging subsystem was
added to facilitate tracing which served us well for many years.

Now that Linux tracepoints have matured they provide all the
functionality of the previous tracing subsystem.  Rather than
maintain parallel functionality it makes sense to fully adopt
tracepoints.  Therefore, this patch retires the legacy debugging
infrastructure.

See zfsonlinux/zfs@bc9f413 for the tracepoint changes.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #408
2014-11-19 10:35:07 -08:00
Brian Behlendorf 917fef2732 Lower minimum objects/slab threshold
As long as we can fit a minimum of one object/slab there's no reason
to prevent the creation of the cache.  This effectively pushes the
maximum object size up to 32MB.  The splat cache tests were extended
accordingly to verify this functionality.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-11-05 10:08:21 -08:00
Marcel Wysocki 7f118e836e Add config/compile to config/.gitignore
This file may be added by automake and therefore should be added
to config/.gitignore.  For the full list of possible auxiliary
programs see the full automake documentation.

http://www.gnu.org/software/automake/manual/automake.html#Auxiliary-Programs

Signed-off-by: Marcel Wysocki <maci.stgn@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-31 16:26:44 -07:00
Alexander Pyhalov 3f4a13c497 Fix modules installation directory
When building zfs modules with kernel, compiled from deb.src, the
packaging process ends up installing the modules in the wrong place.

Signed-off-by: Alexander Pyhalov <apyhalov@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#2822
2014-10-28 09:49:24 -07:00
Richard Yao fd05dde75d Kernel header installation should respect --prefix
This is the upstream component of work that enables preliminary support
for building Gentoo's ZFS packaging on other Linux systems via Gentoo
Prefix.

Signed-off-by: Richard Yao <richard.yao@clusterhq.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #384
2014-10-28 09:31:48 -07:00
Richard Yao ad9863e80b kmem_cache: Call constructor/destructor on each alloc/free
This has a few benefits. First, it fixes a regression that "Rework
generic memory allocation interfaces" appears to have triggered in
splat's slab_reap and slab_age tests. Second, it makes porting code from
Illumos to ZFSOnLinux easier. Third, it has the side effect of making
reclaim from slab caches that specify reclaim functions an order of
magnitude faster. The splat slab_reap test usually took 30 to 40
seconds. With this change, it takes 3 to 4.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #369
2014-10-28 09:21:08 -07:00
Tim Chase 802a4a2ad5 Linux 3.12 compat: shrinker semantics
The new shrinker API as of Linux 3.12 modifies "struct shrinker" by
replacing the @shrink callback with the pair of @count_objects and
@scan_objects.  It also requires @scan_objects to return the number of
objects actually freed, whereas the previous @shrink callback returned
the number of remaining freeable objects.

This patch adds support for the new @scan_objects return value semantics
and updates the splat shrinker test case appropriately.
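
The difference in return value semantics can be illustrated with a small
userspace model.  The demo_* names and the cache structure are hypothetical;
the sketch only shows how a scan-style result (objects freed) can be derived
from an old-style callback (objects remaining).

```c
/* Hypothetical cache model: 'objects' is the number of freeable objects. */
struct demo_cache {
	unsigned long objects;
};

/* Old style @shrink: free up to nr_to_scan objects, return how many remain. */
static unsigned long demo_old_shrink(struct demo_cache *c,
    unsigned long nr_to_scan)
{
	unsigned long n = (nr_to_scan < c->objects) ? nr_to_scan : c->objects;

	c->objects -= n;
	return c->objects;
}

/* New style @count_objects: report freeable objects without freeing any. */
static unsigned long demo_count_objects(const struct demo_cache *c)
{
	return c->objects;
}

/* New style @scan_objects: free objects and return how many were freed. */
static unsigned long demo_scan_objects(struct demo_cache *c,
    unsigned long nr_to_scan)
{
	unsigned long before = c->objects;
	unsigned long after = demo_old_shrink(c, nr_to_scan);

	return before - after;
}
```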

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #403
2014-10-28 09:20:13 -07:00
Brian Behlendorf 46c936756e Merge branch 'cleanup'
Over the years the SPL code base has accumulated compatibility code
to allow it to build against a wide range of Linux kernels. In
general this is desirable because it makes the code flexible.
However, once support for these old kernels is no longer needed
and is no longer being actively tested it should be removed. This
helps keep the code simple and understandable.

The spl-0.6.x releases have supported kernels all the way back to
2.6.26. This patch stack moves that cut off up to 2.6.32 and newer
kernels. This ensures we still support all the major enterprise
distributions which are largely locked in to 2.6.32 based kernels.
And at the same time we can shed a large amount of compatibility
code which simplifies maintenance and new development.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #395
2014-10-20 08:56:50 -07:00
Brian Behlendorf dcf91382b9 Remove vfs_fsync() wrapper
The vfs_fsync() function has been available since Linux 2.6.29.
There is no longer a need to maintain this compatibility code.
However, the HAVE_2ARGS_VFS_FSYNC check was left in place
since that change occurred after 2.6.32.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:52 -07:00
Brian Behlendorf 599662c538 Remove kern_path() wrapper
The kern_path() function has been available since Linux 2.6.28.
There is no longer a need to maintain this compatibility code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:52 -07:00
Brian Behlendorf 3d5392cefa Remove kvasprintf() wrapper
The kvasprintf() function has been available since Linux 2.6.22.
There is no longer a need to maintain this compatibility code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:52 -07:00
Brian Behlendorf 0fac9c9e6d Remove proc_handler() wrapper
As of Linux 2.6.32 the proc handlers were updated to expect only
five arguments.  Therefore there is no longer a need to maintain
this compatibility code and this infrastructure can be simplified.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:52 -07:00
Brian Behlendorf e03119e86f Update put_task_struct() comments
Update the comments to correctly reflect when this interface was
added.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:51 -07:00
Brian Behlendorf 68a829b29d Remove credential configure checks.
The groups_search() function was never exported by a mainline kernel
therefore we drop this compatibility code and always provide our own
implementation.

Additionally, the cred_t structure has been available since 2.6.29
so there is no longer a need to maintain compatibility code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:51 -07:00
Brian Behlendorf e39174ed56 Add vfs_unlink() and vfs_rename() comments
Just for consistency with the other autoconf checks a small comment
block was added before these checks.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:51 -07:00
Brian Behlendorf 137af025f6 Remove set_fs_pwd() configure check
This function has never been exported by any mainline kernel and was only
briefly available under RHEL5.  Therefore this check is being removed
and the code updated to always use the wrapper function.

The next step will be to eliminate all this code.  If ZFS were updated
not to assume that its pwd was / there would be no need for this.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:51 -07:00
Brian Behlendorf 3c49a16989 Remove user_path_dir() wrapper
The user_path_dir() function has been available since Linux 2.6.27.
There is no longer a need to maintain this compatibility code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:51 -07:00
Brian Behlendorf 44778f4110 Remove kallsyms_lookup_name() wrapper
After the removal of get_vmalloc_info(), the unused global memory
variables, and the optional dcache/icache shrinkers there is no
longer a need for the kallsyms compatibility code.  This allows
us to eliminate another brittle area of the code by removing the
kernel upcall this functionality depended on for older kernels.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:51 -07:00
Brian Behlendorf 89a461e70c Remove shrink_{i,d}node_cache() wrappers
This is optional functionality which may or may not be useful to
ZFS when using older kernels.  It is never a hard requirement.
Therefore this functionality is being removed from the SPL and
a simpler slimmed down version will be added to ZFS.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:51 -07:00
Brian Behlendorf 8bbbe46f86 Remove global memory variables
Platforms such as Illumos and FreeBSD have historically provided
global variables which summarize the memory state of a system.
Linux on the other hand doesn't expose any of this information
to kernel modules and uses entirely different mechanisms for
memory management.

In order to simplify the original ZFS port to Linux these global
variables were emulated by the SPL for the benefit of ZFS.  As ZoL
has matured over the years it has moved steadily away from these
interfaces and now no longer depends on them at all.

Therefore, this patch completely removes the global variables
availrmem, minfree, desfree, lotsfree, needfree, swapfs_minfree,
and swapfs_reserve.  This greatly simplifies the memory management
code and eliminates a common area of confusion.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:51 -07:00
Brian Behlendorf e1310afae3 Remove get_vmalloc_info() wrapper
The get_vmalloc_info() function was used to back the vmem_size()
function.  This was always problematic and resulted in brittle
code because the kernel never provided a clean interface for
modules.

However, it turns out that the only caller of this function in
ZFS uses it to determine the total virtual address space size.
This can be determined easily without get_vmalloc_info() so
vmem_size() has been updated to take this approach which allows
us to shed the get_vmalloc_info() dependency.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:51 -07:00
Brian Behlendorf 50e41ab1e1 Remove on_each_cpu() wrapper
The on_each_cpu() function has been available since Linux 2.6.27.
There is no longer a need to maintain this compatibility code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:51 -07:00
Brian Behlendorf b652d169b0 Remove mutex_lock_nested() wrapper
The mutex_lock_nested() function has been available since Linux 2.6.18.
There is no longer a need to maintain this compatibility code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:51 -07:00
Brian Behlendorf 2bc5666f53 Remove i_mutex() configure check
The inode structure has used i_mutex as its internal locking
primitive since 2.6.16.  The compatibility code to check for
the previous semaphore primitive has been removed.  However,
the wrapper function itself is being kept because it's entirely
possible this primitive will change again to allow finer grained
locking.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:51 -07:00
Brian Behlendorf 9f36cace41 Remove kmalloc_node() compatibility code
The kmalloc_node() function has been available since Linux 2.6.12.
There is no longer a need to maintain this compatibility code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:51 -07:00
Brian Behlendorf d227e114ed Remove linux/uaccess.h header check
The uaccess header has been available in the same location since
Linux 2.6.18.  There is no longer a need to maintain this
compatibility code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:51 -07:00
Brian Behlendorf e5b65e3179 Remove uintptr_t typedef
The uintptr_t typedef has been available since Linux 2.6.24.
There is no longer a need to maintain this compatibility code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:50 -07:00
Brian Behlendorf ff0582cb39 Remove atomic64_xchg() wrappers
The atomic64_xchg() and atomic64_cmpxchg() functions have been
available since Linux 2.6.24.  There is no longer a need to
maintain this compatibility code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:50 -07:00
Brian Behlendorf 82f2f1a3af Simplify the time compatibility wrappers
Many of the time functions had grown overly complex in order to
handle kernel compatibility issues.  However, as of Linux 2.6.26
all the required functionality is available.  This allows us to
retire numerous configure checks and greatly simplify the time
compatibility wrappers.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:50 -07:00
Brian Behlendorf 87f8055a91 Map highbit64() to fls64()
The fls64() function has been available since Linux 2.6.16 and
it should be used to implement highbit64().  This allows us
to provide an optimized implementation and simplify the code.
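
As a userspace illustration of the mapping, fls64() can be approximated with
a compiler builtin.  The demo_* names are hypothetical, and the use of
__builtin_clzll() assumes a GCC or Clang compatible compiler.

```c
#include <stdint.h>

/*
 * Userspace stand-in for the kernel's fls64(): the 1-based index of the
 * most significant set bit, or 0 when no bit is set.
 */
static int demo_fls64(uint64_t x)
{
	return x ? 64 - __builtin_clzll(x) : 0;
}

/* highbit64() then becomes a direct call to the fls64() equivalent. */
static int demo_highbit64(uint64_t x)
{
	return demo_fls64(x);
}
```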

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:50 -07:00
Brian Behlendorf 9c91800d19 Remove CTL_UNNUMBERED sysctl interface
Support for the CTL_UNNUMBERED sysctl interface was removed in
Linux 2.6.19.  There is no longer any reason to maintain this
compatibility code.  There also isn't any reason to keep around
the CTL_NAME macro and helpers so they have been retired.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:50 -07:00
Brian Behlendorf b38bf6a4e3 Remove register_sysctl() compatibility code
The register_sysctl() interface has been stable since Linux 2.6.21.
There is no longer a need to maintain compatibility code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:50 -07:00
Brian Behlendorf bb4dee3df2 Remove utsname() wrapper
There is no longer a need to wrap this because utsname() is provided
by the kernel and can be called directly.  This will require a small
change in the ZFS code because utsname is expected to be a global
structure and not a function.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:11:41 -07:00
Brian Behlendorf aa363c5c05 Remove sysctl_vfs_cache_pressure assumption
The generic SPL cache shrinkers make the assumption that the
caches only contain VFS cache data and therefore should be scaled
based on vfs_cache_pressure.  This is not strictly true and it
should not be assumed.

Removing this tuning should not have any impact on the stock
behavior because vfs_cache_pressure=100 by default.  This means
that no scaling will take place.
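
The arithmetic behind this can be sketched as follows.  The exact formula
used by the removed code is not given here, so the percentage-style scaling
below is an assumption and demo_scale_by_pressure() is a hypothetical name.

```c
/*
 * Scale an object count by a vfs_cache_pressure style percentage.
 * A pressure of 100 is the neutral setting and leaves the count unchanged.
 */
static unsigned long long demo_scale_by_pressure(unsigned long long count,
    unsigned long long pressure)
{
	return (count * pressure) / 100;
}
```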

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:07:28 -07:00
Brian Behlendorf a80d69caf0 Remove adaptive mutex implementation
Since the Linux 2.6.29 kernel all mutexes have been adaptive mutexes.
There is no longer any point in keeping this code so it is being
removed to simplify the code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:07:28 -07:00
Brian Behlendorf 56cfabd3e8 Remove patches directory
There is no longer a need to carry these stale patches in the
SPL source tree.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:07:28 -07:00
Brian Behlendorf 3a92530563 Update code to use misc_register()/misc_deregister()
When the SPL was originally written it was designed to use the
device_create() and device_destroy() functions.  Unfortunately,
these functions changed considerably over the years making them
difficult to rely on.

As it turns out a better choice would have been to use the
misc_register()/misc_deregister() functions.  This interface
for registering character devices has remained stable, is simple,
and provides everything we need.

Therefore the code has been reworked to use this interface.  The
higher level ZFS code has always depended on these same interfaces
so this is also a step towards minimizing our kernel dependencies.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:07:28 -07:00
Brian Behlendorf 0cb3dafccd Update SPLAT to use kmutex_t for portability
For consistency throughout the code update the SPLAT infrastructure
to use the wrapped mutex interfaces.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:07:28 -07:00
Brian Behlendorf 6203295438 Make license compatibility checks consistent
Apply the license specified in the META file to ensure the
compatibility checks are all performed consistently.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-17 15:07:28 -07:00
Tom Prince de2a22fcb3 Install header during post-build rather than post-install.
New versions of dkms clean up the build directory after installing.

It appears that this was always intended, but had rm -rf "/path/to/build/*"
(note the quotes), which prevented it from working.

Also, the build step is already installing stuff into the directory where
these files go, so installing our stuff there as part of build rather than
install makes sense.

Signed-off-by: Tom Prince <tom.prince@clusterhq.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #399
2014-10-09 12:00:25 -07:00
Brian Behlendorf 81857a34d1 Fix bug in SPLAT taskq:front
While running SPLAT on a kernel with CONFIG_DEBUG_ATOMIC_SLEEP
enabled the taskq:front test was flagged as one which might sleep
in an unsafe context.  Specifically, the splat_vprint()
function which internally takes a mutex was being called under
a spin lock.  Moving the log function outside the spin lock
cleanly solves this issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-10-03 10:42:20 -07:00
Turbo Fredriksson e3020723dc Linux 3.16 compat: smp_mb__after_clear_bit()
The smp_mb__{before,after}_clear_bit functions have been renamed
smp_mb__{before,after}_atomic.  Rather than adding a compatibility
function to handle this the code has been updated to use smp_wmb().

This has the advantage of being a stable functionally equivalent
interface.  On many architectures smp_mb__after_clear_bit() expands
to smp_wmb().  Others might be able to do something slightly more
efficient but this will be safe and correct on all of them.

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #386
2014-09-22 16:24:55 -07:00
stf f9bde4f74b Avoid PAGESIZE redefinition
Add #ifndef PAGESIZE to avoid redefinition warning on platforms
where this value is already provided.

Signed-off-by: stf <s@ctrlc.hu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #382
2014-08-18 08:55:41 -07:00
Richard Yao ec18fe3ce8 Cleanup vn_rename() and vn_remove()
zfsonlinux/spl#bcb15891ab394e11615eee08bba1fd85ac32e158 implemented
Linux 3.6+ support by adding duplicate vn_rename and vn_remove
functions. The new ones were cleaner, but the duplicate functions made
the codebase less maintainable. This adds some compatibility shims that
allow us to retire the older vn_rename and vn_remove in favor of the new
ones on old kernels. The result is a net 143 line reduction in lines of
code and a cleaner codebase.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #370
2014-08-13 16:25:44 -07:00
Ned Bass 2fc44f66ec Linux 3.17 compat: remove wait_on_bit action function
Linux kernel 3.17 removes the action function argument from
wait_on_bit().  Add autoconf test and compatibility macro to support
the new interface.

The former "wait_on_bit" interface required an 'action' function to
be provided which does the actual waiting. There were over 20 such
functions in the kernel, many of them identical, though most cases
can be satisfied by one of just two functions: one which uses
io_schedule() and one which just uses schedule().  This API change
was made to consolidate all of those redundant wait functions.

References: torvalds/linux@7431620

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #378
2014-08-11 14:17:00 -07:00
Brian Behlendorf f2297b5a89 Set spl_kmem_cache_slab_limit=16384 to default
For small objects the Linux slab allocator should be used to make the most
efficient use of the memory.  However, large objects are not supported by
the Linux slab and therefore the SPL implementation is preferred.  A cutoff
of 16K was determined to be optimal for architectures using 4K pages.
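
The cutoff can be pictured with the sketch below.  The demo_* names are
hypothetical; only the 16384 byte limit comes from this change, and treating
a limit of zero as "Linux slab disabled" is an assumption of the sketch.

```c
#include <stddef.h>

enum demo_cache_type {
	DEMO_KMC_SLAB,  /* backed by the Linux slab allocator */
	DEMO_KMC_SPL    /* backed by the SPL slab implementation */
};

/* Stand-in for the spl_kmem_cache_slab_limit module option. */
static size_t demo_slab_limit = 16384;

/* Choose the backing implementation for a cache of obj_size byte objects. */
static enum demo_cache_type demo_cache_type_for(size_t obj_size)
{
	if (demo_slab_limit != 0 && obj_size <= demo_slab_limit)
		return DEMO_KMC_SLAB;
	return DEMO_KMC_SPL;
}
```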

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Issue #356
Closes #379
2014-08-08 08:51:45 -07:00
Brian Behlendorf c1aef26944 Set spl_kmem_cache_reclaim=0 to default
Reinstate the correct default behavior of returning the number of objects
in the cache for reclaim.  This behavior was disabled in recent releases
due to occasional reports of spinning in shrink_slabs().  Those issues have
been resolved and can no longer be reproduced.  See commit 376dc35.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Issue #358
Closes #379
2014-08-08 08:50:03 -07:00
Tim Chase 2bf35fb754 Add atomic_swap_32() and atomic_swap_64()
The atomic_swap_32() function maps to atomic_xchg(), and
the atomic_swap_64() function maps to atomic64_xchg().
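
In userspace the same exchange semantics can be shown with C11 atomics,
where atomic_exchange() plays the role of atomic_xchg() and
atomic64_xchg().  The demo_* names are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Store newval into *target and return the value that was there before. */
static uint32_t demo_atomic_swap_32(_Atomic uint32_t *target, uint32_t newval)
{
	return atomic_exchange(target, newval);
}

/* 64-bit variant with the same semantics. */
static uint64_t demo_atomic_swap_64(_Atomic uint64_t *target, uint64_t newval)
{
	return atomic_exchange(target, newval);
}
```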

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #377
2014-07-28 14:19:24 -07:00
Tim Chase 7f23e00109 Add functions and macros as used upstream.
Added highbit64() and howmany() which are used in recent upstream
code.  Both highbit() and highbit64() should at some point be
re-factored to use the optimized fls() and fls64() functions.
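
For reference, howmany() is conventionally a round-up division.  The macro
below follows the usual BSD-style definition under a hypothetical
DEMO_HOWMANY name; whether the SPL definition is spelled identically is an
assumption.

```c
/* Number of y-sized chunks needed to cover x, i.e. x / y rounded up. */
#define DEMO_HOWMANY(x, y)  (((x) + ((y) - 1)) / (y))

/* Example use: blocks needed to hold 'bytes' bytes of data. */
static unsigned long demo_blocks_needed(unsigned long bytes,
    unsigned long block_size)
{
	return DEMO_HOWMANY(bytes, block_size);
}
```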

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #363
2014-07-22 09:47:48 -07:00
Brian Behlendorf 377e12f14a Rate limit debugging stack traces
There have been issues in the past where excessive debug logging
to the console has resulted in significant performance impacts.
In the vast majority of these cases only a few stack traces are
required to diagnose the issue.  Therefore, stack traces dumped to
the console will now be limited to 5 every 60s.
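
A fixed-window limiter with these numbers might look like the sketch below.
The demo_* names are hypothetical and the caller passes the current time in
seconds explicitly; how the patch itself implements the limit is not
described here.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_BURST     5   /* traces allowed per window */
#define DEMO_INTERVAL  60  /* window length in seconds */

struct demo_ratelimit {
	uint64_t     window_start;  /* time the current window opened */
	unsigned int printed;       /* traces emitted in this window */
};

/* Return true if a stack trace may be emitted at time 'now' (seconds). */
static bool demo_trace_allowed(struct demo_ratelimit *rl, uint64_t now)
{
	if (now - rl->window_start >= DEMO_INTERVAL) {
		/* The old window has expired; open a new one. */
		rl->window_start = now;
		rl->printed = 0;
	}
	if (rl->printed >= DEMO_BURST)
		return false;
	rl->printed++;
	return true;
}
```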

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Closes #374
2014-07-22 09:47:24 -07:00
Tim Chase f6a869614e Safer debugging and assertion macros.
Spl's debugging and assertion macros used the typical do/while(0)
form for if/else friendliness, however, this limits their use in contexts
where a do loop is not valid; such as within another multi-statement
style macro.

The following macros have been converted to not use do/while(0):
	PANIC, ASSERT, ASSERTF, VERIFY, VERIFY3_IMPL

PANIC has been converted to a wrapper around the new spl_PANIC() function.

The other macros have been converted to use the "&&" operator for the
branch-prediction conditional and also to use spl_PANIC().

The __ASSERT() macro was not touched.  It is only used by the debugging
infrastructure and that code, including this macro, will be retired when
the tracepoint patches are merged.
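
The expression form can be sketched in userspace as follows.  The DEMO_* and
demo_* names are hypothetical, and the panic hook here only counts calls so
the example can be exercised; a real panic function would not return.

```c
/* Hypothetical panic hook: counts calls and returns non-zero so that it
 * can sit on the right-hand side of "&&". */
static int demo_panic_count;

static int demo_panic(const char *file, int line, const char *msg)
{
	(void) file;
	(void) line;
	(void) msg;
	demo_panic_count++;
	return 1;
}

/* Expression form: valid anywhere an expression is valid. */
#define DEMO_VERIFY(cond) \
	((void)(!(cond) && \
	    demo_panic(__FILE__, __LINE__, "VERIFY(" #cond ") failed")))

/* The check can now be embedded in another expression-style macro. */
#define DEMO_CHECKED_DIV(a, b)  (DEMO_VERIFY((b) != 0), (a) / (b))
```

Because "&&" short-circuits, the panic hook is only evaluated when the
condition is false, and the whole check remains a single expression rather
than a statement.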

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #367
2014-07-01 15:14:43 -07:00
Brian Behlendorf 31cb5383bf Tag spl-0.6.3
META file and release log updated.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-06-12 11:32:38 -07:00
Turbo Fredriksson 1e929b97ac Set LANG to a reasonable default (C)
Set LANG=C before calling 'rpmbuild' to avoid rpmbuild failing on
the translated date string in the changelog.

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #306
2014-06-10 14:50:11 -07:00
Brian Behlendorf 4cdcdbff63 Fix DKMS package upgrade and packager
Running 'yum upgrade spl-dkms' package could appear to work properly
and still leave you with no spl modules installed.  This will occur
when only the spl release, and not the version, are incremented.
This may be the case for a fast moving spl-testing repository.

During the upgrade process DKMS will realize that spl-x.y.z is already
installed and remove it.  DKMS then correctly builds the new modules
for spl-x.y.z.  However, as a final step when the old spl-x.y.z-r is
removed the %preun script runs and removes the newly built modules.
To handle this case the %preun script has been updated to only run
when the installed version exactly matches the full spec file version.

This change also updated ChangeLog section based on the DKMS
reference spec file.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-05-30 11:20:51 -07:00
Brian Behlendorf c4f38ddd80 Restrict release number to META version
When creating packages in a git repository the release number
can be automatically set by 'git describe'.  This normally works
well but if your repository has newer tags which match the form
NAME-VERSION* the release may be incorrectly calculated.  To
prevent this the match pattern has been restricted to NAME-VERSION.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-05-29 19:08:03 -07:00
Brian Behlendorf 376dc35e22 Add spl_kmem_cache_reclaim module option
The correct behavior for all registered shrinkers is to return the
number of objects in their cache.  In theory this allows the Linux
VM to balance memory reclaim across all registered caches.

In commit b9b3715 this behavior was disabled in favor of returning
-1 which notifies the VM that no additional objects are available
for reclaim.  This was done as a workaround to resolve thrashing
in shrink_slabs() which could occur when memory was low and numerous
cores were in reclaim.  Unfortunately, this has been observed to
increase the likelihood of OOM events when SPL slab consumers are
responsible for consuming the majority of memory.

Therefore, this patch makes this behavior tunable.  Setting the
spl_kmem_cache_reclaim module option to 0x1 will result in the
shrinker only being called once.  This is the default behavior.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Closes #358
2014-05-22 10:30:12 -07:00
Brian Behlendorf a073aeb060 Add KMC_SLAB cache type
For small objects the Linux slab allocator has several advantages
over its counterpart in the SPL.  These include:

1) It is more memory-efficient and packs objects more tightly.
2) It is continually tuned to maximize performance.

Therefore it makes sense to layer the SPL's slab allocator on top
of the Linux slab allocator.  This allows us to leverage the
advantages above while preserving the Illumos semantics we depend
on.  However, there are some things we need to be careful of:

1) The Linux slab allocator was never designed to work well with
   large objects.  Because the SPL slab must still handle this use
   case a cut off limit was added to transition from Linux slab
   backed objects to kmem or vmem backed slabs.

   spl_kmem_cache_slab_limit - Objects less than or equal to this
   size in bytes will be backed by the Linux slab.  By default
   this value is zero which disables the Linux slab functionality.
   Reasonable values for this cut off limit are in the range of
   4096-16386 bytes.

   spl_kmem_cache_kmem_limit - Objects less than or equal to this
   size in bytes will be backed by a kmem slab.  Objects over this
   size will be vmem backed instead.  This value defaults to
   1/8 a page, or 512 bytes on an x86_64 architecture.

2) Be aware that using the Linux slab may inadvertently introduce
   new deadlocks.  Care has been taken previously to ensure that
   all allocations which occur in the write path use GFP_NOIO.
   However, there may be internal allocations performed in the
   Linux slab which do not honor these flags.  If this is the case
   a deadlock may occur.
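
The size-based selection described in item 1 can be sketched as follows.
This is a minimal user-space illustration, not the SPL code: the
demo_pick_backing() function and the DEMO_BACKING_* names are hypothetical,
and explicit KMC_KMEM/KMC_VMEM/KMC_SLAB overrides are not modeled.

```c
#include <stddef.h>

/* Hypothetical labels for the three backing stores described above. */
typedef enum {
	DEMO_BACKING_LINUX_SLAB,
	DEMO_BACKING_KMEM,
	DEMO_BACKING_VMEM
} demo_backing_t;

/*
 * Pick a backing store from the object size and the two limits.
 * A slab_limit of zero disables the Linux slab path entirely,
 * mirroring the documented default.
 */
static demo_backing_t
demo_pick_backing(size_t obj_size, size_t slab_limit, size_t kmem_limit)
{
	if (slab_limit != 0 && obj_size <= slab_limit)
		return (DEMO_BACKING_LINUX_SLAB);
	if (obj_size <= kmem_limit)
		return (DEMO_BACKING_KMEM);
	return (DEMO_BACKING_VMEM);
}
```

With the defaults quoted above (slab limit 0, kmem limit 512), a 256 byte
object would be kmem backed and a 1024 byte object vmem backed.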

The path forward is definitely to start relying on the Linux slab.
But for that to happen we need to start building confidence that
there aren't any unexpected surprises lurking for us.  And ideally
we need to move completely away from using the SPL's slab for large
memory allocations.  This patch is a first step.

NOTES:
1) The KMC_NOMAGAZINE flag was leveraged to support the Linux slab
   backed caches but it is not supported for kmem/vmem backed caches.

2) Regardless of the spl_kmem_cache_*_limit settings a cache may
   be explicitly set to a given type by passing the KMC_KMEM,
   KMC_VMEM, or KMC_SLAB flags during cache creation.

3) The constructors, destructors, and reclaim callbacks are all
   functional and will be called regardless of the cache type.

4) KMC_SLAB caches will not appear in /proc/spl/kmem/slab due to
   the issues involved in presenting correct object accounting.
   Instead they will appear in /proc/slabinfo under the same names.

5) Several kmem SPLAT tests needed to be fixed because they relied
   incorrectly on internal kmem slab accounting.  With the updated
   test cases all the SPLAT tests pass as expected.

6) An autoconf test was added to ensure that the __GFP_COMP flag
   was correctly added to the default flags used when allocating
   a slab.  This is required to ensure all pages in higher order
   slabs are properly refcounted, see ae16ed9.

7) When using the SLUB allocator there is no need to attempt to
   set the __GFP_COMP flag.  This has been the default behavior
   for the SLUB since Linux 2.6.25.

8) When using the SLUB it may be desirable to set the slub_nomerge
   kernel parameter to prevent caches from being merged.

Original-patch-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes #356
2014-05-22 10:28:01 -07:00
Chunwei Chen ad3412efd7 Linux 3.15: vfs_rename() added a flags argument
Detect the updated vfs_rename() interface and call it with an
extra flags argument.

References:
  torvalds/linux@520c8b1

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #355
2014-05-07 13:38:17 -07:00
Chunwei Chen 1538f4b6e3 Linux 3.15 compat: NICE_TO_PRIO and PRIO_TO_NICE
These macros were exposed to make them available to other
parts of the kernel and modules.

References:
  torvalds/linux@6b6350f

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #355
2014-05-07 13:38:03 -07:00
Andrey Vesnovaty 703371d8c7 Evenly distribute the taskq threads across available CPUs
The problem is described in commit aeeb4e0c0a.
However, instead of disabling the binding to CPU altogether we just keep the
last CPU index across calls to taskq_create() and thus achieve even
distribution of the taskq threads across all available CPUs.

The implementation is based on the assumption that task queue
initialization is performed serially.
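
The idea of carrying the CPU index across taskq_create() calls can be
modeled in user space as below.  This is only a sketch under stated
assumptions: demo_taskq_create(), demo_spread() and the 64 CPU cap are
hypothetical, no real thread binding is performed, and like the patch it
assumes the calls are made serially.

```c
#include <string.h>

#define	DEMO_MAX_CPUS	64	/* hypothetical cap; ncpus must be 1..64 */

static int demo_next_cpu;		/* kept across create calls */
static int demo_load[DEMO_MAX_CPUS];	/* threads assigned per CPU */

/* Assign each new thread to the next CPU, continuing where we left off. */
static void
demo_taskq_create(int nthreads, int ncpus)
{
	for (int i = 0; i < nthreads; i++) {
		demo_load[demo_next_cpu]++;
		demo_next_cpu = (demo_next_cpu + 1) % ncpus;
	}
}

/* Create several taskqs and return the max-min per-CPU load difference. */
static int
demo_spread(int ntaskqs, int nthreads, int ncpus)
{
	int min, max;

	memset(demo_load, 0, sizeof (demo_load));
	demo_next_cpu = 0;
	for (int q = 0; q < ntaskqs; q++)
		demo_taskq_create(nthreads, ncpus);

	min = max = demo_load[0];
	for (int c = 1; c < ncpus; c++) {
		if (demo_load[c] < min)
			min = demo_load[c];
		if (demo_load[c] > max)
			max = demo_load[c];
	}
	return (max - min);
}
```

Eight single-threaded taskqs on four CPUs end up with two threads per CPU
in this model, whereas restarting from CPU 0 on every call would place all
eight on the first CPU.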

Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Andrey Vesnovaty <andreyv@infinidat.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #336
2014-04-25 15:29:18 -07:00
Chunwei Chen ae16ed992b Fix crash when using ZFS on Ceph rbd
When using __get_free_pages to get high order memory, only the first page's
_count is set to 1; the others are 0. When an internal page gets passed into
rbd, it eventually goes into tcp_sendpage. There, get_page and put_page are
called on it, and it is freed erroneously when _count drops back to 0.

The solution to this problem is to use compound pages. All pages in a
high order compound page share a single _count, so get_page and put_page in
tcp_sendpage will not cause _count to drop to 0.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #251
2014-04-25 15:26:52 -07:00
Jorgen Lundman d6e6e4a98e Add support for aarch64 (ARMv8)
Using the ARM reference simulation (fast model foundation v8) I
cross compiled spl and zfs, to confirm it works on ARMv8 (64 bit
arm architecture, called aarch64 in Linux).

As it is based on previous ARM porting, the resulting patch is
disappointingly small, there was very little to do. The code fixes
the compile issues and has light testing done.

Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #351
2014-04-25 15:25:32 -07:00
Richard Yao 89aa97059d Change spl_kmem_cache_expire default setting to 2
This behavior is more consistent with the way memory reclaim
is expected to work under Linux.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #349
2014-04-14 16:29:01 -07:00
Andrey Vesnovaty bdfbe594a1 Expose max/min objs per slab and max slab size
By default the maximal number of objects in a slab can't exceed (16*2 - 1) and
the slab size can't exceed 32M.
Today's high end servers, with a couple hundred GB of RAM available for ARC, may
run into trouble with virtual memory because of the restriction mentioned
above.

Problem:
Reasons for the very high number of virtual memory allocations:
	* Real slab size is very small relative to the size of the entire RAM
	* Slabs are allocated in virtual memory and fill the entire ARC

The result is a very high number of allocated virtual memory ranges (hundreds of
ranges). When the virtual memory subsystem manages a high number of ranges its
performance becomes so poor that it freezes from time to time.

Solution:
The number of objects per slab should be increased, taking into account the
maximal slab size, which can also be increased if needed.

Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #337
2014-04-14 09:42:04 -07:00
Chunwei Chen 545e9ac00a Add ddi_time_after and friends
When comparing times obtained from ddi_get_lbolt, we have to account for the
wrap around of jiffies. Therefore, we cannot use 't1 < t2'. Instead we should
use 't1 - t2 < 0'.

This patch adds ddi_time_after and friends to address this issue. They have
strict type restrictions, clock_t for the vanilla and int64_t for the 64
versions, to prevent type conversions from corrupting the comparison.
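
The wrap-safe comparison can be illustrated with a small user-space
sketch.  The tick_t type and the tick_before()/tick_after() helpers are
hypothetical stand-ins, not the ddi_time_* macros themselves.  The
subtraction is done on unsigned values because signed overflow is
undefined in portable C.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a tick counter that may wrap around. */
typedef int64_t tick_t;

/*
 * True if a is before b.  The difference is computed as unsigned (where
 * wrap-around is well defined) and then reinterpreted as signed, which is
 * the 't1 - t2 < 0' test from the text.  The unsigned to signed cast is
 * implementation-defined, but wraps on common compilers.
 */
static inline bool
tick_before(tick_t a, tick_t b)
{
	return ((int64_t)((uint64_t)a - (uint64_t)b) < 0);
}

/* True if a is after b. */
static inline bool
tick_after(tick_t a, tick_t b)
{
	return (tick_before(b, a));
}
```

For a counter that has just wrapped, for example a = INT64_MAX - 5 and
b = INT64_MIN + 4 (ten ticks later), the naive 'a < b' is false while
tick_before(a, b) is true.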

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #335
2014-04-14 09:32:01 -07:00
Yuxuan Shui 6c48cd8ac2 This patch add a CTASSERT macro for compile time assertion.
This macro makes the compiler emit a "mixed definition and code"
warning; I can't find a way to avoid it.
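
One common way to build such a macro is a typedef of an array whose size
becomes negative when the condition is false.  The sketch below shows that
technique only; the MY_CTASSERT name and its helper macros are hypothetical
and are not taken from the patch.  Because the macro expands to a
declaration, using it after statements inside a function body is what would
trigger a "mixed declarations and code" style warning.

```c
#include <stdint.h>

/* Paste a unique suffix onto the typedef name (hypothetical helpers). */
#define	MY_CTASSERT_JOIN(a, b)	a##b
#define	MY_CTASSERT_NAME(line)	MY_CTASSERT_JOIN(my_ctassert_line_, line)

/* Array size is 1 when cond holds and -1 (a compile error) otherwise. */
#define	MY_CTASSERT(cond) \
	typedef char MY_CTASSERT_NAME(__LINE__)[(cond) ? 1 : -1]

/* Example: a hypothetical header whose layout must stay 8 bytes. */
typedef struct demo_hdr {
	uint32_t dh_magic;
	uint32_t dh_version;
} demo_hdr_t;

MY_CTASSERT(sizeof (demo_hdr_t) == 8);
MY_CTASSERT(sizeof (uint32_t) == 4);
```

Changing either condition to something false makes the translation unit
fail to compile, which is the point of the check.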

This patch lays some groundwork for the persistent l2arc feature.
See https://www.illumos.org/issues/3525.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #303
2014-04-14 09:28:53 -07:00
Richard Yao acf0ade362 Simplify hostid logic
There is plenty of compatibility code for a hw_hostid
that isn't used by anything. At the same time, there are apparently
issues with the current hostid logic. coredumb in #zfsonlinux on
freenode reported that Fedora 17 changes its hostid on every boot, which
required force importing his pool. A suggestion by wca was to adopt
FreeBSD's behavior, where it treats hostid as zero if /etc/hostid does
not exist.

Adopting FreeBSD's behavior permits us to eliminate plenty of code,
including a userland helper that invokes the system's hostid as a
fallback.
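
The FreeBSD-style fallback amounts to something like the following
user-space sketch.  The demo_read_hostid() helper is hypothetical, and it
assumes the file holds a 4 byte value in native byte order; the module's
actual parsing may differ.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Return the hostid stored in 'path', or zero when the file does not
 * exist or cannot be read in full.  No userland helper is consulted.
 */
static uint32_t
demo_read_hostid(const char *path)
{
	uint32_t hostid = 0;
	FILE *fp = fopen(path, "rb");

	if (fp == NULL)
		return (0);		/* no file: treat hostid as zero */
	if (fread(&hostid, sizeof (hostid), 1, fp) != 1)
		hostid = 0;		/* short or unreadable file */
	fclose(fp);
	return (hostid);
}
```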

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #224
2014-04-14 09:04:41 -07:00
Tim Chase 3ceb71e896 Call kthread_create() correctly with fixed arguments.
The kernel's kthread_create() function is defined as "..." and there is
no va_list variant at the moment.  The task name is pre-formatted into
a local buffer and passed to kthread_create() with fixed arguments.
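
The pre-formatting approach can be sketched in user space as follows.
Both functions are hypothetical: demo_variadic_create() stands in for a
callee that only accepts "..." arguments, and demo_wrapper_create() shows
the wrapper formatting the name first and then passing it with a fixed
"%s" format.

```c
#include <stdarg.h>
#include <stdio.h>

static char demo_created_name[32];	/* records the name the callee saw */

/* Stand-in for a variadic callee that offers no va_list variant. */
static int
demo_variadic_create(const char *fmt, ...)
{
	va_list ap;
	int n;

	va_start(ap, fmt);
	n = vsnprintf(demo_created_name, sizeof (demo_created_name), fmt, ap);
	va_end(ap);
	return (n);
}

/* Format the name into a local buffer, then pass it as a fixed argument. */
static int
demo_wrapper_create(const char *fmt, ...)
{
	char name[32];
	va_list ap;

	va_start(ap, fmt);
	vsnprintf(name, sizeof (name), fmt, ap);
	va_end(ap);

	/* "%s" keeps any '%' in the finished name from being reinterpreted. */
	return (demo_variadic_create("%s", name));
}
```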

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #347
2014-04-11 09:41:40 -07:00
Tim Chase ed650dee76 De-inline spl_kthread_create().
The function was defined as a static inline with variable arguments
which causes gcc to generate errors on some distros.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #346
2014-04-09 19:17:12 -07:00
Tim Chase 17a527cb0f Support post-3.13 kthread_create() semantics.
Provide spl_kthread_create() as a wrapper to the kernel's kthread_create()
to provide pre-3.13 semantics.  Re-try if the call is interrupted or if it
would have returned -ENOMEM.  Otherwise return NULL.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #339
2014-04-08 12:44:42 -07:00
Brian Behlendorf e19101e08f splat cred:groupmember: Fix false positives
Due to certain assumptions made in the cred:groupmember test it
could result in false positives when run on specific distributions.
This was solely a bug in the test case and not in the groupmember()
function which the test case was validating.

To prevent future false positives the test case has been rewritten
to be both more rigorous and to make fewer assumptions about the
system.

Minor style cleanup was done to cr_groups_search() and groupmember()
functions.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-04-08 12:44:41 -07:00
Brian Behlendorf 668d2a0da5 splat kmem:slab_reclaim: Test cleanup
By setting __GFP_NORETRY the kernel memory reclaim logic was allowed to
abort early and dump a failed allocation stack to the console.  Since
this was done in a tight loop to fill memory it could result in a large
number of stacks being dumped to the console.  This in turn slowed down
the test sufficiently so it exceeded the time limit and failed.

To resolve this issue the __GFP_NORETRY flag is being removed.  This is
how it should have been called originally to ensure we're simulating
the behavior of most callers which will use the GFP_KERNEL flag.

In addition, the reclaim granularity of 1000 objects was far too coarse
for this to be a realistic test.  For kmem:slab_reclaim there might
only be a few thousand objects total in the cache.  Therefore, the
SPLAT_KMEM_OBJ_RECLAIM constant for these tests was lowered.  This
will cause the reclaim callback to run more frequently which makes
for a better test case.

The frequency of the cache reaping in kmem:slab_reap was increased
to accommodate the reduced number of objects released during the
reclaim.

These changes only impact the test cases and were done to remove
false positives caused by the test case itself.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-04-08 12:44:41 -07:00
Brian Behlendorf 4c995417bc Remove incorrect use of EXTRA_DIST for man pages
Setting the 'dist_' prefix is the correct way to instruct Automake
to include these files in the distribution.  The EXTRA_DIST variable
is reserved for files which are not covered by the automatic rules.

  http://www.gnu.org/software/automake/manual/automake.html#Basics

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-01-17 11:54:22 -08:00
marku89 d58a99af2f Define the needed ISA types for Sparc
Add the minimum required ISA types to support the Sparc
architecture.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: marku89 <mar42@kola.li>
Closes #317
2014-01-09 15:55:32 -08:00
Brian Behlendorf aeeb4e0c0a Remove default taskq thread to CPU bindings
When this code was written it appears to have been assumed that
every taskq would have a large number of threads.  In this case
it would make sense to attempt to evenly bind the threads over
all available CPUs.  However, it failed to consider that creating
taskqs with a small number of threads will cause the CPUs with
lower ids to become over-subscribed.

For this reason the kthread_bind() call is being removed and
we're leaving the kernel to schedule these threads as it sees fit.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #325
2014-01-07 10:46:24 -08:00
Brian Behlendorf 2f117de8be Include linux/vmalloc.h for ARM and Sparc
Related to issue #257 which added Linux 3.10 compatibility.  For
ARM and Sparc architectures we must explicitly include the
<linux/vmalloc.h> header to ensure the vmalloc_info structure
is always defined when available.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #257
Closes #291
2014-01-07 10:45:39 -08:00
Brian Behlendorf 921a35adeb Add module versioning
Use the standard Linux MODULE_VERSION macro to expose the installed
spl and splat module versions.  This will also automatically add a
checksum of the .c files and headers in "srcversion".  See:

  /sys/module/spl/version
  /sys/module/spl/srcversion
  /sys/module/splat/version
  /sys/module/splat/srcversion

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#1923
2013-12-06 11:03:43 -08:00
Richard Yao 50a0749eba Linux 3.13 compat: Pass NULL for new delegated inode argument
This check was originally added for SLES10, a093c6a, to check for
a 'struct vfsmount *' argument which they added.  However, since
SLES10 is based on a 2.6.16 kernel which is no longer supported
this functionality was dropped.  The checks were refactored to
support Linux 3.13 without concern for historical versions.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #312
2013-12-02 10:37:49 -08:00
Richard Yao 3e96de17d7 Linux 3.13 compat: Remove unused flags variable from __cv_init()
GCC 4.8.1 complained about an unused flags variable when building
against Linux 2.6.26.8:

/var/tmp/portage/sys-kernel/spl-9999/work/spl-9999/module/spl/../../module/spl/spl-condvar.c:
In function ‘__cv_init’:
/var/tmp/portage/sys-kernel/spl-9999/work/spl-9999/module/spl/../../module/spl/spl-condvar.c:39:6:
error: variable ‘flags’ set but not used
[-Werror=unused-but-set-variable]
  int flags = KM_SLEEP;
        ^
	cc1: all warnings being treated as errors

Additionally, the superfluous code uses a preempt_count variable that is
no longer available on Linux 3.13. Deleting the unnecessary code fixes a
Linux 3.13 compatibility issue.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #312
2013-12-02 10:11:19 -08:00
Turbo Fredriksson 30607d9b7b Document SPL module parameters.
This is a first draft of a spl-module-parameters(5) man page. I have
just extracted the parameter name and its description with modinfo,
then checked the source what type it is and its default value.

This will need more work, preferably by someone who actually knows these
values and what to use them for.  Similar to zfsonlinux/zfs#1856, but
for the spl.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#1856
2013-11-21 12:32:41 -08:00
Brian Behlendorf dd33a169ef Retroactively fix bogus %changelog date
New versions of rpmbuild detect the invalid date which was added
incorrectly to the changelog.  To silence this noise fix it.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #306
2013-11-14 14:10:19 -08:00
Cyril Plisko 9bd8cbc53d Tighten spl dependency on spl-kmod
Make spl depend on the same version of spl-kmod, rather than on same or
better. When yum repository contains a number of versions the dependency
resolution breaks on trying to install non-latest version.

Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#1677
2013-11-14 13:58:35 -08:00
Richard Yao c3d9c0df3e Linux 3.12 compat: New shrinker API
torvalds/linux@24f7c6 introduced a new shrinker API while
torvalds/linux@a0b021 dropped support for the old shrinker API.
This patch adds support for the new shrinker API by wrapping
the old one with the new one.
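
Wrapping an old-style callback in the split count/scan interface can be
modeled roughly as below.  This is a user-space toy, not the SPL wrapper:
all demo_* names are hypothetical, the cache is a simple counter, and
DEMO_SHRINK_STOP merely mimics the kernel's "stop scanning" return value.

```c
/* Mimics the "cannot make progress" return of the new scan callback. */
#define	DEMO_SHRINK_STOP	(~0UL)

static long demo_cache_objects = 100;	/* toy cache population */

/*
 * Old-style callback: with nr_to_scan == 0 it only reports the object
 * count; otherwise it frees up to nr_to_scan objects and returns how
 * many remain.  A negative return means no progress is possible.
 */
static int
demo_old_shrink(unsigned long nr_to_scan)
{
	if (nr_to_scan > 0) {
		long want = (long)nr_to_scan;
		long freed = want < demo_cache_objects ?
		    want : demo_cache_objects;
		demo_cache_objects -= freed;
	}
	return ((int)demo_cache_objects);
}

/* New-style count callback built on the old one. */
static unsigned long
demo_count_objects(void)
{
	int n = demo_old_shrink(0);

	return (n < 0 ? 0 : (unsigned long)n);
}

/* New-style scan callback: report how many objects were freed. */
static unsigned long
demo_scan_objects(unsigned long nr_to_scan)
{
	int before = demo_old_shrink(0);
	int after = demo_old_shrink(nr_to_scan);

	if (before < 0 || after < 0)
		return (DEMO_SHRINK_STOP);
	return ((unsigned long)(before - after));
}
```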

This change also reorganizes the autotools checks on the shrinker
API such that the configure script will fail early if an unknown
API is encountered in the future.

Support for the set_shrinker() API which was used by Linux 2.6.22
and older has been dropped.  As a general rule compatibility is
only maintained back to Linux 2.6.26.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#1732
Closes zfsonlinux/zfs#1822
Closes #293
Closes #307
2013-11-06 13:23:40 -08:00
Ned Bass 184c687387 Emulate illumos interface cv_timedwait_hires()
Needed for Illumos #3582. This interface is supposed to support
a variable-resolution timeout with nanosecond granularity.  This
implementation rounds up to microsecond resolution, as nanosecond-
precision timing is rarely needed for real-world performance
tuning and may incur unnecessary busy-waiting.  usleep_range() is
used if available, otherwise udelay() or msleep() are used
depending on the length of the delay interval.

Add flags from sys/callo.h as these are used to control the behavior of
cv_timedwait_hires().  Specifically,

CALLOUT_FLAG_ABSOLUTE
    Normally, the expiration passed to the timeout API functions is
    an expiration interval. If this flag is specified, then it is
    interpreted as the expiration time itself.

CALLOUT_FLAG_ROUNDUP
    Roundup the expiration time to the next resolution boundary. If this
    flag is not specified, the expiration time is rounded down.
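
The effect of the two flags on the computed expiration can be sketched as
follows.  The demo_expiration() helper and the DEMO_FLAG_* values are
hypothetical and only model the arithmetic described above for
non-negative times; they are not the SPL implementation.

```c
#include <stdint.h>

#define	DEMO_FLAG_ROUNDUP	0x1	/* round up to the next boundary */
#define	DEMO_FLAG_ABSOLUTE	0x2	/* 'tim' is already an expiration time */

/*
 * Compute the expiration time for a timeout 'tim' with resolution 'res',
 * given the current time 'now'.  All values share one unit.
 */
static int64_t
demo_expiration(int64_t tim, int64_t res, int flag, int64_t now)
{
	int64_t expire = (flag & DEMO_FLAG_ABSOLUTE) ? tim : now + tim;

	if (res > 1) {
		if (flag & DEMO_FLAG_ROUNDUP)
			expire += res - 1;
		expire -= expire % res;	/* align down to a boundary */
	}
	return (expire);
}
```

With now = 10000, tim = 1500 and res = 1000, the relative timeout expires
at 11000 by default and at 12000 when the round-up flag is set.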

References:
    https://www.illumos.org/issues/3582
    illumos/illumos-gate@0689f76

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #304
2013-11-04 09:49:24 -08:00
Brian Behlendorf 0f4b9a5806 Merge branch 'kstat'
This branch updates the existing kstat infrastructure to be
more flexible.  In particular, it extends the KSTAT_TYPE_RAW
type so it may be used to generate more dynamic kstats without
the need for additional custom types.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-10-25 13:50:12 -07:00
Ned Bass f483a97a41 3537 add kstat_waitq_enter and friends
These kstat interfaces are required to port
"Illumos #3537 want pool io kstats" to ZFS on Linux.

kstat_waitq_enter()
kstat_waitq_exit()
kstat_runq_enter()
kstat_runq_exit()

Additionally, zero out the ks_data buffer in __kstat_create() so
that the kstat_io_t counters are initialized to zero.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-10-25 13:41:52 -07:00
Cyril Plisko ffbf0e57c2 Kstat to use private lock by default
While porting Illumos #3537 I found that ks_lock member of kstat_t
structure is different between Illumos and SPL. It is a pointer to
the kmutex_t in Illumos, but the mutex lock itself in SPL.
Apparently Illumos kstat API allows consumer to override the lock
if required. With SPL implementation it is not possible anymore.

Things were alright until the first attempt to actually override
the lock. Porting of Illumos #3537 introduced such code for the
first time.

In order to provide the Solaris/Illumos like functionality we:
  1. convert ks_lock to "kmutex_t *ks_lock"
  2. create a new field "kmutex_t ks_private_lock"
  3. On kstat_create() ks_lock = &ks_private_lock

Thus if consumer doesn't care we still have our internal lock in use.
If, however, consumer does care she has a chance to set ks_lock to
anything else before calling kstat_install().

The rest of the code will use ks_lock regardless of its origin.
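
The three steps can be pictured with the following user-space sketch.
demo_mutex_t stands in for kmutex_t and every demo_* name is hypothetical;
only the ks_lock and ks_private_lock field names come from the
description above.

```c
/* Stand-in for kmutex_t; a real lock is not needed for the illustration. */
typedef struct demo_mutex {
	int held;
} demo_mutex_t;

typedef struct demo_kstat {
	demo_mutex_t	ks_private_lock;	/* always available */
	demo_mutex_t	*ks_lock;		/* lock actually used */
} demo_kstat_t;

/* On create, point ks_lock at the private lock. */
static void
demo_kstat_create(demo_kstat_t *ksp)
{
	ksp->ks_private_lock.held = 0;
	ksp->ks_lock = &ksp->ks_private_lock;
}

static int
demo_kstat_uses_private_lock(const demo_kstat_t *ksp)
{
	return (ksp->ks_lock == &ksp->ks_private_lock);
}

/* A consumer that cares overrides ks_lock before installing the kstat. */
static int
demo_override_scenario(void)
{
	static demo_mutex_t consumer_lock;
	demo_kstat_t ks;
	int before, after;

	demo_kstat_create(&ks);
	before = demo_kstat_uses_private_lock(&ks);
	ks.ks_lock = &consumer_lock;
	after = demo_kstat_uses_private_lock(&ks);
	return (before && !after);
}
```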

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #286
2013-10-25 13:41:30 -07:00
Brian Behlendorf ce07767f79 Revert "Add KSTAT_TYPE_TXG type"
This reverts commit dba79fcbf2 in
favor of using the generic KSTAT_TYPE_RAW callbacks.  The advantage
of this approach is that arbitrary types can be added without the
need to add them to the SPL.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #296
2013-10-16 14:48:35 -07:00
Prakash Surya 09f38b7e60 Add wrappers for accessing PID and command info
This change adds simple wrappers for accessing a thread's PID and
command character string.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #296
2013-10-16 14:48:35 -07:00
Prakash Surya 56d40a686b Add callbacks for displaying KSTAT_TYPE_RAW kstats
The current implementation for displaying kstats of type KSTAT_TYPE_RAW
is rather crude. This patch attempts to enhance this handling by
allowing a kstat user to register formatting callbacks which can
optionally be used.

The callbacks allow the user to implement functions for interpreting
their data and transposing it into a character buffer. This buffer,
containing a string representation of the raw data, is then displayed
through the current /proc textual interface.
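
As a rough user-space model of the idea, an optional formatting callback
can replace a crude default dump.  Everything here is hypothetical
(demo_raw_kstat_t, demo_raw_show(), the sample data); the patch's actual
registration interface is not reproduced.

```c
#include <stddef.h>
#include <stdio.h>

/* Optional callback that renders raw data into a character buffer. */
typedef int (*demo_raw_fmt_t)(char *buf, size_t size, const void *data);

typedef struct demo_raw_kstat {
	const void	*rk_data;
	size_t		rk_len;
	demo_raw_fmt_t	rk_fmt;		/* NULL means use the hex dump */
} demo_raw_kstat_t;

/* Render the kstat: use the callback if registered, else dump hex bytes. */
static int
demo_raw_show(const demo_raw_kstat_t *rk, char *buf, size_t size)
{
	const unsigned char *p = rk->rk_data;
	size_t off = 0;

	if (size == 0)
		return (0);
	if (rk->rk_fmt != NULL)
		return (rk->rk_fmt(buf, size, rk->rk_data));

	buf[0] = '\0';
	for (size_t i = 0; i < rk->rk_len && off + 3 <= size; i++)
		off += (size_t)snprintf(buf + off, size - off, "%02x", p[i]);
	return ((int)off);
}

/* Example formatter that interprets the raw data as two byte values. */
static int
demo_pair_fmt(char *buf, size_t size, const void *data)
{
	const unsigned char *p = data;

	return (snprintf(buf, size, "first=%u second=%u",
	    (unsigned)p[0], (unsigned)p[1]));
}

/* Show the same sample data with or without the callback. */
static const char *
demo_show_sample(int use_callback)
{
	static const unsigned char sample[2] = { 0x01, 0xab };
	static char out[64];
	demo_raw_kstat_t rk = { sample, sizeof (sample),
	    use_callback ? demo_pair_fmt : NULL };

	demo_raw_show(&rk, out, sizeof (out));
	return (out);
}
```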

Additionally the kstats are made writable because it's now possible
to provide a useful handler via the existing ks_update() interface.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #296
2013-10-16 14:48:35 -07:00
Richard Yao 4768c0d0a6 Define SET_ERROR()
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-10-09 14:20:46 -07:00
Brian Behlendorf 429fe89cee Consistently use local_irq_disable/local_irq_enable
It was observed that spl_kmem_cache_alloc() uses local_irq_save()
and saves the interrupt state in a local variable.  This would
normally be fine except that spl_kmem_cache_alloc() calls
spl_cache_refill() which re-enables interrupts.  It is then
possible that while interrupts are enabled the process is
rescheduled to a different cpu before being disabled again.
This could result in us restoring the saved interrupt state
from one cpu to another.

What the consequences of this are aren't perfectly clear, but
this is clearly a bug and it has the potential to cause issues.
The code has been updated to just use local_irq_enable() and
local_irq_disable() to avoid this.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-10-09 14:00:56 -07:00
Kohsuke Kawaguchi 6a69693961 Document how to run SPLAT
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #294
2013-10-09 13:52:59 -07:00
Ned Bass 3ecf2d2bb6 Add kpreempt() compatibility macro
This is needed for the Illumos #4045 write throttle patch.  It is used
in the arc eviction code to avoid blocking all arc activity by sitting on
arcs_mtx too long.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #286
2013-10-09 13:52:55 -07:00
Richard Yao df2c0f1849 Replace current_kernel_time() with getnstimeofday()
current_kernel_time() is used by the SPLAT, but it is not meant for
performance measurement. We modify the SPLAT to use getnstimeofday(),
which is equivalent to the gethrestime() function on Solaris.
Additionally, we update gethrestime() to invoke getnstimeofday().

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #279
2013-10-09 13:28:30 -07:00
Brian Behlendorf e90856f1d2 Tag spl-0.6.2
META file and release log updated.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-08-16 15:17:35 -07:00
Richard Yao f7fd6ddd96 Linux 3.8 compat: Use kuid_t/kgid_t when required
When CONFIG_UIDGID_STRICT_TYPE_CHECKS is enabled uid_t/gid_t are
replaced by kuid_t/kgid_t, which are structures instead of integral
types. This causes any code that uses an integral type to fail to build.
The User Namespace functionality introduced in Linux 3.8 requires
CONFIG_UIDGID_STRICT_TYPE_CHECKS, so we could not build against any
kernel that supported it.

We resolve this by converting between the new kuid_t/kgid_t structures
and the original uid_t/gid_t types.
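
Why integral code stops building, and what the conversion looks like, can
be shown with a user-space analogy.  The demo_kuid_t/demo_kgid_t wrappers
and their helpers are hypothetical; the kernel's own conversion helpers
and their namespace arguments are not modeled here.

```c
#include <sys/types.h>

/* Struct wrappers: using one where an integer is expected will not compile. */
typedef struct { uid_t val; } demo_kuid_t;
typedef struct { gid_t val; } demo_kgid_t;

/* Convert from the plain integral types to the wrapped types. */
static inline demo_kuid_t
demo_make_kuid(uid_t uid)
{
	demo_kuid_t k = { uid };
	return (k);
}

static inline demo_kgid_t
demo_make_kgid(gid_t gid)
{
	demo_kgid_t k = { gid };
	return (k);
}

/* Convert back to the plain integral types. */
static inline uid_t
demo_from_kuid(demo_kuid_t k)
{
	return (k.val);
}

static inline gid_t
demo_from_kgid(demo_kgid_t k)
{
	return (k.val);
}
```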

Original-patch-by: DHE
Rewrite-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #260
2013-08-09 10:09:29 -07:00
Richard Yao e3c4d44886 PaX/GrSecurity Linux 3.8.y compat: Use __no_const on struct ctl_table
The PaX team started constifying `struct ctl_table` as of their Linux
3.8.0 patchset. This led to zfsonlinux/spl#225 and Gentoo bug #463012.

While investigating our options, I learned that there is a preprocessor
directive called CONSTIFY_PLUGIN that we can use to detect the presence
of the PaX changes and adjust the code accordingly.

The PaX Team had suggested adopting ctl_table_no_const, but supporting
older kernels required declaring that whenever the CONSTIFY_PLUGIN was
set. Future compiler changes could potentially cause that to break in
the presence of -Werror, so instead we define our own spl_ctl_table
typedef and use that. This should be compatible with all PaX kernels.

This introduces a Linux kernel version number check to prevent a build
failure on versions of the PaX GCC plugin that existed for kernels
before Linux 3.8.0. Affected versions of the PaX plugin will trigger a
compiler error when they see a no_const cast on a non-constified
structure.  Ordinarily, we would need an autotools check to catch that.
However, it is safe to do a kernel version check instead of an autotools
check in this specific instance because the affected versions of the PaX
GCC plugin only exist for Linux kernels before 3.8.0 and the
constification of `struct ctl_table` by the PaX developers only occurs
in Linux 3.8.0 and later.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #225
2013-08-08 09:51:34 -07:00
Richard Yao 251e7a779b Fix race in spl_kmem_cache_reap_now()
The current code contains a race condition that triggers when bit 2 in
spl.spl_kmem_cache_expire is set, spl_kmem_cache_reap_now() is invoked
and another thread is concurrently accessing its magazine.

spl_kmem_cache_reap_now() currently invokes spl_cache_flush() on each
magazine in the same thread when bit 2 in spl.spl_kmem_cache_expire is
set. This is unsafe because there is one magazine per CPU and the
magazines are lockless, so it is impossible to guarantee that another
CPU is not using its magazine when this function is called.

The solution is to only touch the local CPU's magazine and leave other
CPUs' magazines to other CPUs.

Reported-by: DHE
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #274
2013-08-08 09:14:41 -07:00
Richard Yao ba06298072 Linux 3.11 compat: Replace num_physpages with totalram_pages
num_physpages was removed by
torvalds/linux@cfa11e08ed, so let's replace
it with totalram_pages.

This is a bug fix as much as it is a compatibility fix because
num_physpages did not reflect the number of pages actually available to
the kernel:

http://lkml.indiana.edu/hypermail/linux/kernel/0908.2/01001.html

Also, there are known issues with memory calculations when ZFS is in a
Xen dom0. There is a chance that using totalram_pages could resolve
them. This conjecture is untested at the time of writing.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #273
2013-08-08 09:14:29 -07:00
Brian Behlendorf 0b15402db3 Add kmod repo integration
When the kmod packaging infrastructure was originally added the
dependency on the rpmfusion yum repositories was disabled.  This
was done at the time in favour of getting local builds working.

Now the time has come to conditionally re-enable that functionality
so we can properly provide binary kmod packages.

  ./configure --with-config=srpm
  make SRPM_DEFINE_KMOD='--define="repo rpmfusion"' srpm-kmod
  mock rebuild spl-kmod-x.y.z-r.el6.src.rpm

One nice benefit of finishing this work is that the generic and
fedora spl-kmod spec files can be merged again.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-08-01 10:27:34 -07:00
Brian Behlendorf ceb3872825 Fix KMC_OFFSLAB type caches
Because spl_slab_size() was always returning -ENOSPC for caches of
type KMC_OFFSLAB the cache could never be created.  Additionally
the slab size is rounded up to a page which is what kv_alloc()
expects.  The kv_alloc() code will minimally allocate a page,
in the KMC_OFFSLAB case this could be reduced.

The basic regression tests kmem:slab_small, kmem:slab_large,
and kmem:slab_align were updated to test KMC_OFFSLAB.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ying Zhu <casualfisher@gmail.com>
Closes #266
2013-07-30 15:39:23 -07:00
Brian Behlendorf b9b3715346 Return -1 for generic kmem cache shrinker
It has been observed that it's possible to get in a state where
shrink_slabs() will spin repeatedly invoking the generic kmem cache
shrinker.  It fails to detect it's not making forward progress
reclaiming from the cache and doesn't give up.  To ensure this
never occurs we unconditionally return -1 after reclaiming what
we can.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes zfsonlinux/zfs#1276
Closes zfsonlinux/zfs#1598
Closes zfsonlinux/zfs#1432
2013-07-30 15:33:24 -07:00
James H c47efbc7fd Modify gethrestime to use current_kernel_time()
This allows us to get nanosecond resolution. It also means
we use the same time source as utimensat(now) etc.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #255
2013-07-15 09:17:19 -07:00
Brian Behlendorf f7f344f1b0 Improve build instructions
Make it clear that when building directly from the Git tree
the configure script must be manually generated by running the
autogen.sh script.  This requires that the GNU autotools packages
be installed for your distribution.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#1448
2013-07-11 16:12:18 -07:00
Brian Behlendorf ab4e74cc38 Fix bogus kmem leak warning
Commit 5c7a036 correctly relocated the creation of a taskq
and the registration of the kmem_cache_shrinker after the
initialization of the kmem tracking code.  However, the
cleanup of these structures was not done before the leak
checks in spl_kmem_fini().  This resulted in an incorrect
'kmem leaked' warning even though there was no actual leak.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#1569
2013-07-10 15:08:22 -07:00
Brian Behlendorf b1424adda5 Fix --enable-debug-kmem-tracking option
This code has gotten somewhat stale and no longer builds cleanly
against modern kernels.  The two issues addressed here are as
follows:

* The hlist_*_rcu interfaces in the kernel have been relatively
  unstable.  Since this isn't performance critical code just use
  the long standing hlist_* variants.

* In older kernels the hash_ptr() function takes a 'void *' but
  in newer kernels it expects a 'const void *'.  To silence the
  compiler warnings about this explicitly cast it to a 'void *'.
  The memset function is a similar case but it always expects
  a 'void *'.
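
A minimal sketch of the cast described in the second item.  The function
legacy_hash_ptr() is a stand-in invented for this example (it is not the
kernel's hash_ptr() and uses a toy hash); only the explicit cast at the
call site is the point.

```c
#include <stdint.h>

/* Stand-in for an older API whose parameter is a non-const 'void *'. */
static unsigned long legacy_hash_ptr(void *ptr, unsigned int bits)
{
	/* Toy hash for illustration only. */
	return ((uintptr_t)ptr >> 4) & ((1UL << bits) - 1);
}

/* The caller holds a 'const void *' and casts it explicitly. */
static unsigned long hash_const_addr(const void *addr, unsigned int bits)
{
	return legacy_hash_ptr((void *)addr, bits);
}
```

The cast changes nothing at runtime; it only makes the call compile
without a discarded-qualifier warning against either prototype.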

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #256
2013-07-09 09:23:54 -07:00
Brian Behlendorf 5bc941f3cd Merge branch 'linux-3.10'
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #257
2013-07-08 15:27:32 -07:00
Richard Yao f2a745c41d Linux 3.10 compat: Do not rely on struct proc_dir_entry definition
Linux kernel commit torvalds/linux#59d8053f moved the definition of
struct proc_dir_entry from include/linux/proc_fs.h to the private
header fs/proc/internal.h. The SPL relied on that to map Solaris'
kstat to entries in /proc/spl/kstat.

Since the proc_dir_entry structure is now private the only safe
thing to do is wrap the opaque proc handle with our own structure.
This actually ends up simplifying the code and is good because it
moves us away from depending on implementation details of /proc.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #257
2013-07-08 15:25:18 -07:00
Yuxuan Shui 79a7ab2581 Linux 3.10 compat: add missing include of linux/slab.h
Linux kernel commit torvalds/linux@0d01ff2 changes some
includes we were depending on through linux/proc_fs.h.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #257
2013-07-08 15:21:28 -07:00
Yuxuan Shui 1ddf9722dc Linux 3.10 compat: replace PDE()->data with PDE_DATA()
Linux kernel commit torvalds/linux@d9dda78b renamed PDE() to
PDE_DATA().  To handle this detect the preferred interface
and define a PDE_DATA() wrapper for consistency.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #257
2013-07-08 15:14:21 -07:00
Yuxuan Shui c02ab72fb9 Linux 3.10 compat: struct vmalloc_info moved
Linux kernel commit torvalds/linux@db3808c1 moved the
vmalloc_info structure from a private to a public header.
Now that it's available for kernel modules use it.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #257
2013-07-08 15:09:20 -07:00
Nathaniel Clark 485b471eb2 Add --buildroot option to kmod build
This allows rpmbuild to define buildroot to point to where kernel
data is located.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #242
2013-06-21 15:46:16 -07:00
Matthew Thode 991857cac5 Copy spl.release.in to kernel dir
Required when compiling ZFS in the kernel.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #253
2013-06-21 15:40:04 -07:00
Brian Behlendorf ab0fdfef52 Fix ASSERT0 and VERIFY0 macro typo
Ensure the value is cast to a 'long long' for printing purposes.  The
expectation is that ASSERT0/VERIFY0 are mostly used for validating
return values and thus may commonly be negative.
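
A userspace sketch of why the cast matters.  The macro and helper names
below are hypothetical and the message format is an assumption; the
sketch only shows a signed value being widened to 'long long' so that a
negative return code prints as a negative number.

```c
#include <stdio.h>

/* Hypothetical helper: build the message a VERIFY0-style macro might print. */
static int format_verify0_failure(char *buf, size_t len, const char *expr,
    long long value)
{
	return snprintf(buf, len, "VERIFY0(%s) failed (0 == %lld)", expr, value);
}

/* Cast to 'long long' so negative error codes keep their sign with %lld. */
#define SKETCH_VERIFY0_MSG(buf, len, x) \
	format_verify0_failure((buf), (len), #x, (long long)(x))
```

For an int rc holding -5, SKETCH_VERIFY0_MSG(buf, sizeof (buf), rc)
writes "VERIFY0(rc) failed (0 == -5)" into buf.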

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #246
2013-06-21 15:38:46 -07:00
Brian Behlendorf 1c6d149feb Add ASSERT0 and VERIFY0 macros
The Illumos code introduced the ASSERT0 and VERIFY0 macros which
are to be used instead of ASSERT3S(x, ==, 0) and VERIFY3S(x, ==, 0).

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Madhav Suresh <madhav.suresh@delphix.com>
Closes #246
2013-06-18 11:41:55 -07:00
Tim Chase 5c7a0369e2 Fix --enable-debug-kmem-tracking option
Re-order initialization in spl_kmem_init to allow for kmem tracking
to work.  The spl_kmem_init function calls taskq_create prior to
initializing the tracking (calling spl_kmem_init_tracking).  Since
taskq_create uses kmem_alloc, NULL dereferences occur because the
global kmem_list hasn't had its next & prev pointers initialized yet.

This commit moves the calls to spl_kmem_init_tracking earlier in the
spl_kmem_init function in order that the subsequent kmem_alloc calls
(by taskq_create) work properly.
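
The ordering problem can be reproduced in miniature with a circular
doubly linked list.  All names below are illustrative stand-ins rather
than the SPL symbols; sketch_init() plays the role of spl_kmem_init and
the list_add() call stands in for the tracked allocation.

```c
#include <stddef.h>

struct node {
	struct node *next, *prev;
};

/* Zero-initialized global: next and prev start out NULL. */
static struct node tracking_list;

static void list_init(struct node *head)
{
	head->next = head;
	head->prev = head;
}

static void list_add(struct node *entry, struct node *head)
{
	/* Dereferences head->next, so head must be initialized first. */
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
}

static void sketch_init(struct node *first)
{
	list_init(&tracking_list);		/* must come first */
	list_add(first, &tracking_list);	/* the tracked allocation */
}
```

Swapping the two calls in sketch_init() would dereference a NULL
head->next, which is the failure mode the message describes.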

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #243
2013-06-18 11:40:33 -07:00
Brian Behlendorf 99c452bbba Fix taskq_wait_id()
The existing taskq_wait_id() function can incorrectly block
indefinitely.  Reimplement it more simply using wait_event()
in a similar fashion to taskq_wait_all().

This flaw was uncovered in the context of moving vn_rdwr() to
a taskq.  Previously taskq_wait_id() had no consumers outside
the SPLAT task framework which is why the issue went unnoticed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-05-03 14:32:29 -07:00
Brian Behlendorf ab59be7bc7 Fix delay()
Somewhat amazingly it went unnoticed that the delay() function
doesn't actually cause the task to block.  Since the task state
is never changed from TASK_RUNNING before schedule_timeout() the
scheduler allows the task to continue running without any delay.
Using schedule_timeout_uninterruptible() resolves the issue by
correctly setting TASK_UNINTERRUPTIBLE.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-05-01 16:35:47 -07:00
Brian Behlendorf f6437b60c2 Add msec/usec/nsec to tick convertors
Add wrappers for the Solaris MSEC_TO_TICK, USEC_TO_TICK, and
NSEC_TO_TICK conversion functions.  They are mapped directly to
their Linux counterparts with the exception of NSEC_TO_TICK,
which cannot use nsecs_to_jiffies() because it is not exported
by the kernel.
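
A userspace sketch of the arithmetic, assuming a tick rate of 100 Hz and
helpers that round up.  Every name here is hypothetical, and expressing
the nanosecond macro through the microsecond helper is an assumption
about one way to avoid a dedicated nanosecond conversion.

```c
#include <stdint.h>

#define SKETCH_HZ 100ULL	/* assumed tick rate for illustration */

/* Round up so a nonzero microsecond delay never becomes zero ticks. */
static uint64_t sketch_usecs_to_ticks(uint64_t us)
{
	return (us * SKETCH_HZ + 999999ULL) / 1000000ULL;
}

/* Same rounding for milliseconds. */
static uint64_t sketch_msecs_to_ticks(uint64_t ms)
{
	return (ms * SKETCH_HZ + 999ULL) / 1000ULL;
}

#define SKETCH_MSEC_TO_TICK(ms)	sketch_msecs_to_ticks(ms)
#define SKETCH_USEC_TO_TICK(us)	sketch_usecs_to_ticks(us)
/* Nanoseconds are first truncated to whole microseconds. */
#define SKETCH_NSEC_TO_TICK(ns)	sketch_usecs_to_ticks((ns) / 1000ULL)
```

Note that under these assumptions a delay shorter than one microsecond
converts to zero ticks because of the truncating division.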

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-05-01 12:07:56 -07:00
Turbo Fredriksson 8bbda8df3e Ignore *.{deb,rpm,tar.gz} files in the top directory.
These are build products and should be ignored.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Issue zfsonlinux/zfs#1402
2013-04-24 16:18:14 -07:00
Turbo Fredriksson 16253cff43 Add --bump=0 to alien
Preserve the release field when creating Debian packages.  The
--keep-version option was not used because it results in a failure
when the git '<commit>_<hash>' syntax is used for the release.
The '_' is a valid character for RPM packages but not for DEBs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Issue zfsonlinux/zfs#1402
Issue zfsonlinux/zfs#928
2013-04-24 16:18:11 -07:00
Turbo Fredriksson 2c21370746 Support .nogitrelease file
When building a custom release in a git tree provide the ability
to prevent the release field from being overwritten by the
`git describe` output.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#1402
2013-04-24 16:18:03 -07:00
Etienne Dechamps c1b20ce320 Fix various generic kmod RPM spec issues.
There are a number of issues with the generic kmod RPM spec in its
current state:
 - The "%{__id_u}" macro seems to not be available on some systems (e.g.
   Debian squeeze). It appears it has been deprecated. Use "%{__id} -u"
   instead.
 - The way the "--with-linux=" configure option is generated in the
   non-RHEL/Fedora case is completely wrong with various newline and
   escaping issues (also, $kernel_version is not available in the
   generator context).

The second issue made the generator shell snippet (almost) silently
fail, which under specific circumstances can result in broken builds
against the wrong kernel sources.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #231
2013-04-24 16:17:57 -07:00
Brian Behlendorf 352bd19482 Add additional dependencies for DKMS package
For the DKMS package to successfully build the kernel-devel
headers must be included along with gcc, make, and perl.  The SPL
code never directly invokes perl but the kernel build system
depends on it.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#1380
2013-04-02 16:07:56 -07:00
Brian Behlendorf 7fd629d430 Replace the SPL_AC_META perl dependency with awk
The only remaining perl dependency is part of the SPL_AC_META macro.
By eliminating this and replacing it with awk we can avoid the need
to pull in perl to rebuild the packages.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#1380
2013-04-02 16:04:19 -07:00
Brian Behlendorf c76b1dab8d Automake 1.10.1 compat: AM_SILENT_RULES
Part of the automated testing involves building the source on Debian Lenny
which ships an ancient version of automake (1.10.1).  Historically, this
has caused a non-fatal warning about AM_SILENT_RULES not being defined.
But when the autogen.sh script was updated to use autoreconf the warning
became fatal.

  configure.ac:31: warning: macro `AM_SILENT_RULES' not found in library
  autoreconf: running: /usr/bin/autoconf --force
  configure.ac:34: error: possibly undefined macro: AM_SILENT_RULES
        If this token and others are legitimate, please use m4_pattern_allow.

To resolve this build issue the call to AM_SILENT_RULES has been wrapped
by m4_ifdef().  This prevents the macro from being expanded on platforms
where it's undefined.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-04-02 16:04:19 -07:00
Jan Engelhardt 83918aebe5 build: do not call boilerplate ourself
For the rationale, see section 3.5 "Using `autoreconf' to Update `configure'
Scripts" of the autoconf manual.

Signed-off-by: Jan Engelhardt <jengelh@inai.de>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-04-02 11:08:46 -07:00
Jan Engelhardt a9e86ac4fd gitignore: anchor entries at their respective directory
.ko is specific to module, .m4 to config, etc.

Signed-off-by: Jan Engelhardt <jengelh@inai.de>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-04-02 11:07:52 -07:00
Jan Engelhardt 92c4ea38c9 build: use CPPFLAGS
-D and -I are preprocessor flags, so should preferably be in the
appropriate variable.

Signed-off-by: Jan Engelhardt <jengelh@inai.de>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-04-02 11:07:11 -07:00
Jan Engelhardt 7a8a639390 build: resolve orthographic and other grammatical errors
Signed-off-by: Jan Engelhardt <jengelh@inai.de>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-04-02 11:06:38 -07:00
Brian Behlendorf 6385874dbf Tag spl-0.6.1
META file and release log updated.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-25 13:46:47 -07:00
Brian Behlendorf 8636968f9a Provide ${kmodname}-devel-kmod for yum-builddep
In order to ensure that yum-builddep pulls in all the build
requirements a generic ${kmodname}-devel-kmod provides line is
added.  This allows a version of the development headers to be
included without requiring knowledge of the kernel version.

This is important because unlike rpmbuild which does correctly
expand the source rpm spec file, yum-builddep does not.  Without
this generic provides line mock which relies on yum-builddep is
unable to automatically satisfy the dependency.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-25 13:30:22 -07:00
Brian Behlendorf c14183adca Use 'git describe' for working builds
When building from an arbitrary commit in the git tree it's useful
for the resulting packages to be uniquely identifiable.  Therefore,
the build system has been updated to detect if you're compiling in
a git tree.

If you are building in a git tree, and there are commits after the
last annotated tag, then the <id>-<hash> component of 'git describe'
will be used to overwrite the 'Release:' field in the META file.

The only tricky part is ensuring that the 'make dist' tarball is
built using the correct release.  A dist-hook was added to the top
level make file to rewrite the META file using the correct release.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #195
Issue #111
2013-03-22 15:00:55 -07:00
Richard Yao feaf1e321d Do not call cond_resched() in spl_slab_reclaim()
Calling cond_resched() after each object is freed and then after each
slab is freed can cause slabs of objects to live for excessive periods
of time following reclamation. This interferes with the kernel's own
memory management when called from kswapd and can cause direct reclaim
to occur in response to memory pressure that should have been resolved.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
2013-03-21 12:58:44 -07:00
Brian Behlendorf bef14fbc8c Use requested kernel for dkms builds
The --with-linux and --with-linux-obj options must be specified
as part of the dkms build otherwise the package will be built
against the running kernel.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-20 16:01:29 -07:00
Brian Behlendorf 19e9d8fd61 Remove spl-dkms conflict with spl-kmod
Because the spl-dkms package also provides spl-kmod for the
spl user package, yum flags this as a conflict.  To avoid the
problem remove the Conflicts tag from spl-dkms and just rely
on the one in spl-kmod.

  spl-dkms-0.6.0-rc14.fc18.noarch has installed conflicts
    spl-kmod: spl-dkms-0.6.0-rc14.fc18.noarch

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-20 11:33:15 -07:00
Darik Horn 4074820904 Create splat man page
The automake templates have been updated to install this man
page and the existing packaging was updated to include it.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-19 13:47:12 -07:00
Brian Behlendorf 493972c896 Refresh RPM packaging
Refresh the existing RPM packaging to conform to the 'Fedora
Packaging Guidelines'.  This includes adopting the kmods2
packaging standard which is used for kmods distributed by
rpmfusion for Fedora/RHEL.

  http://fedoraproject.org/wiki/Packaging:Guidelines
  http://rpmfusion.org/Packaging/KernelModules/Kmods2

While the spec files have been entirely rewritten, from a
user perspective the only major changes are:

* The Fedora packages now have a build dependency on the
  rpmfusion repositories.  The generic kmod packages also
  have a new dependency on kmodtool-1.22 but it is bundled
  with the source rpm so no additional packages are needed.

* The kernel binary module packages have been renamed from
  spl-modules-* to kmod-spl-* as specified by kmods2.

* There is now a common kmod-spl-devel-* package in addition
  to the per-kernel devel packages.  The common package
  contains the development headers while the per-kernel
  package contains kernel specific build products.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #222
2013-03-18 15:31:54 -07:00
Brian Behlendorf 4a6d8d2c3e Change spl-kmod-devel install path
Install the common spl kernel development headers under
/usr/src/spl-<version>/ rather than in a kernel specific
directory.  The kernel specific build products such as
spl_config.h and Module.symvers are left installed under
/usr/src/spl-<version>/<kernel>.

This was done to be consistent with where dkms expects
kernel module source to be packaged.  It also allows for
a common spl-kmod-devel package which includes the headers,
and per-kernel spl-kmod-devel-<kernel> packages.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-14 12:01:05 -07:00
Brian Behlendorf 5c30c47a45 Merge branch 'linux-3.9'
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #221
2013-03-14 10:43:51 -07:00
Richard Yao 4a31e5aa9b Linux 3.9 compat: Switch to hlist_for_each{,_rcu}
torvalds/linux@b67bfe0d42 changed
hlist_for_each_entry{,_rcu} to take 3 arguments instead of 4. We handle
this by switching to hlist_for_each{,_rcu}, which works across all
supported kernels.
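
A userspace sketch of the pattern: iterate over the raw list nodes with
a two argument macro and recover the containing structure by hand.  The
types and macros below are simplified stand-ins written for this
example, not the kernel's hlist implementation.

```c
#include <stddef.h>

struct hnode {
	struct hnode *next;
};

struct hhead {
	struct hnode *first;
};

struct item {
	int key;
	struct hnode link;
};

/* Two argument iterator over the nodes, independent of the entry type. */
#define sketch_for_each(pos, head) \
	for ((pos) = (head)->first; (pos) != NULL; (pos) = (pos)->next)

/* Recover the containing structure from a pointer to its member. */
#define sketch_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static struct item *find_item(struct hhead *head, int key)
{
	struct hnode *pos;

	sketch_for_each(pos, head) {
		struct item *it = sketch_entry(pos, struct item, link);

		if (it->key == key)
			return it;
	}
	return NULL;
}
```

Because the iterator never names the entry type, its argument list does
not depend on which form of the entry-based macro a kernel provides.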

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-14 10:43:34 -07:00
Richard Yao 8274ed5988 Drop support for 3 argument version of set_fs_pwd
This was a suggestion that Brian Behlendorf made when reviewing an early
pull request for Linux 3.9 support. This commit was made intentionally
easy to revert should we ever have a reason to reintroduce support for
older kernels.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-14 10:43:31 -07:00
Richard Yao a54718cfe0 Linux 3.9 compat: set_fs_root takes const struct path *
torvalds/linux@dcf787f391 enforces
const-correctness in passing struct path *.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-14 10:43:29 -07:00
Richard Yao 2a305c34c8 Linux 3.9 compat: vfs_getattr takes two arguments
The function prototype of vfs_getattr previously took struct vfsmount *
and struct dentry * as arguments. These would always be defined together
in a struct path *.

torvalds/linux@3dadecce20 modified
vfs_getattr to take struct path * as an argument instead.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-14 10:43:26 -07:00
Richard Yao bc90df6688 Linux 3.9 compat: Do not depend on f_vfsmnt
torvalds/linux@182be68478 removed the
preprocessor definition for f_vfsmnt. The ability to access the
mountpoint via ->f_path.mnt has been stable for a long time, so we
switch to that.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-14 10:43:23 -07:00
Richard Yao 10087fe1fa Linux 3.9 compat: Include linux/sched/rt.h
Linux 3.9 reorganized sched.h, splitting it into numerous files.
torvalds/linux@8bd75c77b7 moved MAX_PRIO
and MAX_RT_PRIO to linux/sched/rt.h.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-14 10:43:19 -07:00
Brian Behlendorf ea5c4389fb Merge branch 'build-system' 2013-03-06 15:49:57 -08:00
Ned Bass 3d6af2dd6d Refresh links to web site
Update links to refer to the official ZFS on Linux website instead of
@behlendorf's personal fork on github.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-04 19:09:34 -08:00
Brian Behlendorf 5f0a4b0847 Remove ARCH packaging
The kernel modules are now available in the Arch User Repository
(AUR) via zfs.  Since their packaging is maintained and superior
to ours it is being removed from the tree.

  https://wiki.archlinux.org/index.php/ZFS

Now that various distributions are picking up the packages we
should eventually be able to remove most of this infrastructure.
Packaging belongs with the distributions not upstream.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-04 19:09:34 -08:00
Brian Behlendorf d1142fbffe Remove custom install-data-local for headers
Rather than use a custom install target it is cleaner to define
a 'kerneldir' and set 'kernel_HEADERS' appropriately.  This
allows us to leverage the standing configure install support.

Additionally, I took this opportunity to add the missing make files
to the include subdirectories.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-01 16:55:06 -08:00
Brian Behlendorf 0298f3d67f Add KMODDIR to install target
Provide a mechanism to control the directory name the modules
are installed in.  The kernel provides INSTALL_MOD_DIR for
this but it was hardcoded to be 'addon/spl'.

Add a KMODDIR variable which can be passed to 'make install'
to override the default directory name.  While we're here
change the default from 'addon/spl' to 'extra' which is the
kernel.org default.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-01 16:55:06 -08:00
Brian Behlendorf fea77534f0 Fix spl_config.h install permissions
The default permissions used by install are 755.  Since this
file isn't executable 644 is more appropriate.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-01 16:55:06 -08:00
Brian Behlendorf 8adf71e9b0 Remove INSTALL
The generic INSTALL instructions can be safely dropped.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-03-01 16:55:00 -08:00
Brian Behlendorf 4bf3909e51 Disable automatic log dumping
Long ago infrastructure was added to the SPL to keep an internal
debug log of the last few seconds of activity.  This was helpful
during the early development, but these days it is no longer
needed.  I haven't had to resort to this debug buffer to resolve
an issue for several years now.

Today better, more generic tools like systemtap and ftrace have
evolved to the point where they can be used for this purpose.
Along with the stack trace dumped to the system console, and in
rare cases a crash dump, we almost always have the debug data we need.

Therefore, I'm disabling the code which automatically dumps
this log to disk during an assertion except for the case where
spl_debug_panic_on_bug is set (disabled by default).

This should be viewed as a first step towards either:

  a) Retiring this infrastructure and complexity entirely, or
  b) Integrating this logging more properly with ftrace.

As part of this change I'm also removing from the packages the
undocumented spl utility which is used to decode the binary logs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-02-05 16:13:27 -08:00
Richard Yao a0625691b3 Fix HAVE_MUTEX_OWNER_TASK_STRUCT autotools check on PPC64
The HAVE_MUTEX_OWNER_TASK_STRUCT check fails on PPC64 with the following
error:

error: 'current' undeclared (first use in this function)

We include linux/sched.h to ensure that current is available.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-02-05 15:36:03 -08:00
Brian Behlendorf dd3678fc29 Fix atomic64_* autoconf checks
The SPL_AC_ATOMIC_SPINLOCK, SPL_AC_TYPE_ATOMIC64_CMPXCHG, and
SPL_AC_TYPE_ATOMIC64_XCHG were all directly including the
'asm/atomic.h' header.  As of Linux 3.4 this header was removed
which results in a build failure.

The right thing to do is include 'linux/atomic.h' however we
can't safely do this because it doesn't exist in 2.6.26 kernels.
Therefore, we include 'linux/fs.h' which in turn includes the
correct atomic header regardless of the kernel version.

When these incorrect APIs are used in ZFS the following build
failure results.

  arc.c:791:80: warning: '__ret' may be used uninitialized
  in this function [-Wuninitialized]
  arc.c:791:1875: error: call to '__cmpxchg_wrong_size'
  declared with attribute error: Bad argument size for cmpxchg

Since this is all Linux 2.6.24 compatibility code there's
an argument to be made that it should be removed because
kernels this old are not supported.  However, because we're
so close to a release I'm going to leave it in place for now.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#814
Closes zfsonlinux/zfs#1254
2013-02-05 10:05:46 -08:00
Brian Behlendorf 869f30f1ae SPL 0.6.0-rc14 2013-02-01 11:24:54 -08:00
Brian Behlendorf 6ef94aa67a Fix tsd_get/set() race with tsd_exit/destroy()
The tsd_exit() and tsd_destroy() functions remove entries from
hash bins without taking the hash bin lock.  They do take the
table lock, but tsd_get() and tsd_set() only take the hash bin
lock to allow for maximum concurrency.

The result is that while tsd_get() and tsd_set() are traversing
the hash bin list it can be modified by another thread which
happens to hash to the same value.  To avoid this add the needed
locking to tsd_exit() and tsd_destroy().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #174
2013-01-31 13:54:59 -08:00
Brian Behlendorf de081a2ab4 Check for KALLSYMS
Check at ./configure time that the kernel was built with kallsyms
support.  If the kernel doesn't have CONFIG_KALLSYMS defined the
modules will still compile cleanly but will not be loadable.  So
we really want to catch this early during ./configure.  Note that
we do not require CONFIG_KALLSYMS_ALL but it may be safely defined.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #6
2013-01-29 16:35:23 -08:00
Eric Dillmann 3cbfd259b7 Define BE_IN16 & BE_IN32 for lz4 compression
The new lz4 compression algorithm, zfsonlinux/zfs@9759c60, requires
the generic BE_IN16 and BE_IN32 functions.  These are added to the SPL
for other consumers to take advantage of.
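
For readers unfamiliar with these helpers, a big-endian read can be
sketched as below.  The function names are hypothetical and written as
inline functions for this example; the SPL additions themselves are
BE_IN16 and BE_IN32 as named above.

```c
#include <stdint.h>

/* Read a 16-bit big-endian value from a byte buffer. */
static uint16_t sketch_be_in16(const void *p)
{
	const uint8_t *b = p;

	return (uint16_t)(((uint16_t)b[0] << 8) | b[1]);
}

/* Read a 32-bit big-endian value as two 16-bit halves. */
static uint32_t sketch_be_in32(const void *p)
{
	const uint8_t *b = p;

	return ((uint32_t)sketch_be_in16(b) << 16) | sketch_be_in16(b + 2);
}
```

Reading byte by byte gives the same result regardless of host byte order
and places no alignment requirement on the buffer.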

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-29 09:30:23 -08:00
Brian Behlendorf 0936c3449f Add spl_kmem_cache_expire module option
Cache aging was implemented because it was part of the default Solaris
kmem_cache behavior.  The idea is that per-cpu objects which haven't been
accessed in several seconds should be returned to the cache.  On the other
hand Linux slabs never move objects back to the slabs unless there is
memory pressure on the system.

This behavior is now configurable through the 'spl_kmem_cache_expire'
module option.  The value is a bit mask with the following meaning.

  0x1 - Solaris style cache aging eviction is enabled.
  0x2 - Linux style low memory eviction is enabled.

Both methods may be safely enabled simultaneously, but by default
both are disabled.  It has never been clear if the kmem cache aging
(which has been around from day one) actually does any good.  It has
however been the source of numerous bugs so I wouldn't mind retiring
it entirely.
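
The bit mask can be decoded as in the sketch below.  The two bit values
come from the list above; the macro and function names are hypothetical
and chosen only for this illustration.

```c
/* Bit values from the commit message; the names are illustrative. */
#define SKETCH_EXPIRE_AGE 0x1u	/* Solaris style cache aging eviction */
#define SKETCH_EXPIRE_MEM 0x2u	/* Linux style low memory eviction */

/* Nonzero when the Solaris style aging bit is set in mask. */
static int sketch_aging_enabled(unsigned int mask)
{
	return (mask & SKETCH_EXPIRE_AGE) != 0;
}

/* Nonzero when the Linux style low memory bit is set in mask. */
static int sketch_lowmem_enabled(unsigned int mask)
{
	return (mask & SKETCH_EXPIRE_MEM) != 0;
}
```

A value of 0x3 therefore enables both methods and 0 disables both.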

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#1227
Closes #210
2013-01-28 09:34:12 -08:00
Brian Behlendorf 84dd1f4f15 Remove spl_invalidate_inodes()
This functionality is no longer required by ZFS, see commit
zfsonlinux/zfs@7b3e34ba5a.
Since there are no other consumers, and because it adds
additional autoconf complexity which must be maintained
the spl_invalidate_inodes() function has been removed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#795
2013-01-17 11:40:47 -08:00
Brian Behlendorf d4899f4747 kmem-cache: Fix slab ageing soft lockup
Commit a10287e00d slightly reworked
the slab ageing code such that it is no longer dependent on the
Linux delayed work queue interfaces.

This was good for portability and performance, but it requires us
to use the on_each_cpu() function to execute the spl_magazine_age()
function.  That means that the function is now executing in interrupt
context whereas before it was scheduled in normal process context.
And that means we need to be slightly more careful about the locking
in the interrupt handler.

With the reworked code it's possible that we'll be holding the
skc->skc_lock and be interrupted to handle the spl_magazine_age()
IRQ.  This will result in a deadlock and soft lockup errors unless
we're careful to detect the contention and avoid taking the lock in
the interrupt handler.  So that's what this patch does.

Alternately, (and slightly more conventionally) we could have used
spin_lock_irqsave() to prevent this race entirely but I'd prefer to
avoid disabling interrupts as much as possible due to performance
concerns.  There is absolutely no penalty for us not aging objects
out of the magazine due to contention.
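
The idea of skipping the work when the lock is contended can be sketched
in userspace with a C11 atomic flag.  This is an analogy under stated
assumptions, not the patch itself: the structure, the function names,
and the use of a try-lock as the contention check are all invented for
the example.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct sketch_age_cache {
	atomic_flag lock;	/* stands in for skc->skc_lock */
	int aged;		/* number of times ageing actually ran */
};

/* Try to take the lock without waiting; true on success. */
static bool sketch_trylock(struct sketch_age_cache *c)
{
	return !atomic_flag_test_and_set(&c->lock);
}

static void sketch_unlock(struct sketch_age_cache *c)
{
	atomic_flag_clear(&c->lock);
}

/* Handler-style function: skip the work if the lock is already held. */
static bool sketch_magazine_age(struct sketch_age_cache *c)
{
	if (!sketch_trylock(c))
		return false;	/* contended: give up, nothing is lost */

	c->aged++;
	sketch_unlock(c);
	return true;
}
```

Since the handler never waits on the lock it cannot deadlock against the
code it interrupted; the skipped ageing pass is simply retried later.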

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Closes zfsonlinux/zfs#1193
2013-01-14 10:07:58 -08:00
Ned Bass 8842263bd0 call_usermodehelper() should wait for process
As of Linux 3.4 the UMH_WAIT_* constants were renumbered.  In
particular, the meaning of "1" changed from UMH_WAIT_PROC (wait for
process to complete), to UMH_WAIT_EXEC (wait for the exec, but not the
process).  A number of call sites used the number 1 instead of the
constant name, so the behavior was not as expected on kernels with
this change.
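
The hazard of passing a bare number can be shown with two small enums.
The value 1 for each numbering follows the description above; the other
values and all identifiers are assumptions made for illustration, not
copies of the kernel headers.

```c
/* Illustrative numbering before the change: 1 means wait for the process. */
enum sketch_wait_old { OLD_WAIT_EXEC = 0, OLD_WAIT_PROC = 1 };

/* Illustrative numbering after the change: 1 means wait for exec only. */
enum sketch_wait_new { NEW_WAIT_EXEC = 1, NEW_WAIT_PROC = 2 };

/* Returns 1 if 'wait' requests waiting for process completion. */
static int sketch_waits_for_process(int wait, int renumbered)
{
	return renumbered ? (wait == NEW_WAIT_PROC) : (wait == OLD_WAIT_PROC);
}
```

A call site that hardcodes 1 waits for the process under the old
numbering but not under the new one, while a call site using the named
constant keeps its meaning in both cases.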

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-09 16:54:19 -08:00
Brian Behlendorf 42b3ce622f Check for ZLIB_INFLATE and ZLIB_DEFLATE
Check at ./configure time that the kernel was built with zlib
support enabled.  This support may either be configured as a
module or builtin to the kernel.  But if it's missing the build
will fail so it's best to catch this early.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#582
2013-01-09 16:40:25 -08:00
Brian Behlendorf 050cd84e62 Linux compat 3.7.1, on_each_cpu()
Some kernels require that we include the 'linux/irqflags.h'
header for the SPL_AC_3ARGS_ON_EACH_CPU check.  Otherwise,
the functions local_irq_enable()/local_irq_disable() will not
be defined and the prototype will be misdetected as the four
argument version.

This change actually includes 'linux/interrupt.h' which in turn
includes 'linux/irqflags.h' to be as generic as possible.

Additionally, passing NULL as the function can result in a
gcc error because the on_each_cpu() macro executes it
unconditionally.  To make the test more robust we pass the
dummy function on_each_cpu_func().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #204
2013-01-09 10:28:28 -08:00
Brian Behlendorf 1c7b3eaf87 RHEL 6.4 compat, fallocate()
In the upstream kernel the FALLOC_FL_PUNCH_HOLE #define was
introduced after the fallocate() function was moved from the
inode_operations to the file_operations structure.  Therefore,
the SPL code assumed that if FALLOC_FL_PUNCH_HOLE was defined
it was safe to use f_ops->fallocate().

Unfortunately, the RHEL6.4 kernel has only backported the
FALLOC_FL_PUNCH_HOLE #define and not the fallocate() change.

To address this compatibility issue the spl_filp_fallocate()
helper function was added to properly detect which interface
is available.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-01-08 09:53:13 -08:00
Matt Johnston 46a75aadb7 Add cv_wait_io() to account I/O time
Under Linux when a task is waiting on I/O it should call the
io_schedule() function for proper accounting.  The Solaris
cv_wait() function provides no way to specify what the cv
is waiting on therefore cv_wait_io() is introduced.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #206
2013-01-07 10:29:26 -08:00
Brian Behlendorf 02d25048d2 SPL 0.6.0-rc13 2012-12-20 11:01:47 -08:00
Brian Behlendorf 5b2fdbb69c Refresh AUTHORS
The AUTHORS file was getting stale.  Refresh its contents
using the authors listed in the git commit logs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-19 09:40:18 -08:00
Brian Behlendorf dd5b6d96f1 Remove the ChangeLog
The ChangeLog was retired long ago, the git commit logs are
authoritative.  To avoid any confusion remove the ChangeLog.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-19 09:28:18 -08:00
Brian Behlendorf 034f1b331e Fix spl_kmem_init_kallsyms_lookup() panic
Due to I/O buffering the helper may return successfully before
the proc handler has a chance to execute.  To catch this case
wait up to 1 second to verify spl_kallsyms_lookup_name_fn was
updated to a non SYMBOL_POISON value.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#699
Closes zfsonlinux/zfs#859
2012-12-19 09:06:35 -08:00
Richard Yao 30196bfd42 Do not use KERNEL_DIR env var in Makefile.am
A Gentoo user reported an issue where the build system would
attempt to recurse into the kernel source tree if KERNEL_DIR
is set in the environment. KERNEL_DIR is an environment variable
that is used when the kernel sources are in a non-standard
location, so it is necessary to stop relying on it to prevent
this issue.

https://bugs.gentoo.org/show_bug.cgi?id=433946

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-17 10:59:12 -08:00
Brian Behlendorf 18e0c500a7 Merge branch 'taskq'
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #199
2012-12-12 10:45:48 -08:00
Brian Behlendorf eb0be2ed46 Removed SPL_AC_3ARGS_INIT_WORK check
All consumers of the kernel delayed work queues have been shifted
over to rely on the taskq implementation.  This compatibility code
can now be removed.  Any new callers which need this functionality
should use the taskq interfaces for delayed work items.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:57:10 -08:00
Brian Behlendorf 33e94ef1dd kmem-cache: Use a taskq for async allocations
Shift the asynchronous allocations over to use the taskq interfaces.
This allows us to abandon the kernel's delayed work queue interface
and all the compatibility code it requires.

This code never actually used the delay functionality; it was just
done this way to leverage the existing compatibility code.  All that
is required is a thread context to perform the allocation in.  The
only thing clever in this change is that we take advantage of the
preallocated task queue entries to avoid a memory allocation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:56:54 -08:00
Brian Behlendorf a10287e00d kmem-cache: Use taskqs for ageing
Shift the cache and magazine ageing functionality over to the new
delayed taskq interfaces.  This allows us to abandon the kernel's
delayed work queue interface and all the compatibility code it
requires.

However, the delayed taskq interface does not allow us to schedule
a task for a specific cpu so the ageing code was slightly reworked.
The magazine ageing delay has been directly linked to the cache
ageing function.  The spl_cache_age() function invokes on_each_cpu()
in order to run spl_magazine_age() on each cpu.  It then blocks
waiting for them to complete and promptly reclaims any free slabs.

While restructuring the code wasn't the primary goal, I think the
new code is far more understandable and maintainable.  It also should
help minimize magazine thrashing because free slabs are immediately
released after the magazine is aged.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:56:54 -08:00
Brian Behlendorf 296a8e596d kmem-cache: spl_kmem_cache_create() may always sleep
When this code was originally written I went overboard and allowed
for the possibility of creating a cache in an atomic context.  In
practice there are no callers which ever do this.  This makes sense
since a cache is by design a long lived data structure.

To prevent abuse of this function going forward I'm removing the
code which is supposed to handle an atomic context.  All allocators
have been updated to use KM_SLEEP and the might_sleep() debug macro
has been added to immediately detect atomic callers.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:56:54 -08:00
Brian Behlendorf a5a98e7260 splat taskq:front: Reduce stack frame
The slightly increased size of the taskq_ent_t when debugging is
enabled has pushed the taskq:front splat test over the frame size
limit.  To resolve this dynamically allocate the taskq_ent_t
structures so they are part of the heap instead of the stack.

  In function 'splat_taskq_test6_impl'
  error: the frame size of 1648 bytes is larger than 1024 bytes

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:56:54 -08:00
Brian Behlendorf 94ff5d38e3 splat taskq:order: Reduce stack frame
The slightly increased size of the taskq_ent_t when debugging is
enabled has pushed the taskq:order splat test over the frame size
limit.  To resolve this dynamically allocate the taskq_ent_t
structures so they are part of the heap instead of the stack.

  In function 'splat_taskq_test5_impl'
  error: the frame size of 1680 bytes is larger than 1024 bytes

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:56:54 -08:00
Brian Behlendorf 3238e71763 splat taskq:cancel: Add test case
Add a test case for taskq_cancel_id() to verify it is working
properly.  Just like taskq:delay we start by dispatching 100
tasks.  However this time 1/3 of the tasks use taskq_dispatch()
and will be run immediately, and 2/3 use taskq_dispatch_delay().
The idea is to create a busy taskq with active, pending,
and delayed tasks.

After all the items have been successfully dispatched the test
begins randomly canceling known task ids.  It will do this for
5 seconds randomly canceling a task id and then sleeping for a
few milliseconds.   The task being canceled may have already run,
still be on the pending list, or may be currently being executed
by a worker thread.  The idea is to ensure we catch any subtle
race conditions.

Once all the non-canceled tasks have completed we cross check
the number of tasks which ran with the number of tasks which
were successfully canceled.  Additionally, we verify that the
taskq_cancel_id() function never blocks longer than needed.
This time is bounded by the longest run time of the task which
was dispatched.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:56:49 -08:00
Brian Behlendorf 2f35782620 splat taskq:delay: Add test case
Add a test case for taskq_dispatch_delay() to verify it is working
properly.  The test dispatches 100 tasks to a taskq with random
expiration times spread over 5 seconds.  As each task expires and
gets executed by a worker thread it verifies that it was run at
the correct time.  Once all the delayed tasks have been executed
we double check that all the dispatched tasks were successful.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:54:07 -08:00
Brian Behlendorf d9acd930b5 taskq delay/cancel functionality
Add the ability to dispatch a delayed task to a taskq.  The desired
behavior is for the task to be queued but not executed by a worker
thread until the expiration time is reached.  To achieve this two
new functions were added.

* taskq_dispatch_delay() -

  This function behaves exactly like taskq_dispatch() however it
takes a third 'expire_time' argument.  The caller should pass the
desired time the task should be executed as an absolute value in
jiffies.  The task is guaranteed not to run before this time, but it
may run slightly later if all the worker threads are busy.

* taskq_cancel_id() -

  Given a task id attempt to cancel the task before it gets executed.
This is primarily useful for canceling delay tasks but can be used for
canceling any previously dispatched task.  There are three possible
return values.

  0      - The task was found and canceled before it was executed.
  ENOENT - The task was not found, either it was already run or an
           invalid task id was supplied by the caller.
  EBUSY  - The task is currently executing and may not be canceled.
           This function will block until the task has been completed.

* taskq_wait_all() -

  The taskq_wait_id() function was renamed taskq_wait_all() to more
clearly reflect its actual behavior.  It is only currently used by
the splat taskq regression tests.

* taskq_wait_id() -

  Historically, the only difference between this function and
taskq_wait() was that you passed the task id.  In both functions you
would block until ALL lower task ids had executed.  This was
semantically correct but could be very slow particularly if there
were delay tasks submitted.

  To better accommodate the delay tasks this function was reimplemented.
It will now only block until the passed task id has been completed.

This is actually a fairly low risk change for a few reasons.

* Only new ZFS callers will make use of the new interfaces and
  very little common code was changed to support the new functions.

* The existing taskq_wait() implementation was not changed just
  slightly refactored.

* The newly optimized taskq_wait_id() implementation was never
  used by ZFS so we can't accidentally introduce a new bug there.

NOTE: This functionality does not exist in the Illumos taskqs.
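
The dispatch and cancel semantics listed above can be modeled with a
small single-threaded simulation.  This is a hedged Python sketch, not
the SPL implementation: ToyTaskq, run_until(), and the manual clock
are hypothetical, and the real taskq_cancel_id() additionally blocks
until a running task completes before returning EBUSY.

```python
import errno
import heapq
import itertools


class ToyTaskq:
    """Single-threaded model of delayed dispatch and cancel-by-id."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._heap = []       # (expire_time, task_id), ordered by expiry
        self._tasks = {}      # task_id -> (func, arg) for tasks not yet run
        self.running = None   # id of the task currently executing, if any

    def dispatch_delay(self, func, arg, expire_time):
        task_id = next(self._ids)
        self._tasks[task_id] = (func, arg)
        heapq.heappush(self._heap, (expire_time, task_id))
        return task_id

    def dispatch(self, func, arg):
        # An immediate task is modeled as a delay that has already expired.
        return self.dispatch_delay(func, arg, 0)

    def cancel_id(self, task_id):
        if task_id == self.running:
            return errno.EBUSY    # currently executing, cannot be canceled
        if self._tasks.pop(task_id, None) is None:
            return errno.ENOENT   # already ran, or never existed
        return 0                  # found and canceled before execution

    def run_until(self, now):
        """Run every task whose expire_time <= now, in expiry order."""
        while self._heap and self._heap[0][0] <= now:
            _, task_id = heapq.heappop(self._heap)
            entry = self._tasks.pop(task_id, None)
            if entry is None:
                continue          # canceled earlier, skip it
            func, arg = entry
            self.running = task_id
            try:
                func(arg)
            finally:
                self.running = None
```

A task is never run before its expiry because run_until() only pops
heap entries whose expire_time has been reached.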

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:54:07 -08:00
Brian Behlendorf aed8671cb0 taskq style, remove #define wrappers
When the taskq implementation was originally written I wrapped all
the API functions in #define's.  This was done as a preventative
measure to ensure that a taskq symbol never conflicted with an
existing kernel symbol.

However, in practice the taskq symbols never conflicted.  The only
major conflicts occurred with the kmem cache API.  Since this added
layer of obfuscation never bought us anything for the taskq's I'm
removing it.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:54:07 -08:00
Brian Behlendorf 472a34caff taskq style, convert spaces to soft tabs
Update the taskq implementation to conform with the style used
throughout the rest of the code.  There are no functional
changes in this commit.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:54:07 -08:00
Steven Johnson 794f145bf9 splat linux:shrinker: Fix fail-safe
Ensure the fail-safe is reset between successive tests.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-12 09:04:29 -08:00
Steven Johnson ca072ee70f splat linux:shrinker: Fix race condition
Ensure the test thread blocks until the shrinker has completed its
work.  This is done by putting the test thread to sleep and waking
it each time the shrinker callback runs.  Once the shrinker size
drops to zero or we time out the test is allowed to proceed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #96
Closes #125
Closes #182
2012-12-12 09:04:11 -08:00
Brian Behlendorf 576ec6aac4 splat command verbose behavior
The splat command takes a verbose option which when set prints
the internal debug log for every test.  This is helpful when
tracking down a common failure, but for a rare failure the
volume of log data is distracting.

Therefore, the verbose option has been adjusted to allow only
printing the debug log on failure.  The legacy behavior is still
available by specifying the verbose option twice.  For example:

$ splat -t all:all     # Never print the debug log
$ splat -v -t all:all  # Only print debug log on failure
$ splat -vv -t all:all # Always print the debug log
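
The three verbosity levels above reduce to a small decision function.
The following Python sketch is illustrative only; the function name
and arguments are hypothetical and are not taken from the splat
source.

```python
def should_print_debug_log(verbosity, test_failed):
    """Decide whether to print the debug log for one test.

    verbosity is the number of times -v was given on the command line.
    """
    if verbosity >= 2:
        return True          # -vv: legacy behavior, always print
    if verbosity == 1:
        return test_failed   # -v: print only when the test failed
    return False             # no -v: never print
```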

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-11 15:08:19 -08:00
Steven Johnson 9b88fa165f splat taskq:front: Fix race
The taskq:front test has a race condition where task 4 and 8
race to complete, due to an incorrectly calculated set of delay
"factors" (T). If task 4 wins and actually finishes first, the
verification of the order of completion will fail.

The delays calculated to order task completion do not take into
account the terminal line in the table, and so are all off by
a factor of 1. This causes all the tasks in all queues to finish
sooner than expected and the accumulated error is the root cause
of tasks 4 and 8 racing to complete first. Before the change the
"actual" table looks like I commented in #130.

I changed:

* the table in the comment to correctly reflect the test and the
  factor timings needed.
* the individual task delay factors of T so that ONLY 1 task will
  complete every 2T (on average).
* 1T was reduced from 100ms to 50ms. This halves the duration of
  the test and makes any remaining raciness more likely to cause
  failures, but it did not cause the test to fail.
* simplified the delay factor logic by using a table look-up
  instead of a switch.
* Added a "task started" message so that with -v it is possible
  to see the order tasks are started.
* Moved the "task completed" message inside the spinlock so that
  with -v the message truly reflects the absolute order of
  completion as guaranteed by the spinlock.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #130
2012-12-05 12:23:40 -08:00
Brian Behlendorf 053678f3b0 Handle errors from spl_kern_path_locked()
When the Linux 3.6 KERN_PATH_LOCKED compatibility code was added
by commit bcb1589 an entirely new vn_remove() implementation was
added.  That function did not properly handle an error from
spl_kern_path_locked() which would result in a panic.  This
patch addresses the issue by returning the error to the caller.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #187
2012-12-03 12:06:25 -08:00
Brian Behlendorf b84412a6e8 Linux compat 3.7, kernel_thread()
The preferred kernel interface for creating threads has been
kthread_create() for a long time now.  However, several of the
SPLAT tests still use the legacy kernel_thread() function which
has finally been dropped (mostly).

Update the condvar and rwlock SPLAT tests to use the modern
interface.  Frankly this is something we should have done a
long time ago.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #194
2012-12-03 09:36:21 -08:00
Brian Behlendorf 251677e98f Verify --with-linux source directory exists
Previously this check was only performed when ./configure was
attempting to autodetect your kernel source directory.  But we
should also handle the case where --with-linux was provided
and is obviously wrong.  This way we catch the error before
invoking make and compiling the source with incorrect
autoconf results.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #162
2012-11-29 15:05:54 -08:00
Brian Behlendorf 043f9b5724 Disable FS reclaim when allocating new slabs
Allowing the spl_cache_grow_work() function to reclaim inodes
allows for two unlikely deadlocks.  Therefore, we clear __GFP_FS
for these allocations.  The two deadlocks are:

* While holding the ZFS_OBJ_HOLD_ENTER(zsb, obj1) lock a function
  calls kmem_cache_alloc() which happens to need to allocate a
  new slab.  To allocate the new slab we enter FS level reclaim
  and attempt to evict several inodes.  To evict these inodes we
  need to take the ZFS_OBJ_HOLD_ENTER(zsb, obj2) lock and it
  just happens that obj1 and obj2 use the same hashed lock.

* Similar to the first case however instead of getting blocked
  on the hash lock we block in txg_wait_open() which is waiting
  for the next txg which isn't coming because the txg_sync
  thread is blocked in kmem_cache_alloc().

Note this isn't a 100% fix because vmalloc() won't strictly
honor __GFP_FS.  However, in practice this is sufficient because
several very unlikely things must all occur concurrently.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#1101
2012-11-27 13:43:27 -08:00
Brian Behlendorf e71a4534b3 SPL 0.6.0-rc12 2012-11-13 14:28:25 -08:00
Brian Behlendorf 366346c565 Merge branch 'kmem-cache-optimization'
This branch contains kmem cache optimizations designed to resolve
the lockups reported in zfsonlinux/zfs#922.  The lockups were
largely the result of spin lock contention in the slab under low
memory conditions.  Fundamentally, these changes are all designed
to minimize that contention through a variety of methods.

  * Improved vmem cached deadlock detection
  * Track emergency objects in rbtree
  * Optimize spl_kmem_cache_free()
  * Never spin in kmem_cache_alloc()

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
zfsonlinux/zfs#922
2012-11-08 11:09:17 -08:00
Brian Behlendorf dc1b30224f Never spin in kmem_cache_alloc()
If we are reaping from the cache and a concurrent allocation
occurs then the caller must block until the reaping is complete.
This is signaled by the clearing of the KMC_BIT_REAPING bit.

Otherwise the caller will be in a tight loop which takes and
releases the skc->skc_cache lock.  When there are multiple
concurrent callers the system will thrash on the lock and
appear to lock up.
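
The difference between spinning and blocking can be illustrated in
user space.  In this hedged Python sketch a threading.Event stands in
for waiting on the KMC_BIT_REAPING bit to clear; ToyCache and its
methods are hypothetical names, not SPL interfaces.

```python
import threading


class ToyCache:
    """Allocations sleep while a reap is in progress instead of spinning."""

    def __init__(self):
        self._not_reaping = threading.Event()
        self._not_reaping.set()          # no reap in progress initially

    def begin_reap(self):
        self._not_reaping.clear()

    def end_reap(self):
        self._not_reaping.set()          # wakes every blocked allocator

    def alloc(self, timeout=None):
        # Sleep until the reap finishes rather than retrying in a loop
        # that repeatedly takes and drops the cache lock.
        if not self._not_reaping.wait(timeout):
            return None
        return object()
```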

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 15:48:39 -08:00
Brian Behlendorf a1af8fb1ea Optimize spl_kmem_cache_free()
Only virtual slabs may have emergency objects, and these
objects are guaranteed to have physical addresses, so it can be
easily determined if the passed object is a virtual slab object
or an emergency object.  This allows us to completely optimize
the emergency object free case out of the common free path.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 14:54:19 -08:00
Brian Behlendorf ed3163484d Track emergency object in rbtree
In the initial implementation emergency objects were tracked on a
per-cache list.  The assumption was that under normal operation we
would never allocate more than a handful of these objects.  So the
cost of walking the list during free was expected to be negligible.

However real world usage has shown that emergency objects tend to
be allocated in batches.  A deadlock will be detected and several
thousand emergency objects will be allocated before the original
blocked slab allocation can complete.

Therefore the original list has been replaced by a red black tree
which is sorted by the memory address of each allocated object.
This bounds the worst case insertion and removal time to O(log n)
which minimizes contention on the associated spin lock.
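
The idea of indexing emergency objects by address can be sketched as
follows.  Python has no built-in red-black tree, so this example
swaps in a sorted list searched with bisect: lookups are O(log n) as
in the tree, but list insertion and deletion still shift elements and
are O(n), so it only illustrates the ordering, not the tree's bounds.
EmergencyIndex is a hypothetical name.

```python
import bisect


class EmergencyIndex:
    """Track emergency object addresses in sorted order."""

    def __init__(self):
        self._addrs = []   # sorted list of object addresses

    def insert(self, addr):
        i = bisect.bisect_left(self._addrs, addr)
        if i < len(self._addrs) and self._addrs[i] == addr:
            raise ValueError("address already tracked")
        self._addrs.insert(i, addr)

    def remove(self, addr):
        """Return True if addr was tracked and has now been removed."""
        i = bisect.bisect_left(self._addrs, addr)
        if i < len(self._addrs) and self._addrs[i] == addr:
            del self._addrs[i]
            return True
        return False
```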

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 14:54:19 -08:00
Brian Behlendorf 165f13c33a Improved vmem cached deadlock detection
The entire goal of performing the slab allocations asynchronously
is to be able to detect when a vmalloc() deadlocks.  In this case,
and only this case, do we want to start allocating emergency objects.
The trick here is to minimize false positives because the overhead
of tracking emergency objects is far higher than normal slab objects.

With that goal in mind the code was reworked to be less sensitive
to slow allocations by increasing the wait time.  Once a cache is
marked deadlocked all subsequent allocations which can not be
satisfied with existing cache objects will immediately allocate new
emergency objects.  This behavior persists until the asynchronous
allocation completes and clears the deadlocked flag.

The result of these tweaks is that far fewer emergency objects
get created which is important because this minimizes the cost of
releasing them later in kmem_cache_free().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 14:54:15 -08:00
Brian Behlendorf 65c2fc5a2e Merge branch 'splat'
Additional debugging, some cleanup, and an assortment of fixes
to the SPLAT tests and infrastructure.  Full details in the
individual patches.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 14:49:14 -08:00
Brian Behlendorf 1112486356 splat kmem:slab_overcommit: Disabled
Disable this test because it may result in an OOM event on the
system which can result in the test infrastructure being killed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 14:48:57 -08:00
Brian Behlendorf b8296bf3e6 splat atomic:64-bit: Create thread outside spin lock
The Fedora 3.6 debug kernel identified the following issue where
we create a thread under a spin lock.  This isn't safe because
sleeping could result in a deadlock.  Therefore the lock is changed
to a mutex so it's safe to sleep.

  BUG: sleeping function called from invalid context at mm/slub.c:930
  in_atomic(): 1, irqs_disabled(): 0, pid: 10583, name: splat
  1 lock held by splat/10583:

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 14:48:57 -08:00
Brian Behlendorf 0e149d4204 splat: Fix log buffer locking
The Fedora 3.6 debug kernel identified the following issue where
we call copy_to_user() under a spin lock.  This used to be safe
in older kernels but no longer appears to be true so the spin
lock was changed to a mutex.  None of this code is performance
critical so allowing the process to sleep is harmless.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 14:48:56 -08:00
Brian Behlendorf df870a697f splat: Cleanup headers
Restructure the SPLAT headers such that each test only
includes the minimal set of headers it requires.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 14:48:56 -08:00
Brian Behlendorf d2733258d0 Condition variable reference counts
Reference count every entry and exit from the condition variable
functions: cv_wait(), cv_wait_timeout(), cv_signal(), cv_broadcast().

This allows us to safely block in cv_destroy() until all consumers
have been scheduled and are no longer accessing the condition
variable memory.

In addition poison the magic value at the start of cv_destroy() to
ensure there are never any new callers after cv_destroy() is called.
The consumer is responsible for ensuring this never occurs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 14:48:55 -08:00
Brian Behlendorf 87efc30b27 Merge remote branch 'eris/stats'
Bring in support for the new KSTAT_TYPE_TXG type.  This allows for
additional visibility into the txg handling.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-06 14:48:43 -08:00
Brian Behlendorf dba79fcbf2 Add KSTAT_TYPE_TXG type
Add a new kstat type for tracking useful statistics about a TXG.
The new KSTAT_TYPE_TXG type can be used to track the following
statistics per-txg.

  txg          - Unique txg number
  state        - State (O)pen/(Q)uiescing/(S)yncing/(C)ommitted
  birth        - Creation time
  nread        - Bytes read
  nwritten     - Bytes written
  reads        - IOPs read
  writes       - IOPs write
  open_time    - Length in nanoseconds the txg was open
  quiesce_time - Length in nanoseconds the txg was quiescing
  sync_time    - Length in nanoseconds the txg was syncing

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-11-02 15:17:40 -07:00
Brian Behlendorf 71c9f0b003 Make kstat.ks_update() callback atomic
Move the kstat ks_update() callback under the ks_lock.  This
enables dynamically sized kstats without modification to the
kstat API.

  * Create a kstat with the KSTAT_FLAG_VIRTUAL flag.
  * Register a ->ks_update() callback which does:
    o Frees any existing ks_data buffer.
    o Set ks_data_size to the kstat array size.
    o Set ks_data to an allocated buffer of size ks_data_size
    o Populate the array of buffers with the required data.

The buffer allocated in the ks_update() callback is guaranteed
to remain allocated and valid while the proc sequence handler
iterates over the buffer.  The lock will not be dropped until
kstat_seq_stop() function is run making it safe for concurrent
access.  To allow the ks_update() callback to perform memory
allocations the lock was changed to a mutex.
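
The pattern above, where the update callback resizes the data buffer
while the reader holds the lock, can be sketched in user space.  This
is a hedged Python model: ToyKstat and make_update() are hypothetical,
and a threading.Lock stands in for the ks_lock mutex.

```python
import threading


class ToyKstat:
    """Reader refreshes and walks ks_data under a single lock hold."""

    def __init__(self, ks_update):
        self.ks_lock = threading.Lock()
        self.ks_update = ks_update
        self.ks_data = []
        self.ks_data_size = 0

    def read(self):
        # The lock is held across both the update and the iteration, so
        # the buffer built by ks_update() cannot change underneath us.
        with self.ks_lock:
            self.ks_update(self)
            return [str(entry) for entry in self.ks_data[:self.ks_data_size]]


def make_update(source):
    """Build a ks_update callback that snapshots a growing source list."""
    def update(ks):
        ks.ks_data = list(source)          # replaces the previous buffer
        ks.ks_data_size = len(ks.ks_data)  # size follows the new buffer
    return update
```

Because the size is recomputed on every read, the number of entries
may differ from one read to the next without any API change.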

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-10-23 09:36:19 -07:00
Brian Behlendorf 1e0c2c2ccf Linux 3.7 compat, __clear_close_on_exec() removed
Commit torvalds/linux@b8318b0 moved the __clear_close_on_exec()
function out of include/linux/fdtable.h and into fs/file.c
making it unavailable to the SPL.

Now as it turns out we only used this function to tear down
some test infrastructure for the vn_getf()/vn_releasef() SPLAT
regression tests.  Rather than implement even more autoconf
compatibility code to handle this we just remove the test case.
This also allows us to drop three existing autoconf tests.

This does mean the SPLAT tests will no longer verify these
functions but historically they have never been a problem.
And if we feel we absolutely need this test coverage I'm
sure a more portable version of the test case could be added.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #183
2012-10-18 13:36:44 -07:00
Yuxuan Shui bcb15891ab Linux 3.6 compat, kern_path_locked() added
The kern_path_parent() function was removed from Linux 3.6 because
it was observed that all the callers just want the parent dentry.
The simpler kern_path_locked() function replaces kern_path_parent()
and does the lookup while holding the ->i_mutex lock.

This is good news for the vn implementation because it removes the
need for us to handle the locking.  However, it makes it harder to
implement a single readable vn_remove()/vn_rename() function which
is usually what we prefer.

Therefore, we implement a new version of vn_remove()/vn_rename()
for Linux 3.6 and newer kernels.  This allows us to leave the
existing working implementation untouched, and to add a simpler
version for newer kernels.

Long term I would very much like to see all of the vn code removed
since what this code enabled is generally frowned upon in the kernel.
But that can't happen until we either abandon the zpool.cache file
or implement alternate infrastructure to update it correctly in
user space.

Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #154
2012-10-14 16:26:21 -07:00
Massimo Maggi dea3505dff Switch KM_SLEEP to KM_PUSHPAGE
In this particular instance the allocation occurred in the context
of sys_msync()->...->zpl_putpage() where we must be careful not to
initiate additional I/O.

Signed-off-by: Massimo Maggi <massimo@mmmm.it>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-10-11 16:22:29 -07:00
Etienne Dechamps bbdc6ae495 Add interface for file hole punching.
This adds an interface to "punch holes" (deallocate space) in VFS
files. The interface is identical to the Solaris VOP_SPACE interface.
This interface is necessary for TRIM support on file vdevs.

This is implemented using Linux fallocate(FALLOC_FL_PUNCH_HOLE), which
was introduced in 2.6.38. For a brief time before 2.6.38 this was done
using the truncate_range inode operation, which was quickly deprecated.
This patch only supports FALLOC_FL_PUNCH_HOLE.

This adds support for the truncate_range() inode operation to
VOP_SPACE() for file hole punching. This API is deprecated and removed
in 3.5, so it's only useful for old kernels.

On tmpfs, the truncate_range() inode operation translates to
shmem_truncate_range(). Unfortunately, this function expects the end
offset to be inclusive and aligned to the end of a page. If it is not,
the kernel will stop with a BUG_ON().

This patch fixes the issue by adapting to the constraints set forth by
shmem_truncate_range().
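
One way to satisfy the constraint described above is to round the
requested end down to a page boundary and convert it to an inclusive
offset.  The Python sketch below is an illustration under assumptions:
the 4096-byte page size and the truncate_range_args() helper are
hypothetical, and the actual patch may adapt the range differently.

```python
PAGE_SIZE = 4096  # assumed page size for the example


def truncate_range_args(start, length, page_size=PAGE_SIZE):
    """Return (start, inclusive_end) suitable for a page-aligned punch.

    Returns None when no range ending on a page boundary remains.
    """
    if length <= 0:
        return None
    end = start + length                # exclusive end requested by caller
    if end % page_size != 0:
        end -= end % page_size          # round down to a page boundary
        if end <= start:
            return None                 # nothing left to punch
    return (start, end - 1)             # inclusive end: last byte of a page
```

Rounding down means a trailing partial page is simply left allocated,
which is safe because hole punching is only a space optimization.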

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #168
2012-10-04 16:22:07 -07:00
Brian Behlendorf a6c6839a88 SPL 0.6.0-rc11 2012-09-18 11:28:57 -07:00
Brian Behlendorf 3050c9314f Switch KM_SLEEP to KM_PUSHPAGE
Under certain circumstances the following functions may be called
in a context where KM_SLEEP is unsafe and can result in a deadlocked
system.  To avoid this problem the unconditional KM_SLEEPs are
converted to KM_PUSHPAGEs.  This will prevent them from attempting
to initiate any I/O during direct reclaim.

This change was originally part of cd5ca4b but was reverted by
330fe01.  It always should have had its own commit for exactly
this reason.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-09-12 12:27:09 -07:00
Brian Behlendorf 9b51f21841 Remove TQ_SLEEP -> KM_SLEEP mapping
When the taskq code was originally written it seemed like a good
idea to simply map TQ_SLEEP to KM_SLEEP.  Unfortunately, this
assumed that the TQ_* flags would never conflict with any of the
Linux GFP_* flags.  When adding the TQ_PUSHPAGE support in commit
cd5ca4b this invariant was accidentally broken.

Therefore to support TQ_PUSHPAGE, which is needed for Linux, and
prevent any further confusion I have removed this direct mapping.
The TQ_SLEEP, TQ_NOSLEEP, and TQ_PUSHPAGE are no longer defined
in terms of their KM_* counterparts.  Instead a simple mapping
function is introduced to convert TQ_* -> KM_* where needed.
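
A mapping function of the kind described above might look like the
following.  This Python sketch is illustrative: the flag values are
hypothetical placeholders rather than the definitions in the SPL
headers, and tq_flags_to_km() is a made-up name.

```python
# Hypothetical flag values; the real definitions live in the SPL headers.
TQ_SLEEP = 0x1
TQ_NOSLEEP = 0x2
TQ_PUSHPAGE = 0x4

# KM_* flags are kept as opaque names so they cannot alias the TQ_* bits.
KM_SLEEP = "KM_SLEEP"
KM_NOSLEEP = "KM_NOSLEEP"
KM_PUSHPAGE = "KM_PUSHPAGE"


def tq_flags_to_km(tq_flags):
    """Translate taskq dispatch flags into a memory allocation flag."""
    if tq_flags & TQ_NOSLEEP:
        return KM_NOSLEEP
    if tq_flags & TQ_PUSHPAGE:
        return KM_PUSHPAGE
    return KM_SLEEP   # default when neither special bit is set
```

Keeping the two flag namespaces separate means a new TQ_* bit can be
added without checking it against every KM_* or GFP_* value.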

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #171
2012-09-12 11:41:42 -07:00
Brian Behlendorf 330fe010e4 Revert "Switch KM_SLEEP to KM_PUSHPAGE"
This reverts commit cd5ca4b2f8
due to conflicts in the higher TQ_ bits which caused incorrect
behavior.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-09-12 10:07:48 -07:00
Chris Dunlop dd87332f47 Remove autotools products
spl_config.h.in is a generated file: remove and .gitignore it

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-09-11 10:12:47 -07:00
Brian Behlendorf 3c60f5054c Debug cv_destroy() with mutex held
There still appears to be a race in the condition variables where
->cv_mutex is set after we are woken from the cv_destroy wait queue.
This might be possible when cv_destroy() is called immediately after
cv_broadcast().  We had some troubles with this previously but
there may still be a small race, see commit d599e4f.

The following patch closes one small race and improves the ASSERTs
such that they log the offending value.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
zfsonlinux/zfs#943
2012-09-10 10:23:26 -07:00
Brian Behlendorf 95331f4437 Set KMC_NOEMERGENCY for zlib workspaces
The workspace required by zlib to perform compression is roughly
512KB (order-7).  These allocations are so large that we should
never attempt to directly kmalloc an emergency object for them.

It is far preferable to asynchronously vmalloc an additional slab
in case it's needed.  Then simply block waiting for an existing
object to be released or for the new slab to be allocated.

This can be accomplished by disabling emergency slab objects by
passing the KMC_NOEMERGENCY flag at slab creation time.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
zfsonlinux/zfs#917
2012-09-07 14:36:26 -07:00
Brian Behlendorf cb5c2acebb Add KMC_NOEMERGENCY slab flag
Provide a flag to disable the use of emergency objects for a
specific kmem cache.  There may be instances where under no
circumstances should you kmalloc() an emergency object.  For
example, when your cache contains very large objects (>128k).

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-09-07 14:27:03 -07:00
Etienne Dechamps ac8ca67a88 Add DKIOCTRIM for TRIM support.
See dechamps/zfs@cc6cd40ad7 for details.

This harmless addition was merged to simplify testing the ZFS TRIM
support patches.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #167
2012-09-02 14:22:01 -07:00
Brian Behlendorf 46b3945d5d Suppress task_hash_table_init() large allocation warning
When various kernel debugging options are enabled this allocation
may be larger than usual as shown by the following warning.  It
is in no way harmful so we suppress the warning.

  SPL: large kmem_alloc(40960, 0x80d0) at
  tsd_hash_table_init:358 (76495/76495)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #93
2012-08-30 21:02:52 -07:00
Brian Behlendorf efcd0ca32d Enhance SPLAT kmem:slab_overcommit test
After the emergency slab objects were merged I started observing
timeout failures in the kmem:slab_overcommit test.  These were
due to the inefficient way the slab_overcommit reclaim function
was implemented, and to the additional cost of potentially
allocating tens of thousands of emergency objects and tracking
them on a single list.

This patch addresses the first concern by enhancing the test
case to track all of the allocated objects as a linked list.
This allows for a cleaner version of the reclaim function to
simply release SPLAT_KMEM_OBJ_RECLAIM objects.

Since this touches some common code all the tests which share
these data structures were also updated.  After making these
changes slab_overcommit is reliably passing.  However, there
is certainly additional cleanup which could be done here.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-30 15:49:00 -07:00
Brian Behlendorf cd5ca4b2f8 Switch KM_SLEEP to KM_PUSHPAGE
Under certain circumstances the following functions may be called
in a context where KM_SLEEP is unsafe and can result in a deadlocked
system.  To avoid this problem the unconditional KM_SLEEPs are
converted to KM_PUSHPAGEs.  This will prevent them from attempting
to initiate any I/O during direct reclaim.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:55 -07:00
Brian Behlendorf 3e904f40b4 Mutex ASSERT on self deadlock
Generate an assertion if we're going to deadlock the system by
attempting to acquire a mutex the process is already holding.

There are currently no known instances of this under normal
operation, but it _might_ be possible when using a ZVOL as a
swap device.  I want to ensure we catch this immediately if it
were to occur.
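
The check can be sketched in userspace as below.  This is a
single-threaded model with hypothetical names, not the SPL mutex
implementation; it only shows the owner comparison that lets the
assertion fire before the caller blocks on itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, single-threaded model of an owner-tracking mutex. */
struct model_mutex {
	const void *owner;	/* NULL when unlocked */
};

/* Returns 1 if 'task' acquiring 'mp' would deadlock against itself. */
static int model_mutex_would_self_deadlock(const struct model_mutex *mp,
    const void *task)
{
	return mp->owner != NULL && mp->owner == task;
}

static void model_mutex_enter(struct model_mutex *mp, const void *task)
{
	/* Catch re-acquisition by the current holder immediately. */
	assert(!model_mutex_would_self_deadlock(mp, task));
	/* A real mutex would block here until the lock is free. */
	mp->owner = task;
}

static void model_mutex_exit(struct model_mutex *mp, const void *task)
{
	assert(mp->owner == task);
	mp->owner = NULL;
}
```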

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:55 -07:00
Brian Behlendorf eb0f407a2b Add PF_NOFS debugging flag
PF_NOFS is a per-process debug flag which is set in current->flags to
detect when a process is performing an unsafe allocation.  All tasks
with PF_NOFS set must strictly use KM_PUSHPAGE for allocations because
if they enter direct reclaim and initiate I/O they may deadlock.

When debugging is disabled, any incorrect usage will be detected and
a call stack with a warning will be printed to the console.  The flags
will then be automatically corrected to allow for safe execution.  If
debugging is enabled this will be treated as a fatal condition.

To avoid any risk of conflicting with the existing PF_ flags, the
PF_NOFS bit shadows the rarely used PF_MUTEX_TESTER bit.  Only when
CONFIG_RT_MUTEX_TESTER is not set, and we know this bit is unused,
will the PF_NOFS bit be valid.  Happily, most existing distributions
ship a kernel with CONFIG_RT_MUTEX_TESTER disabled.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:55 -07:00
Brian Behlendorf 500e95c884 Revert "Disable vmalloc() direct reclaim"
This reverts commit 2092cf68d8.  The
use of the PF_MEMALLOC flag was always a hack to work around memory
reclaim deadlocks.  Those issues are believed to be resolved so this
workaround can be safely reverted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:55 -07:00
Brian Behlendorf 617f79de6a Revert "Fix NULL deref in balance_pgdat()"
This reverts commit b8b6e4c453.  The
use of the PF_MEMALLOC flag was always a hack to work around memory
reclaim deadlocks.  Those issues are believed to be resolved so this
workaround can be safely reverted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:55 -07:00
Brian Behlendorf bc03e07a7c Revert "Detect kernels that honor gfp flags passed to vmalloc()"
This reverts commit 36811b4430,
which is no longer required because there is now SPL code in
place to safely handle the deadlocks the kernel patch was designed
to address.  Therefore we can unconditionally use vmalloc() and
drop all the PF_MEMALLOC code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:55 -07:00
Brian Behlendorf d47e664ad4 Revert "Add TASKQ_NORECLAIM flag"
This reverts commit 372c257233.  The
use of the PF_MEMALLOC flag was always a hack to work around memory
reclaim deadlocks.  Those issues are believed to be resolved so this
workaround can be safely reverted.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:42 -07:00
Brian Behlendorf e2dcc6e2b8 Emergency slab objects
This patch is designed to resolve a deadlock which can occur with
__vmalloc() based slabs.  The issue is that the Linux kernel does
not honor the flags passed to __vmalloc().  This makes it unsafe
to use in a writeback context.  Unfortunately, this is a use case
ZFS depends on for correct operation.

Fixing this issue in the upstream kernel was pursued and patches
are available which resolve the issue.

  https://bugs.gentoo.org/show_bug.cgi?id=416685

However, these changes were rejected because upstream felt that
using __vmalloc() in the context of writeback should never be done.
Their solution was for us to rewrite parts of ZFS to accommodate
the Linux VM.

While that is probably the right long term solution, and it is
something we want to pursue, it is not a trivial task and will
likely destabilize the existing code.  This work has been planned
for the 0.7.0 release but in the meanwhile we want to improve the
SPL slab implementation to accommodate this expected ZFS usage.

This is accomplished by performing the __vmalloc() asynchronously
in the context of a work queue.  This doesn't eliminate the possibility
of the worker thread deadlocking.  However, the caller can now
safely block on a wait queue for the slab allocation to complete.

Normally this will occur in a reasonable amount of time and the
caller will be woken up when the new slab is available.  The objects
will then get cached in the per-cpu magazines and everything will
proceed as usual.

However, if the __vmalloc() deadlocks for the reasons described
above, or is just very slow, then the callers on the wait queues
will time out.  When this rare situation occurs they will attempt
to kmalloc() a single minimally sized object using the GFP_NOIO flags.
This allocation will not deadlock because kmalloc() will honor the
passed flags and the caller will be able to make forward progress.

As long as forward progress can be maintained then even if the
worker thread is deadlocked the critical thread will make progress.
This will eventually allow the deadlocked worker thread to complete
and normal operation will resume.
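
The fallback path can be modelled in userspace roughly as follows.
This is a sketch under stated assumptions, not the SPL code: all names
are hypothetical, the asynchronous slab allocation is reduced to a
callback that either returns an object or reports a timeout with NULL,
and malloc() stands in for the kmalloc() call.  The two counters mirror
the 'emerg' and 'max' columns described further down in this message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical per-cache state with the two emergency counters. */
struct model_cache {
	size_t obj_size;
	size_t emerg;		/* emergency objects currently in use */
	size_t emerg_max;	/* historical worst case */
};

/*
 * Stand-in for waiting on the asynchronous slab allocation.  Returns a
 * slab-backed object, or NULL when the wait timed out.
 */
typedef void *(*slab_wait_fn)(struct model_cache *mc);

static void *model_cache_alloc(struct model_cache *mc, slab_wait_fn wait,
    int *emergency)
{
	void *obj = wait(mc);

	*emergency = 0;
	if (obj != NULL)
		return obj;

	/* Timed out: fall back to a single minimally sized object. */
	obj = malloc(mc->obj_size);
	if (obj != NULL) {
		*emergency = 1;
		mc->emerg++;
		if (mc->emerg > mc->emerg_max)
			mc->emerg_max = mc->emerg;
	}
	return obj;
}

static void model_cache_free_emergency(struct model_cache *mc, void *obj)
{
	assert(mc->emerg > 0);
	free(obj);
	mc->emerg--;
}

/* Two example wait behaviours for exercising the model. */
static void *wait_succeeds(struct model_cache *mc)
{
	return malloc(mc->obj_size);
}

static void *wait_times_out(struct model_cache *mc)
{
	(void) mc;
	return NULL;
}
```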

These emergency allocations will likely be slow since they require
contiguous pages.  However, their use should be rare so the impact
is expected to be minimal.  If that turns out not to be the case in
practice further optimizations are possible.

One additional concern is if these emergency objects are long lived.
Right now they are simply tracked on a list which must be walked when
an object is freed.  If they accumulate on a system and the list
grows, freeing objects will become more expensive.  This could be
handled relatively easily by using a hash instead of a list, but that
optimization (if needed) is left for a follow up patch.

Additionally, these emergency objects could be repacked into existing
slabs as objects are freed if the kmem_cache_set_move() functionality
was implemented.  See issue https://github.com/zfsonlinux/spl/issues/26
for full details.  This work would also help reduce ZFS's memory
fragmentation problems.

The /proc/spl/kmem/slab file has had two new columns added at the
end.  The 'emerg' column reports the current number of these emergency
objects in use for the cache, and the following 'max' column shows
the historical worst case.  These values should give us a good idea
of how often these objects are needed.  Based on these values under
real use cases we can tune the default behavior.

Lastly, as a side benefit using a single work queue for the slab
allocations should reduce cpu contention on the global virtual address
space lock.   This should manifest itself as reduced cpu usage for
the system.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 12:00:42 -07:00
Prakash Surya 587045a638 Remove SPL_LINUX_CONFIG autoconf macro
Since removing the check for CONFIG_PREEMPT, there are no consumers of
the SPL_LINUX_CONFIG macro. As such, there is no reason to keep it
around.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #164
2012-08-27 11:58:37 -07:00
Prakash Surya e3a4360702 Revert "Make CONFIG_PREEMPT Fatal"
This reverts commit 7731d46b69.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-27 11:52:53 -07:00
Brian Behlendorf c638e9ad04 Remove autotools products
Remove all of the generated autotools products from the repository
and update the .gitignore files accordingly.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#718
2012-08-27 11:46:23 -07:00
Prakash Surya 45324c7c41 Add kpreempt_[dis|en]able macros in <sys/disp.h>
To support preempt enabled kernels in ZFS on Linux, there are a couple
places where the ZFS code needs to disable preemption. This change adds
the Solaris preempt functions and maps them to the equivalent Linux
functions, allowing ZFS to make use of them.
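
As an illustration, the mapping might look like the sketch below.  The
two macro names come from the commit title; the preempt_disable() and
preempt_enable() bodies here are userspace stand-ins that only keep a
counter, and the caller is hypothetical.

```c
#include <assert.h>

/* Userspace stand-ins for the Linux primitives so the sketch compiles. */
static int model_preempt_count;

static void preempt_disable(void)
{
	model_preempt_count++;
}

static void preempt_enable(void)
{
	assert(model_preempt_count > 0);
	model_preempt_count--;
}

/* Solaris-style names mapped onto the Linux calls. */
#define kpreempt_disable()	preempt_disable()
#define kpreempt_enable()	preempt_enable()

/* Hypothetical caller: touch per-cpu state with preemption held off. */
static int model_percpu_section(void)
{
	int depth;

	kpreempt_disable();
	depth = model_preempt_count;	/* per-cpu work would go here */
	kpreempt_enable();
	return depth;
}
```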

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #98
2012-08-24 15:18:38 -07:00
Prakash Surya 08850eddcb Avoid calling smp_processor_id in spl_magazine_age
The spl_magazine_age function had the implied assumption that it will
remain on its current cpu through its execution. In order to support
preempt enabled kernels, this assumption had to be removed.

The spl_kmem_magazine structure now holds the cpu id of the cpu it is
local to. This allows spl_magazine_age to use this field when scheduling
work to be done by the magazine's local cpu.
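
A userspace sketch of the idea, with hypothetical names throughout
(the field name skm_cpu and the per-cpu work counters are assumptions
made for illustration):

```c
#include <assert.h>

#define MODEL_NR_CPUS	4

/* Hypothetical magazine: remembers which cpu it belongs to. */
struct model_magazine {
	int skm_cpu;	/* owning cpu, recorded at creation */
	int skm_aged;	/* times aging work was queued for it */
};

static int model_work_on_cpu[MODEL_NR_CPUS];	/* queued work per cpu */

static void model_magazine_init(struct model_magazine *skm, int cpu)
{
	assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
	skm->skm_cpu = cpu;
	skm->skm_aged = 0;
}

/*
 * Queue the next aging pass on the magazine's own cpu.  The caller's
 * current cpu is deliberately not consulted, so it does not matter if
 * the caller was preempted and migrated.
 */
static void model_magazine_age(struct model_magazine *skm)
{
	model_work_on_cpu[skm->skm_cpu]++;
	skm->skm_aged++;
}
```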

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #98
2012-08-24 09:43:22 -07:00
Richard Yao 15d0411297 Remove Makefile from non-toplevel .gitignore files
When building SPL support into the kernel, ./copy-builtin will copy
non-toplevel .gitignore files. These files list /Makefile, which causes
git-archive to omit ./module/{spl,splat}/Makefile. The absence of these
files results in build failures when SPL is selected. ZFS is unaffected
because it puts Makefile in the toplevel .gitignore, which is not
copied. We fix SPL by emulating that behavior.

Reported-by: Fabio Erculiani <lxnay@gentoo.org>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #152
2012-08-23 12:49:04 -07:00
Prakash Surya 9baf44bc17 Wrap trace_set_debug_header in trace_[get|put]_tcd
To properly support CONFIG_PREEMPT enabled kernels, we must refrain from
using a CPU index when preemption is enabled. As a result, this change
moves the trace_set_debug_header call (which calls smp_processor_id)
within trace_get_tcd and trace_put_tcd (which disable and enable
preemption respectively).

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #160
2012-08-23 10:01:20 -07:00
Brian Behlendorf 039bae18ca Add copy-builtin to EXTRA_DIST
The copy-builtin script was accidentally not being included in
the tarballs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #159
2012-08-23 09:59:40 -07:00
Brian Behlendorf 3a9c241e55 SPL 0.6.0-rc10 2012-08-13 16:34:39 -07:00
Brian Behlendorf 3679829092 Cleanly remove spl-modules-devel headers
Add the /usr/src/spl-<version>-<release>/<kernel> directory to
the spl-modules-devel package.  This ensures that this directory
will be removed when the package is removed.

We do not include the higher level /usr/src/spl-<version>-<release>
directory since there may be builds for multiple kernels.  Instead,
a %postun rmdir is added which attempts to remove this directory.
It will only succeed when the last spl-modules-devel-* package
for this specific release is removed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-08-13 16:34:32 -07:00
Prakash Surya d83d25c2f8 Support building a spl-modules-dkms sub package
This commit adds support for building a spl-modules-dkms sub package
built around Dynamic Kernel Module Support. This is to allow building
packages using the DKMS infrastructure which is intended to ease the
burden of kernel version changes, upgrades, etc.

By default spl-modules-dkms-* sub package will be built as part of
the 'make rpm' target.  Alternately, you can build only the DKMS
module package using the 'make rpm-dkms' target.

Examples:

    # To build packaged binaries as well as the dkms packages
    $ ./configure && make rpm

    # To build only the packaged binary utilities and dkms packages
    $ ./configure && make rpm-utils rpm-dkms

Note: Only the RHEL 5/6, CHAOS 5, and Fedora distributions are
      supported for building the dkms sub package.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#535
2012-08-08 13:49:40 -07:00
Etienne Dechamps 476ff5a4da Handle any invalidate_inodes_check prototype.
In the comments of commit 723aa3b0c2,
mmatuska reported that the test for invalidate_inodes_check() is broken
if invalidate_inodes() takes two arguments.

This patch fixes the issue by resorting to another approach for
detecting invalidate_inodes_check(): it simply checks if
invalidate_inodes is defined as a macro. If it is, then it concludes
that invalidate_inodes_check() is available. This will continue to work
even if the prototype of invalidate_inodes_check() changes over time.
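
The detection boils down to a preprocessor test, sketched here in
userspace.  The stand-in header content and the MODEL_ names are
hypothetical, and the prototype given to invalidate_inodes_check() is
illustrative only; the point is that nothing about the prototype is
examined.

```c
#include <assert.h>

/*
 * Stand-in for a kernel header that provides invalidate_inodes() as a
 * macro wrapping invalidate_inodes_check().
 */
static int invalidate_inodes_check(void *sb, int check)
{
	(void) sb;
	assert(check == 0 || check == 1);
	return check;
}
#define invalidate_inodes(sb)	invalidate_inodes_check(sb, 0)

/* The detection itself: only the macro's existence is examined. */
#ifdef invalidate_inodes
#define MODEL_HAVE_INVALIDATE_INODES_CHECK	1
#else
#define MODEL_HAVE_INVALIDATE_INODES_CHECK	0
#endif

static int model_have_invalidate_inodes_check(void)
{
	return MODEL_HAVE_INVALIDATE_INODES_CHECK;
}
```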

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #148
2012-08-06 11:39:49 -07:00
Richard Yao 6576a1a70d Fix incorrect type in spl_kmem_cache_set_move() parameter
A preprocessor definition renders this harmless. However, it is a good
idea to change this to be consistent.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
2012-08-01 16:35:18 -07:00
Brian Behlendorf 744038069d Merge branch 'builtin-clean'
Support in-tree builtin module building.

These commits add support for compiling the SPL module as a built-in
kernel module by copying the module code into the kernel source tree.
Here's the procedure:

  - Create your kernel configuration (`.config` file) as usual. This
    has to be done first so that SPL's configure script is able to
    detect kernel features correctly.
  - Run `make prepare scripts` inside the kernel source tree.
  - Run `./configure --enable-linux-builtin --with-linux=/usr/src/linux-...`
    inside the SPL directory.
  - Run `./copy-builtin /usr/src/linux-...` inside the SPL directory.
  - In the kernel source tree, enable the `CONFIG_SPL` option
    (e.g. using `make menuconfig`).
  - Build the kernel as usual.

SPL module parameters can be set at boot time using the following syntax
on the kernel command line: `spl.parameter_name=parameter_value`.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#851
2012-07-26 15:31:02 -07:00
Etienne Dechamps a9f2397ee9 Determine the hostid on demand.
Currently, the SPL tries to determine the hostid at module load. The
hostid is usually determined by running the userland program "hostid"
during module initialization.

Unfortunately, when the module initializes, it may be way too soon to be
able to run any userland programs. This is especially true when the
module is compiled directly inside the kernel (built-in); in that case,
the SPL would try to run hostid when the kernel is still initializing,
which of course is doomed to fail.

This patch fixes the issue by deferring hostid generation until
something actually needs the hostid (that is, when zone_get_hostid() is
called), thus switching from an "on-initialization" model to an "on-demand"
(lazy loading) model. ZFS only needs the hostid when some pool
operations are requested, and this always happens way after the kernel
has finished initialization, thus solving the problem.
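
The on-demand pattern can be sketched in userspace as below.  The names
and the example hostid value are hypothetical, the expensive lookup is
reduced to a stub, and the locking a kernel implementation would need
is omitted.

```c
#include <assert.h>

static unsigned long model_hostid;	/* 0 means "not determined yet" */
static int model_hostid_lookups;	/* how often the slow path ran */

/* Stand-in for the expensive lookup (running a userland helper, etc.). */
static unsigned long model_hostid_lookup(void)
{
	model_hostid_lookups++;
	return 0x007f0101UL;	/* arbitrary example value */
}

/* On-demand accessor: the lookup runs on first use, not at load time. */
static unsigned long model_get_hostid(void)
{
	if (model_hostid == 0)
		model_hostid = model_hostid_lookup();
	assert(model_hostid != 0);
	return model_hostid;
}
```

Nothing happens at initialization; the first caller pays for the lookup
and later callers reuse the cached value.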

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#851
2012-07-26 15:14:02 -07:00
Etienne Dechamps c167aadb27 Add script for builtin module building.
This commit introduces a "copy-builtin" script designed to prepare a
kernel source tree for building SPL as a builtin module. The script
makes a full copy of all needed files, thus making the kernel source
tree fully independent of the spl source package.

To achieve that, some compilation flags (-include, -I) have been moved
to module/Makefile. This Makefile is only used when compiling external
modules; when compiling builtin modules, a Kbuild file generated by the
configure-builtin script is used instead. This makes sure Makefiles
inside the kernel source tree do not contain references to the spl
source package.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#851
2012-07-26 15:13:09 -07:00
Etienne Dechamps 723aa3b0c2 When checking for symbol exports, try compiling.
This patch adds a new autoconf function: SPL_LINUX_TRY_COMPILE_SYMBOL.
This new function does the following:

 - Call LINUX_TRY_COMPILE with the specified parameters.
 - If unsuccessful, return false.
 - If successful and we're configuring with --enable-linux-builtin,
   return true.
 - Else, call CHECK_SYMBOL_EXPORT with the specified parameters and
   return the result.
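
The real macro is written in m4, but the decision it makes can be
modelled as a small C function.  The function and parameter names
below are hypothetical; each boolean stands for the outcome of one of
the steps listed above.

```c
#include <stdbool.h>

/*
 * compiles:             result of the LINUX_TRY_COMPILE step
 * enable_linux_builtin: configure was run with --enable-linux-builtin
 * symbol_exported:      result of the CHECK_SYMBOL_EXPORT step
 */
static bool model_try_compile_symbol(bool compiles,
    bool enable_linux_builtin, bool symbol_exported)
{
	if (!compiles)
		return false;
	/* Built-in code links against the kernel, so exports don't matter. */
	if (enable_linux_builtin)
		return true;
	return symbol_exported;
}
```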

All calls to CHECK_SYMBOL_EXPORT are converted to
LINUX_TRY_COMPILE_SYMBOL so that the tests work even when configuring
for builtin on a kernel which doesn't have loadable module support, or
hasn't been built yet.

The only exceptions are:

 - AC_GET_VMALLOC_INFO, because we don't even have a public header to
   include in the test case, but that's okay considering this symbol can
   be ignored just fine.

 - SPL_AC_DEVICE_CREATE, which is legacy API for 2.6.18 kernels.  Since
   kernels this old are no longer supported it should arguably just be
   removed entirely from the build system.

Note that we're also checking for the correct prototype with an actual
call, which was not the case with CHECK_SYMBOL_EXPORT. However, for
"complicated" test cases like with multiple symbol versions (e.g.
vfs_fsync), we stick with the original behavior and only check for the
function's existence.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#851
2012-07-26 15:12:35 -07:00
Etienne Dechamps df7cc5bc71 Fake modpost stage for LINUX_COMPILE.
Currently, when building a test case, we're compiling an entire Linux
module from beginning to end. This includes the MODPOST stage, which
generates a "conftest.mod.c" file with some boilerplate module
declaration code.

This poses a problem when configuring for built-in on kernels which have
loadable module support disabled. In this case conftest.mod.c is
referencing disabled code, resulting in a compilation failure, thus
breaking the tests.

This patch fixes the issue by faking the modpost stage when the
--enable-linux-builtin option is provided.  It does so by forcing the
modpost command to be /bin/true, and using an empty conftest.mod.c file.
The test module still compiles fine, although the result isn't loadable,
but we don't really care at this point.

Note it is important to preserve the modpost stage when building out of
tree.  This allows for the possibility of configure checks to leverage
this phase to identify GPL-only symbols.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#851
2012-07-26 15:12:10 -07:00
Etienne Dechamps 0408008b33 Make configure builtin-aware.
This patch adds a new option to configure: --enable-linux-builtin. When
this option is used, the following happens:

 - Compilation of kernel modules is disabled.

 - A failure to find UTS_RELEASE is followed by a suggestion to run
   "make prepare" on the kernel source tree.

This patch also adds a new test which tries to compile an empty module
as a basic toolchain sanity test. If it fails and the option was
specified, the error is followed by a suggestion to run "make scripts"
on the kernel source tree.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#851
2012-07-26 14:55:20 -07:00
Etienne Dechamps 38b5ff4d07 Fix undefined reference on spl_mutex_spin_max().
Commit 3160d4f56b changed the set of
conditions under which spl_mutex_spin_max would be implemented as a
function by changing an #if in sys/mutex.h. The corresponding
implementation file spl-mutex.c, however, has not been updated to
reflect the change. This results in undefined reference errors on
spl_mutex_spin_max under the following condition:

((!CONFIG_SMP || CONFIG_DEBUG_MUTEXES) && HAVE_MUTEX_OWNER && HAVE_TASK_CURR)

This patch fixes the issue by using the same #if in sys/mutex.h and
spl-mutex.c.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#851
2012-07-26 14:54:53 -07:00
Etienne Dechamps 016432fbeb Don't build packages that haven't been selected.
Currently, when configure --with-config is used, selective compilation
is only effective for the simple "make" case. Package builders (e.g.
make rpm) still build everything (utils and modules). This patch fixes
that.

This patch also drops the duplicate rpm-modules build target.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Issue zfsonlinux/zfs#851
2012-07-26 14:54:32 -07:00
Etienne Dechamps 94aac9c9bc Use MODULE variable in module Makefile like zfs.
In zfs, each module Makefile contains a MODULE variable which contains
the name of the module, and the following declarations reference this
variable.

In spl, there is a MODULES variable which is never used. Rename it to
MODULE and use it like in zfs. This improves consistency between the two
build systems.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#851
2012-07-26 14:53:48 -07:00
Brian Behlendorf e8267acd25 32-bit compat, hostid_read()
Explicitly cast the sizeof in hostid_read() to prevent the
following compiler warning on 32-bit systems.

  module/spl/spl-generic.c:490:10: error: format '%lu' expects
  argument of type 'long unsigned int', but argument 4 has type
  'unsigned int' [-Werror=format]
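
The kind of cast involved can be shown with a small userspace example.
The function name and message text are hypothetical; only the
'%lu' plus explicit cast pattern is the point.

```c
#include <assert.h>
#include <stdio.h>

/* Format a size for a '%lu' conversion on 32-bit and 64-bit builds. */
static int model_format_size(char *buf, size_t buflen, size_t size)
{
	assert(buf != NULL && buflen > 0);
	/*
	 * sizeof yields a size_t, which is 'unsigned int' on many 32-bit
	 * targets.  The explicit cast makes the argument match '%lu'.
	 */
	return snprintf(buf, buflen, "expected %lu bytes",
	    (unsigned long)size);
}
```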

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-07-20 11:14:04 -07:00
Brian Behlendorf d503b971f4 Optimize spl_rwsem_is_locked()
The spl_rwsem_is_locked() compatibility function has been observed
to be a hot spot.  The root cause of this is that we must check the
rwsem activity under the rwsem->wait_lock to avoid a race.  When
the lock is busy significant contention can occur.

The upstream kernel fix for this race had the insight that by using
spin_trylock_irqsave() this contention could be avoided.  When the
lock is contended it's reasonable to return that it is locked.
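
A userspace model of the trylock approach is sketched below using a
C11 atomic_flag in place of the spin lock.  The struct layout and names
are hypothetical and interrupt state handling is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical model: 'activity' is only stable while wait_lock is held. */
struct model_rwsem {
	atomic_flag wait_lock;
	int activity;	/* non-zero while readers or a writer hold it */
};

static void model_rwsem_init(struct model_rwsem *sem)
{
	assert(sem != NULL);
	atomic_flag_clear(&sem->wait_lock);
	sem->activity = 0;
}

static int model_rwsem_is_locked(struct model_rwsem *sem)
{
	int locked = 1;	/* if wait_lock is contended, report "locked" */

	/* Try once instead of spinning; a set flag means it is held. */
	if (!atomic_flag_test_and_set(&sem->wait_lock)) {
		locked = (sem->activity != 0);
		atomic_flag_clear(&sem->wait_lock);
	}
	return locked;
}
```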

This change updates the SPL's implementation to be like the upstream
kernel.  Since the kernel code has been in use for years now this
is a low risk change.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-07-13 13:07:39 -07:00
Prakash Surya d801db1487 Move spl.release generation to configure step
Previously, the spl.release file was created at 'make install' time.
This is slightly problematic when the file is needed without running
'make install'. Because of this, the step creating the file was removed
from 'make install' and replaced with a more appropriate spl.release.in
file.

As a result, the spl.release file will now be created earlier as part
of the 'configure' step as opposed to the 'make install' step.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #135
2012-07-12 12:13:47 -07:00
Richard Yao 36811b4430 Detect kernels that honor gfp flags passed to vmalloc()
zfsonlinux/spl@2092cf68d8 used
PF_MEMALLOC to work around a bug in the Linux kernel where
allocations did not honor the gfp flags passed to vmalloc().
Unfortunately, PF_MEMALLOC has the side effect of permitting
allocations to allocate pages outside of ZONE_NORMAL. This
has been observed to result in the depletion of ZONE_DMA32.

A kernel patch is available in the Gentoo bug tracker for
this issue.

  https://bugs.gentoo.org/show_bug.cgi?id=416685

This negates any benefit PF_MEMALLOC provides, so we introduce
an autotools check to disable the use of PF_MEMALLOC on
systems with patched kernels.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #126
2012-07-11 11:44:27 -07:00
Richard Yao 973e8269bd Constify memory management functions
This prevents warnings in ZFS that were caused by changes necessary to
support PaX patched kernels. When debugging is enabled, these warnings
become build failures.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #131
2012-07-03 16:07:27 -07:00
Brian Behlendorf 33f507c0f3 Remove Chaos 4.x RPM support
The Chaos 4.x distribution is based on RHEL 5.x which is no longer
supported by ZoL since it uses a 2.6.18 kernel.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-07-02 15:17:08 -07:00
Prakash Surya 92c2f755ee Support debug and debug-devel sub packages
This commit adds support for building debug and debug-devel sub packages
of the spl-modules main package. This is to allow building packages
which are built against a debug kernel. By default, packages are
built only against a regular non-debug kernel. This can be toggled by passing
the '--with kernel-debug' parameter to rpmbuild.

Examples:

    # To build packages against only the non-debug kernel
    $ rpmbuild --rebuild --with kernel --without kernel-debug $SRPM

    # To build packages against only the debug kernel
    $ rpmbuild --rebuild --without kernel --with kernel-debug $SRPM

    # To build packages against debug and non-debug kernel
    $ rpmbuild --rebuild --with kernel --with kernel-debug $SRPM

Note: Only the RHEL 5/6, CHAOS 5, and Fedora distributions are supported
      for building the debug and debug-devel packages.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #115
2012-07-02 11:18:26 -07:00
Brian Behlendorf 44e406d712 PowerPC Compatibility
Usage of get_current() is not supported across all architectures.
The correct interface to use is the '#define current' which will
map to the appropriate function, usually current_thread_info().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #119
2012-07-02 09:33:09 -07:00
Brian Behlendorf 50fe7a010c SPL 0.6.0-rc9 2012-06-14 11:45:59 -07:00
Richard Yao e0093fea58 Linux 3.4 compat, __clear_close_on_exec replaces FD_CLR
torvalds/linux@1dce27c5aa introduced
__clear_close_on_exec() as a replacement for FD_CLR. Further commits
appear to have removed FD_CLR from the Linux source tree.  This
causes the following failure:

  error: implicit declaration of function '__FD_CLR'
  [-Werror=implicit-function-declaration]

To correct this we update the code to use the current
__clear_close_on_exec() interface for readability.  Then we introduce
an autotools check to determine if __clear_close_on_exec() is available.
If it isn't then we define some compatibility logic which uses the older
FD_CLR() interface.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #124
2012-06-13 16:18:51 -07:00
Brian Behlendorf eaac9ba510 Fix uninit variable in slab reclaim test
Gcc version 4.7.0 reports the delta.tv_sec in the slab reclaim test
as potentially uninitialized.  In practice this will never occur but
to keep gcc happy we initialize the variable to zero.

Signed-off-by: Brian Behlendorf <behlendo@fedora-17-amd64.(none)>
2012-06-13 16:17:22 -07:00
Brian Behlendorf 2371321e8a Fix invalid context bug
In the module unload path the vm_file_cache was being destroyed
under a spin lock.  Because this operation might sleep it was
possible, although very very unlikely, that this could result
in a deadlock.

This issue was identified by using a Linux debug kernel and
has been fixed by moving the kmem_cache_destroy() out from under
the spin lock.  There is no need to lock this operation here.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#771
2012-06-11 09:17:45 -07:00
Jorgen Lundman 93b0dc92ea Fix ARM 64-bit division
Correctly implementing 64-bit division for ARM requires more than
just providing the __aeabi_uldivmod() and __aeabi_ldivmod() symbols.
They need to be implemented in such a way that the quotient and
remainder are left in specific registers after the division operation
completes.  This change updates the wrapper functions to accomplish
this according to the official ARM Run-time ABI.
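
Register placement cannot be expressed in portable C, so the sketch
below only shows the shape of the result: one call that yields both the
quotient and the remainder.  As far as I understand the ARM run-time
ABI, __aeabi_uldivmod() hands these back as a register pair each
(quotient in r0-r1, remainder in r2-r3); the struct and function names
here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Both results of one division, returned together. */
struct model_uldivmod {
	uint64_t quot;
	uint64_t rem;
};

/* Unsigned 64-bit divide producing quotient and remainder; d != 0. */
static struct model_uldivmod model_uldivmod(uint64_t n, uint64_t d)
{
	struct model_uldivmod r;

	assert(d != 0);
	r.quot = n / d;
	r.rem = n - r.quot * d;
	return r;
}
```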

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#706
2012-05-22 09:27:11 -07:00
Brian Behlendorf 38d31a1e57 Remove Solaris module emulation
Originally I believed that these interfaces would be needed.
However, in practice it turned out that it was more straightforward
and maintainable to use the native Linux interfaces.
As such, this is all dead code and can be safely removed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #109
2012-05-18 13:57:44 -07:00
Richard Yao f90096c905 Modify KM_PUSHPAGE to use GFP_NOIO instead of GFP_NOFS
The resolution of issue #31 made KM_PUSHPAGE imply GFP_NOFS.  This
was done to prevent situations where filesystem operations which are
holding locks enter direct reclaim and attempt to reacquire those
same locks.  This clearly will result in a deadlock.

This works for datasets which are implemented in terms of filesystem
operations.  But unfortunately, swapping to a zvol will encounter
many of the same deadlocks and GFP_NOFS will not prevent this.  As
such, it is appropriate to extend KM_PUSHPAGE to use the broader
GFP_NOIO mask to handle these non-filesystem cases.
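
The relationship between the masks can be illustrated with a userspace
model.  The bit values and MODEL_ names are made up for the sketch and
any extra modifier bits the real KM_ flags carry are omitted; what
matters is that the NOFS mask still permits I/O while the NOIO mask
permits neither I/O nor filesystem re-entry.

```c
/* Illustrative bit values only; not the kernel's definitions. */
#define MODEL_GFP_WAIT	0x1u	/* may sleep */
#define MODEL_GFP_IO	0x2u	/* may start physical I/O */
#define MODEL_GFP_FS	0x4u	/* may call into the filesystem */

#define MODEL_GFP_KERNEL	(MODEL_GFP_WAIT | MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_GFP_NOFS		(MODEL_GFP_WAIT | MODEL_GFP_IO)
#define MODEL_GFP_NOIO		(MODEL_GFP_WAIT)

/* Simplified mapping of the Solaris-style flags after this change. */
#define MODEL_KM_SLEEP		MODEL_GFP_KERNEL
#define MODEL_KM_PUSHPAGE	MODEL_GFP_NOIO

static int model_may_start_io(unsigned int flags)
{
	return (flags & MODEL_GFP_IO) != 0;
}

static int model_may_enter_fs(unsigned int flags)
{
	return (flags & MODEL_GFP_FS) != 0;
}
```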

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/zfs#342
Closes #105
2012-05-07 12:05:27 -07:00
Prakash Surya a9a7a01cf5 Add SPLAT test to exercise slab direct reclaim
This test is designed to verify that direct reclaim is functioning as
expected.  We allocate a large number of objects thus creating a large
number of slabs.  We then apply memory pressure and expect that the
direct reclaim path can easily recover those slabs.  The registered
reclaim function will free the objects and the slab shrinker will call
it repeatedly until at least a single slab can be freed.

Note it may not be possible to reclaim every last slab via direct reclaim
without a failure because the shrinker_rwsem may be contended.  For this
reason, quickly reclaiming 3/4 of the slabs is considered a success.

This should all be possible within 10 seconds.  For reference, on a
system with 2G of memory this test takes roughly 0.2 seconds to run.
It may take longer on larger memory systems but should still easily
complete in the allotted 10 seconds.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #107
2012-05-07 11:55:59 -07:00
Brian Behlendorf b78d4b9d98 Ensure a minimum of one slab is reclaimed
To minimize the chance of triggering an OOM during direct reclaim,
the kmem caches have been improved to make a best effort to reclaim
at least one slab when a reclaim function is registered.  This helps
avoid the case where objects are released but they are spread over
multiple slabs so no memory gets reclaimed.

Care has been taken to avoid deadlocking if the reclaim function
is unable to make forward progress.  Additionally, the reclaim
function may be skipped entirely if there are already free slabs
which can be safely reaped.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #107
2012-05-07 11:54:28 -07:00
Brian Behlendorf 06089b9e19 Ensure direct reclaim forward progress
The Linux direct reclaim path uses this out of band value to
determine if forward progress is being made.  Normally this is
incremented by kmem_freepages() which is part of the various
Linux slab implementations.  However, since we are using none
of that infrastructure we're responsible for incrementing this
count.

If no forward progress is detected and a subsequent allocation
fails the OOM killer will be invoked.  If there was forward
progress additional reclaim will be attempted via the page
cache and registered shrinkers until the allocation succeeds.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #107
2012-05-07 11:54:19 -07:00
Prakash Surya c0e0fc14e3 Ignore slab cache age and delay in direct reclaim
When memory pressure triggers direct memory reclaim, a slab's age
and delay should not prevent it from being freed. This patch ensures
these values are ignored, allowing an empty slab to be freed in this
code path no matter the value of its age and delay.

This prevents needless scanning of the partial slabs and has been
observed to significantly reduce the total cpu usage.  In addition,
it should allow for snappier reclaim under memory pressure.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #102
2012-05-07 11:50:04 -07:00
Prakash Surya cef7605c34 Throttle number of freed slabs based on nr_to_scan
Previously, the SPL tried to maintain Solaris semantics by freeing
all available (empty) slabs from its slab caches when the shrinker
was called. This is not desirable when running on Linux. To make
the SPL shrinker more Linux friendly, the actual number of freed
slabs from each of the slab caches is now derived from nr_to_scan
and skc_slab_objs.

Additionally, an accounting bug was fixed in spl_slab_reclaim()
which could cause us to reclaim one more slab than requested.
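
As a rough illustration of the throttling described above, the sketch
below derives a per-cache slab count from nr_to_scan and skc_slab_objs
and caps the reclaim loop at that count.  The helper names, the plain
division, and the minimum of one slab are assumptions made for
illustration only; the actual SPL code may compute this differently.

```c
/*
 * Hypothetical helper: convert a shrinker object count (nr_to_scan)
 * into a number of slabs, given the objects per slab (slab_objs).
 */
static unsigned long
sketch_slabs_to_free(unsigned long nr_to_scan, unsigned long slab_objs)
{
	unsigned long n;

	if (nr_to_scan == 0 || slab_objs == 0)
		return (0);

	n = nr_to_scan / slab_objs;
	return (n > 0 ? n : 1);	/* assumption: try at least one slab */
}

/*
 * Hypothetical reclaim loop: frees at most 'count' of the 'empty'
 * slabs and returns how many were freed.  Checking the limit before
 * each free avoids reclaiming one more slab than requested.
 */
static unsigned long
sketch_reclaim(unsigned long empty, unsigned long count)
{
	unsigned long freed = 0;

	while (empty > 0 && freed < count) {
		empty--;
		freed++;
	}
	return (freed);
}
```

With 32 objects per slab, a request to scan 128 objects maps to four
slabs in this sketch rather than every empty slab in the cache.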

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #101
2012-05-07 11:46:15 -07:00
Jorgen Lundman ef6f91ce0c Add missing 64-bit divide for 32-bit ARM
Leverage the existing generic 64-bit division operations which
were originally implemented for x86 to support ARM.  All that is
required is to make the symbols available to the linker with the
expected names.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-05-03 10:07:54 -07:00
Jorgen Lundman cb75844e85 Define the needed ISA types for ARM
Add the minimum required ISA types to support the ARM architecture.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-05-03 09:56:15 -07:00
Prakash Surya 05b8f50c33 Update a comment to reflect new taskq internals
As of the removal of the taskq work list made in commit:

    commit 2c02b71b14
    Author: Prakash Surya <surya1@llnl.gov>
    Date:   Mon Dec 5 17:32:48 2011 -0800

        Replace tq_work_list and tq_threads in taskq_t

        To lay the ground work for introducing the taskq_dispatch_prealloc()
        interface, the tq_work_list and tq_threads fields had to be replaced
        with new alternatives in the taskq_t structure.

the comment above taskq_wait_check has been incorrect. This change is an
attempt at bringing that description more in line with the current
implementation. Essentially, references to the old task work list had to
be updated to reference the new taskq thread active list.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
2012-04-30 10:49:15 -07:00
Brian Behlendorf b29012b999 Remove condition variable names
Long ago I added support to the spl for condition variable names
because I thought they might be needed.  It turns out they aren't.
In fact the official Solaris cv_init(9F) man page discourages
their use in the kernel.

  cv_init(9F)
    Parameters
      name - Descriptive string. This is obsolete and should be
             NULL. (Non-NULL strings are legal, but they're a
             waste of kernel memory.)

Therefore, I'm removing them from the spl to reclaim this memory
and adding an ASSERT() to ensure no new consumers are added which
make use of the name.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-04-06 12:06:19 -07:00
Brian Behlendorf 8920c6918a SPL 0.6.0-rc8 2012-03-26 11:57:13 -07:00
Brian Behlendorf 0835057ee7 Add SPL_META_RELEASE to module load/unload messages
Include the SPL_META_RELEASE in the module load/unload messages
to more clearly indicate exactly what version of the SPL has
been loaded.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-03-23 12:11:50 -07:00
Brian Behlendorf 5d139aaa2b SPL 0.6.0-rc7 2012-03-16 11:28:28 -07:00
Brian Behlendorf a3a69b74cd Fix distribution detection
Improve the distribution detection by moving the tests for
distribution specific files first.  The Ubuntu and Debian
checks are left for last because they are the least likely
to be unique.  This is particularly true in the case of Debian
since so many distributions are based on Debian.

Since this is currently only used to identify the correct
packaging method for this system the result in many instances
is simply cosmetic.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-03-05 10:38:38 -08:00
Brian Behlendorf 3c208a5480 Cleanly support debug packages
Allow a source rpm to be rebuilt with debugging enabled.  This
avoids the need to have to manually modify the spec file.  By
default debugging is still largely disabled.  To enable specific
debugging features use the following options with rpmbuild.

  '--with debug'               - Enables ASSERTs
  '--with debug-log'           - Enables the internal debug log
  '--with debug-kmem'          - Enables basic memory accounting
  '--with debug-kmem-tracking' - Enables detailed memory tracking

  # For example:
  $ rpmbuild --rebuild --with debug spl-modules-0.6.0-rc6.src.rpm

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-27 14:24:22 -08:00
Brian Behlendorf feedc43601 Add missing spl_debug_* helpers
When building the spl with --disable-debug-log the __SDEBUG()
macro and spl_debug_* helper functions were undefined.  This
change adds the missing functions so the upper layers compiling
against the spl don't need to be aware of how the spl was built.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-09 16:41:46 -08:00
Brian Behlendorf 9a8b7a7458 Add basic dynamic kstat support
Add the bare minimum functionality to support dynamic kstats.  A
complete kstat implementation should be done as part of issue #84.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #84
2012-02-02 11:28:00 -08:00
Brian Behlendorf 4b2220f0b9 Add --enable-debug-log configure option
Until now the notion of an internal debug logging infrastructure
was conflated with enabling ASSERT()s.  This patch clarifies things
by cleanly breaking the two subsystems apart.  The result of this
is the following behavior.

--enable-debug      - Enable/disable code wrapped in ASSERT()s.
--disable-debug       ASSERT()s are used to check invariants and
                      are never required for correct operation.
                      They are disabled by default because they
                      may impact performance.

--enable-debug-log  - Enable/disable the debug log infrastructure.
--disable-debug-log   This infrastructure allows the spl code and
                      its consumer to log messages to an in-kernel
                      log.  The granularity of the logging can be
                      controlled by a debug mask.  By default the
                      mask disables most debug messages resulting
                      in a negligible performance impact.  Because
                      of this the debug log is enabled by default.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-02 11:27:54 -08:00
Ned Bass 3c6ed5410b Taskq locking optimizations
Testing has shown that tq->tq_lock can be highly contended when a
large number of small work items are dispatched.  The lock hold time
is reduced by the following changes:

1) Use exclusive threads in the work_waitq

When a single work item is dispatched we only need to wake a single
thread to service it.  The current implementation uses non-exclusive
threads so all threads are woken when the dispatcher calls wake_up().
If a large number of threads are in the queue this overhead can become
non-negligible.

2) Conditionally add/remove threads from work waitq

Taskq threads need only add themselves to the work wait queue if
there are no pending work items.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #32
2012-01-19 14:42:49 -08:00
Ned Bass 0bb43ca282 Revert "Taskq locking optimizations"
This reverts commit ec2b41049f.

A race condition was introduced by which a wake_up() call can be lost
after the taskq thread determines there are no pending work items,
leading to deadlock:

1. taskq thread enables interrupts
2. dispatcher thread runs, queues work item, calls wake_up()
3. taskq thread runs, adds self to waitq, sleeps

This could easily happen if an interrupt for an IO completion was
outstanding at the point where the taskq thread reenables interrupts,
just before the call to add_wait_queue_exclusive().  The handler would
run immediately within the race window.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #32
2012-01-19 14:42:39 -08:00
Brian Behlendorf 87d1123924 Fix rpm dependencies
This change updates the rpm spec files to have strictly correct
package dependencies.  That means a few things:

* Add a dependency to the spl package for the spl-modules package.
  This ensures that when running 'yum install spl' the newest
  version of the spl-modules will be installed.

* Remove the redundant distribution release extension.  This
  is already added once because it is part of the kernel package
  release name.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-01-18 11:24:36 -08:00
Brian Behlendorf a2eda2ff48 Add the release component to headers
When the original build system code was added the release
component was accidentally omitted from the development header
install path.  This patch adds the missing path component so
it's always clear exactly what release you're compiling against.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-01-18 11:06:26 -08:00
Ned Bass ec2b41049f Taskq locking optimizations
Testing has shown that tq->tq_lock can be highly contended when a
large number of small work items are dispatched.  The lock hold time
is reduced by the following changes:

1) Use exclusive threads in the work_waitq

When a single work item is dispatched we only need to wake a single
thread to service it.  The current implementation uses non-exclusive
threads so all threads are woken when the dispatcher calls wake_up().
If a large number of threads are in the queue this overhead can become
non-negligible.

2) Conditionally add/remove threads from work waitq outside of tq_lock

Taskq threads need only add themselves to the work wait queue if there
are no pending work items.  Furthermore, the add and remove function
calls can be made outside of the taskq lock since the wait queues are
protected from concurrent access by their own spinlocks.

3) Call wake_up() outside of tq->tq_lock

Again, the wait queues are protected by their own spinlock, so the
dispatcher functions can drop tq->tq_lock before calling wake_up().

A new splat test taskq:contention was added in a prior commit to measure
the impact of these changes.  The following table summarizes the
results using data from the kernel lock profiler.

                        tq_lock time    %diff   Wall clock (s)  %diff
original:               39117614.10     0       41.72           0
exclusive threads:      31871483.61     18.5    34.2            18.0
unlocked add/rm waitq:  13794303.90     64.7    16.17           61.2
unlocked wake_up():     1589172.08      95.9    16.61           60.2

Each row reflects the average result over 5 test runs.
/proc/lock_stats was zeroed out before and collected after each run.
Column 1 is the cumulative hold time in microseconds for tq->tq_lock.
The tests are cumulative; each row reflects the code changes of the
previous rows.  %diff is calculated with respect to "original" as
100*(orig-new)/orig.
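
The %diff formula above can be written as a small helper.  This is only
a sketch for checking the table; the function name is hypothetical and
it is not part of the splat test itself.

```c
/* Percentage improvement of 'new_val' relative to 'orig'. */
static double
pct_diff(double orig, double new_val)
{
	return (100.0 * (orig - new_val) / orig);
}
```

For example, pct_diff(39117614.10, 31871483.61) evaluates to roughly
18.5, which matches the "exclusive threads" row.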

Although calling wake_up() outside of the taskq lock dramatically
reduced the taskq lock hold time, the test actually took slightly more
wall clock time.  This is because the point of contention shifts from
the taskq lock to the wait queue lock.  But the change still seems
worthwhile since it removes our taskq implementation as a bottleneck,
assuming the small increase in wall clock time to be statistical
noise.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #32
2012-01-18 10:36:57 -08:00
Ned Bass cf5d23fa1e Add taskq contention splat test
Add a test designed to generate contention on the taskq spinlock by
using a large number of threads (100) to perform a large number (131072)
of trivial work items from a single queue.  This simulates conditions
that may occur with the zio free taskq when a 1TB file is removed from a
ZFS filesystem, for example.  This test should always pass.  Its purpose
is to provide a benchmark to easily measure the effectiveness of taskq
optimizations using statistics from the kernel lock profiler.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #32
2012-01-18 10:36:51 -08:00
Darik Horn 966e5200d3 Fix make distclean for --with-config=user
Apply the same fix to SPL that was applied to ZFS earlier at:
zfsonlinux/zfs@d433c20651

Additionally quote @LINUX_SYMBOLS@ because it is a null substitution
in this configuration, which results in a `[ -f  ]` expression that
incorrectly evaluates to true.

  # ./configure --with-config=user
  # make distclean

  Making distclean in module
  make[1]: Entering directory `/spl/module'
  make -C  SUBDIRS=`pwd`  clean
  make: Entering an unknown directory
  make: *** SUBDIRS=/spl/module: No such file or directory.  Stop.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-01-17 10:06:00 -08:00
Brian Behlendorf 0b14b9f327 Run SPL_AC_PACMAN only if $VENDOR is "arch"
Unfortunately, Arch's package manager `pacman` shares its name with a
popular arcade video game. Thus, in order to refrain from executing the
video game when we mean to execute the package manager, SPL_AC_PACMAN is
now only run when $VENDOR is determined to be "arch".

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#517
2012-01-13 09:08:12 -08:00
Darik Horn 588d900433 Linux 3.2 compat: rw_semaphore.wait_lock is raw
The wait_lock member of the rw_semaphore struct became a raw_spinlock_t
in Linux 3.2 at torvalds/linux@ddb6c9b58a.

Wrap spin_lock_* function calls in a new spl_rwsem_* interface to
ensure type safety if raw_spinlock_t becomes architecture specific,
and to satisfy these compiler warnings:

  warning: passing argument 1 of ‘spinlock_check’
    from incompatible pointer type [enabled by default]
  note: expected ‘struct spinlock_t *’
    but argument is of type ‘struct raw_spinlock_t *’

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #76
Closes: zfsonlinux/zfs#463
2012-01-11 16:28:05 -08:00
Brian Behlendorf 5f6c14b1ed Proxmox VE kernel compat, invalidate_inodes()
The Proxmox VE kernel contains a patch which renames the function
invalidate_inodes() to invalidate_inodes_check().  In the process
it adds a 'check' argument and a '#define invalidate_inodes(x)'
compatibility wrapper for legacy callers.  Therefore, if either
of these functions is exported, invalidate_inodes() can be
safely used.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #58
2011-12-21 14:29:45 -08:00
Prakash Surya cd2817f8a6 Move Arch Linux's VENDOR check above Ubuntu's
If the lsb-release package is installed on an Arch Linux distribution,
the configure step will incorrectly detect the running distribution as
Ubuntu. This is a result of both distributions providing an
/etc/lsb-release file, and the Ubuntu VENDOR check being performed
first.

Since the Arch Linux test checks for a file more specific to the Arch
Linux distribution, moving Arch Linux's VENDOR check above Ubuntu's
check provides a quick and easy solution.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #72
2011-12-19 12:03:40 -08:00
Prakash Surya 8f2503e0af Store copy of tqent_flags prior to servicing task
A preallocated taskq_ent_t's tqent_flags must be checked prior to
servicing the taskq_ent_t. Once a preallocated taskq entry is serviced,
the ownership of the entry is handed back to the caller of
taskq_dispatch, thus the entry's contents can potentially be mangled.

In particular, this is a problem in the case where a preallocated taskq
entry is serviced, and the caller clears its tqent_flags field. Thus,
when the function returns and task_done is called, it looks as though
the entry is **not** a preallocated task (when in fact it **is** a
preallocated task).

In this situation, task_done will place the preallocated taskq_ent_t
structure onto the taskq_t's free list. This is a **huge** mistake. If
the taskq_ent_t is then freed by the caller of taskq_dispatch, the
taskq_t's free list will hold a pointer to garbage data. Even worse, if
nothing has overwritten the freed memory before the pointer is
dereferenced, it may still look as though it points to a valid list_head
belonging to a taskq_ent_t structure.

Thus, the task entry's flags are now copied prior to servicing the task.
This copy is then checked to see if it is a preallocated task, and to
determine if the entry needs to be passed down to the task_done
function.
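
The pattern can be sketched in user space as follows.  The structure,
flag value, and function names here are hypothetical stand-ins rather
than the real taskq definitions; only the ordering (copy the flags,
then run the task, then consult the copy) reflects the fix described
above.

```c
#define	SKETCH_TQENT_FLAG_PREALLOC	0x1	/* hypothetical flag value */

typedef struct sketch_ent {
	unsigned int	flags;
	void		(*func)(void *);
	void		*arg;
} sketch_ent_t;

/* Runs the task and returns non-zero if the entry was preallocated. */
static int
sketch_service(sketch_ent_t *t)
{
	unsigned int flags = t->flags;	/* copy before running the task */

	t->func(t->arg);	/* the owner may now modify or free *t */

	/* Only the saved copy is consulted from here on. */
	return ((flags & SKETCH_TQENT_FLAG_PREALLOC) != 0);
}

/* Example task which clears the flags of the entry it is handed. */
static void
sketch_clear_flags(void *arg)
{
	((sketch_ent_t *)arg)->flags = 0;
}
```

Even when the task clears the flags, as sketch_clear_flags() does, the
saved copy still identifies the entry as preallocated.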

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #71
2011-12-16 16:54:00 -08:00
Prakash Surya e7e5f78e7b Swap taskq_ent_t with taskqid_t in taskq_thread_t
The taskq_t's active thread list is sorted based on its
tqt_ent->tqent_id field. The list is kept sorted solely by inserting
new taskq_thread_t's in their correct sorted location; no other
means is used. This means that once inserted, if a taskq_thread_t's
tqt_ent->tqent_id field changes, the list runs the risk of no
longer being sorted.

Prior to the introduction of the taskq_dispatch_prealloc() interface,
this was not a problem as a taskq_ent_t actively being serviced under
the old interface should always have a static tqent_id field. Thus,
once the taskq_thread_t is added to the taskq_t's active thread list,
the taskq_thread_t's tqt_ent->tqent_id field would remain constant.

Now, this is no longer the case. Currently, if using the
taskq_dispatch_prealloc() interface, any given taskq_ent_t actively
being serviced _may_ have its tqent_id value incremented. This happens
when the preallocated taskq_ent_t structure is recursively dispatched.
Thus, a taskq_thread_t could potentially have its tqt_ent->tqent_id
field silently modified from under its feet. If this were to happen
to a taskq_thread_t on a taskq_t's active thread list, this would
compromise the integrity of the order of the list (as the list
_may_ no longer be sorted).

To get around this, the taskq_thread_t's taskq_ent_t pointer was
replaced with its own static copy of the tqent_id. So, as a taskq_ent_t
is pulled off of the taskq_t's pending list, a static copy of its
tqent_id is made and this copy is used to sort the active thread
list. Using a static copy is key in ensuring the integrity of the
order of the active thread list. Even if the underlying taskq_ent_t
is recursively dispatched (and has its tqent_id modified), this
static copy stored inside the taskq_thread_t will remain constant.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #71
2011-12-16 13:26:54 -08:00
Prakash Surya c2dceb5cd5 Add make rule for building Arch Linux packages
Added the necessary build infrastructure for building packages
compatible with the Arch Linux distribution. As such, one can now run:

    $ ./configure
    $ make pkg     # Alternatively, one can run 'make arch' as well

on an Arch Linux machine to create two binary packages compatible with
the pacman package manager, one for the spl userland utilities and
another for the spl kernel modules. The new packages can then be
installed by running:

    # pacman -U $package.pkg.tar.xz

In addition, source-only packages suitable for an Arch Linux chroot
environment or remote builder can also be built using the 'sarch' make
rule.

NOTE: Since the source dist tarball is created on the fly from the head
of the build tree, its MD5 hash signature will be continually in flux.
As a result, the md5sum variable was intentionally omitted from the
PKGBUILD files, and the '--skipinteg' makepkg option is used. This may
or may not have any serious security implications, as the source tarball
is not being downloaded from an outside source.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #68
2011-12-14 16:44:10 -08:00
Prakash Surya 699d5ee8a9 Exercise new taskq interface in splat-taskq tests
The splat-taskq test functions were slightly modified to exercise
the new taskq interface in addition to the old interface.  If the
old interface passes each of its tests, the new interface is
exercised.  Both sub tests (old interface and new interface) must
pass for each test as a whole to pass.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #65
2011-12-13 16:10:57 -08:00
Prakash Surya 44217f7aad Implement taskq_dispatch_prealloc() interface
This patch implements the taskq_dispatch_prealloc() interface which
was introduced by the following illumos-gate commit.  It allows for
a preallocated taskq_ent_t to be used when dispatching items to a
taskq.  This eliminates a memory allocation which helps minimize
lock contention in the taskq when dispatching functions.

    commit 5aeb94743e3be0c51e86f73096334611ae3a058e
    Author: Garrett D'Amore <garrett@nexenta.com>
    Date:   Wed Jul 27 07:13:44 2011 -0700

    734 taskq_dispatch_prealloc() desired
    943 zio_interrupt ends up calling taskq_dispatch with TQ_SLEEP

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
2011-12-13 16:10:57 -08:00
Prakash Surya ac1e5b6033 Add Test: "Single task queue, recursive dispatch"
Added another splat taskq test to ensure tasks can be recursively
submitted to a single task queue without issue. When the
taskq_dispatch_prealloc() interface is introduced, this use case
can potentially cause a deadlock if a taskq_ent_t is dispatched
while its tqent_list field is not empty. This _should_ never be
a problem with the existing taskq_dispatch() interface.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
2011-12-13 16:10:57 -08:00
Prakash Surya 2c02b71b14 Replace tq_work_list and tq_threads in taskq_t
To lay the ground work for introducing the taskq_dispatch_prealloc()
interface, the tq_work_list and tq_threads fields had to be replaced
with new alternatives in the taskq_t structure.

The tq_threads field was replaced with tq_thread_list. Rather than
storing the pointers to the taskq's kernel threads in an array, they are
now stored as a list. In addition to laying the ground work for the
taskq_dispatch_prealloc() interface, this change could also enable taskq
threads to be dynamically created and destroyed as threads can now be
added to and removed from this list relatively easily.

The tq_work_list field was replaced with tq_active_list. Instead of
keeping a list of taskq_ent_t's which are currently being serviced, a
list of taskq_threads currently servicing a taskq_ent_t is kept. This
frees up the taskq_ent_t's tqent_list field when it is being serviced
(i.e. now when a taskq_ent_t is being serviced, its tqent_list field
will be empty).

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
2011-12-13 16:10:50 -08:00
Prakash Surya 046a70c93b Replace struct spl_task with struct taskq_ent
The spl_task structure was renamed to taskq_ent, and all of
its fields were renamed to have a prefix of 'tqent' rather
than 't'. This was to align with the naming convention which
the ZFS code assumes.  Previously these fields were private
so the name never mattered.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #65
2011-12-13 12:28:09 -08:00
Prakash Surya ed948fa72b Add SPLAT_TEST_FINI call for SPLAT_TASKQ_TEST6_ID
This change adds the neglected SPLAT_TEST_FINI call for the
SPLAT_TASKQ_TEST6_ID, just as is done for the other 5 SPLAT_TASKQ_*
tests.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #64
2011-12-13 12:26:16 -08:00
Prakash Surya 93806f58a6 Fix usage of MUTEX macro in mutex_enter_nested
A call site of the MUTEX macro had incorrectly placed its closing
parenthesis, causing two parameters to be passed rather than one. This
change moves the misplaced parenthesis to fix the typographical error.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #70
2011-12-13 11:04:21 -08:00
Chris Dunlop 791dc876eb Allow 64-bit timestamps to be set on 64-bit kernels
ZFS and 64-bit linux are perfectly capable of dealing with 64-bit
timestamps, but ZFS deliberately prevents setting them.  Adjust
the SPL such that TIMESPEC_OVERFLOW will not always assume 32-bit
values and instead use the correct values for your kernel build.
This effectively allows 64-bit timestamps on 64-bit systems.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes ZFS issue #487
2011-12-12 11:06:03 -08:00
Prakash Surya e05bec805b Fix a typo referencing an incorrect symbol
The splat_taskq_test4_common function was incorrectly referencing
the splat_taskq_test13_func symbol, when it meant to be using the
splat_taskq_test4_func symbol.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #61
2011-11-21 16:52:36 -08:00
Brian Behlendorf 1114ae6ae7 Prepend spl_ to all init/fini functions
This is a bit of cleanup I'd been meaning to get to for a while
to reduce the chance of a type conflict.  Well that conflict
finally occurred with the kstat_init() function which conflicts
with a function in the 2.6.32-6-pve kernel.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #56
2011-11-11 09:18:28 -08:00
Brian Behlendorf 948914d2f1 Fix depmod warning
The depmod utility from module-init-tools 3.12-pre3 generates a
warning when the -e option is used without -E or -F.  This was
observed under OpenSuse 11.4.  To resolve the issue when the
exact System.map-* for your kernel cannot be found, fall back to
a generic safe '/sbin/depmod -a'.

  WARNING: -e needs -E or -F

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-11-10 10:36:30 -08:00
Brian Behlendorf fe71c0e567 Linux 3.1 compat, shrink_*cache_memory
As of Linux 3.1 the shrink_dcache_memory and shrink_icache_memory
functions have been removed.  This same task is now accomplished
more cleanly with per super block shrinkers.  This unfortunately
leaves us no easy way to support the dnlc_reduce_cache() function.

This support has always been entirely optional.  So when no
reasonable interface is available allow the dnlc_reduce_cache()
function to effectively become a no-op.

The downside of this change is that it will prevent the zfs arc
meta data limits from being enforced.  However, the current zfs
implementation in this regard is already flawed and needs to
be reworked.  If the arc needs to enforce a meta data limit it
will need to be extended to coordinate directly with the zpl.
This will allow us to drop all this compatibility code and get
more fine grained control over the cache management.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #52
2011-11-09 19:36:30 -08:00
Brian Behlendorf 0d0b523728 Linux 3.1 compat, vfs_fsync()
Preferentially use the vfs_fsync() function.  This function was
initially introduced in 2.6.29 and took three arguments.  As
of 2.6.35 the dentry argument was dropped from the function.
For older kernels fall back to using file_fsync() which also
took three arguments including the dentry.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #52
2011-11-09 19:36:21 -08:00
Brian Behlendorf 12ff95ff57 Linux 3.1 compat, kern_path_parent()
Prior to Linux 3.1 the kern_path_parent symbol was exported for
use by kernel modules.  As of Linux 3.1 it is no longer easily
available.  To handle this case the spl will now dynamically
look up the address of the missing symbol at module load time.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #52
2011-11-09 16:51:25 -08:00
Brian Behlendorf b8b6e4c453 Fix NULL deref in balance_pgdat()
Be careful not to unconditionally clear the PF_MEMALLOC bit in
the task structure.  It may have already been set when entering
kv_alloc() in which case it must remain set on exit.  In
particular the kswapd thread will have PF_MEMALLOC set in
order to prevent it from entering direct reclaim.  By clearing
it we allow the following NULL deref to potentially occur.

  BUG: unable to handle kernel NULL pointer dereference at (null)
  IP: [<ffffffff8109c7ab>] balance_pgdat+0x25b/0x4ff
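
A minimal user-space sketch of the save-and-restore pattern described
above follows.  The flag value, the global variable standing in for
current->flags, and the function name are all assumptions made for
illustration; this is not the kv_alloc() implementation.

```c
#define	SKETCH_PF_MEMALLOC	0x800	/* stand-in value, illustration only */

static unsigned int sketch_task_flags;	/* models current->flags */

static void
sketch_kv_alloc(void)
{
	/* Remember whether the bit was already set on entry. */
	unsigned int was_set = sketch_task_flags & SKETCH_PF_MEMALLOC;

	sketch_task_flags |= SKETCH_PF_MEMALLOC;

	/* ... the allocation itself would happen here ... */

	/* Only clear the bit if this function was the one to set it. */
	if (!was_set)
		sketch_task_flags &= ~SKETCH_PF_MEMALLOC;
}
```

A caller such as kswapd, which enters with the bit already set, leaves
with it still set; any other caller leaves with it clear.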

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes ZFS issue #287
2011-11-03 09:50:22 -07:00
Brian Behlendorf 16952a68f2 Include distribution in release
Common practice is to include the distribution in the package release.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-10-19 11:41:15 -07:00
Gunnar Beutner f5e76dea03 Cleaned up MUTEX() #define
The old define assumed a specific layout of the kmutex_t struct. This
patch makes the macro independent from the actual struct layout.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-10-19 09:59:32 -07:00
Gunnar Beutner 66cdc93b8c Remove the spinlocks for mutex_enter()/mutex_exit()
The m_owner variable is protected by the mutex itself. Reading the variable
is guaranteed to be atomic (due to it being a word-sized reference) and
ACCESS_ONCE() takes care of read cache effects.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-10-19 09:58:57 -07:00
Gunnar Beutner 3160d4f56b Fix race condition in mutex_exit()
On kernels with CONFIG_DEBUG_MUTEXES mutex_exit() clears the mutex
owner after releasing the mutex. This would cause mutex_owner()
to return an incorrect owner if another thread managed to lock the
mutex before mutex_exit() had a chance to clear the owner.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes ZFS issue #167
2011-10-19 09:58:41 -07:00
Gunnar Beutner f3989ed322 vn_rdwr() didn't properly advance the file position
This would cause problems when using 'zfs send' with a file as the
target (rather than a pipe or a socket as is usually the case) as
for each write the destination offset in the file would be 0.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes ZFS issue #391
2011-10-18 16:51:35 -07:00
Brian Behlendorf a49bc99689 Fix package URLs to use the github repository
The URL field in the spl-modules and spl package spec files were
updated to point to the ZFS on Linux repository hosted by github.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-10-17 16:42:50 -07:00
Brian Behlendorf ecc3981007 Fix various typos in comments
Just clean up some of the typos and spelling mistakes in the
comments of spl-kmem.c.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-10-11 10:32:49 -07:00
Gunnar Beutner 8d177c181f Fixed typo in spl_slab_alloc()
The typo did not have any effect (apart from a negligible performance
impact) because skc->skc_flags * KMC_OFFSLAB is always non-null when
at least one bit in skc->skc_flags is set.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-10-11 10:03:43 -07:00
Gunnar Beutner 64c075c3f4 Properly destroy work items in spl_kmem_cache_destroy()
In a non-debug build the ASSERT() would be optimized away
which could cause pending work items to not be cancelled.

We must also use cancel_delayed_work_sync() rather than just
cancel_delayed_work() to actually wait until work items have
completed.  Otherwise they might accidentally access free'd
memory.
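
The first problem is the usual hazard of placing a call with side
effects inside an assertion.  The sketch below models a non-debug
build with a hypothetical SKETCH_ASSERT() macro which expands to
nothing; the function names are invented for illustration and do not
correspond to the real spl_kmem_cache_destroy() code.

```c
/* In a non-debug build the assertion, and its argument, vanish. */
#define	SKETCH_ASSERT(x)	((void)0)

static int sketch_cancelled;	/* counts how often work was cancelled */

static int
sketch_cancel_work(void)
{
	sketch_cancelled++;
	return (1);
}

/* Buggy: the cancel call only exists when assertions are enabled. */
static void
sketch_destroy_buggy(void)
{
	SKETCH_ASSERT(sketch_cancel_work());
}

/* Fixed: always make the call, then assert on the saved result. */
static void
sketch_destroy_fixed(void)
{
	int rc = sketch_cancel_work();

	SKETCH_ASSERT(rc);
	(void) rc;	/* avoid an unused variable warning */
}
```

Only sketch_destroy_fixed() actually cancels the work when assertions
are compiled out.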

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes ZFS bugs #279, #62, #363, #418
2011-10-11 09:59:19 -07:00
Gunnar Beutner 763b2f3b57 Fixed invalid resource re-use in file_find()
File descriptors are a per-process resource. The same descriptor
in different processes can refer to different files. file_find()
incorrectly assumed that file descriptors are globally unique.
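
One way to picture the corrected lookup is to key it on both the
descriptor and the owning process.  The list layout, field names, and
function name below are hypothetical and are not taken from the SPL
source; the sketch only shows why the descriptor alone is not a
sufficient key.

```c
#include <stddef.h>

typedef struct sketch_file {
	int			fd;
	const void		*task;	/* stand-in for the owning process */
	struct sketch_file	*next;
} sketch_file_t;

/* Match on descriptor AND owner; the same fd may exist per process. */
static sketch_file_t *
sketch_file_find(sketch_file_t *head, int fd, const void *task)
{
	sketch_file_t *fp;

	for (fp = head; fp != NULL; fp = fp->next) {
		if (fp->fd == fd && fp->task == task)
			return (fp);
	}
	return (NULL);
}
```

Two processes which both hold descriptor 3 now resolve to their own
entries instead of whichever one happens to be found first.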

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes ZFS issue #386
2011-10-11 09:51:51 -07:00
Brian Behlendorf 4a777c028c Prep spl-0.6.0-rc6 tag
Create the sixth 0.6.0 release candidate tag (rc6).
2011-10-06 14:58:09 -07:00
Brian Behlendorf 6b3b569df3 Remove /etc/hostid missing warning
No longer print the following warning to the console when the
/etc/hostid file is missing.  This is the expected default behavior.
Keeping the hostid in sync with the initramfs is now accomplished
by creating the /etc/hostid in the initramfs not on the system.

  SPL: The /etc/hostid file is not found.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-10-06 14:58:09 -07:00
Brian Behlendorf 39a87c6921 Revert "Stabilize the hostid for RPM installations."
Creating an /etc/hostid file as part of the rpm post install
causes problems for diskless systems which are sharing an image.
It is still critical to ensure the hostid doesn't change
for zfs root filesystems, but this will now be done by setting
the /etc/hostid in the initramfs created by dracut.

This reverts commit 79593b0dec.
2011-09-30 09:36:35 -07:00
Brian Behlendorf 97fd6a07c2 Fix HAVE_FS_STRUCT_SPINLOCK check for gcc-4.1.2
Older versions of gcc (gcc-4.1.2) will treat an 'incompatible
pointer type' as a warning instead of an error.  This results
in HAVE_FS_STRUCT_SPINLOCK being defined incorrectly.  This
failure mode was observed when using a RHEL6 2.6.32 based kernel
under RHEL5.5 which contains the old version of gcc.  To resolve
the issue the warning is explicitly promoted to an error.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-09-19 13:45:08 -07:00
Brian Behlendorf c064bdee95 Fix the configure CONFIG_* option detection
The latest kernels no longer define AUTOCONF_INCLUDED which was
being used to detect the new style autoconf.h kernel configure
options.  This results in the CONFIG_* checks always failing
incorrectly for newer kernels.

The fix for this is a simplification of the testing method.
Rather than attempting to explicitly include the renamed config
header, it is simpler to unconditionally include <linux/module.h>
which must pick up the correctly named header.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #320
2011-07-22 15:07:03 -07:00
Brian Behlendorf e80cd06b8e Fix 'make install' overly broad 'rm'
When running 'make install' without DESTDIR set the module install
rules would mistakenly destroy the 'modules.*' files for ALL of
your installed kernels.  This could lead to a non-functional system
for the alternate kernels because 'depmod -a' will only be run for
the kernel which was compiled against.  This issue would not impact
anyone using the 'make <deb|rpm|pkg>' build targets to build and
install packages.

The fix for this issue is to only remove extraneous build products
when DESTDIR is set.  This almost exclusively indicates we are
building packages and installing the build products into a temporary
staging location.  Additionally, limit the removal of the unneeded
build products to the target kernel version.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #328
2011-07-20 09:37:41 -07:00
Brian Behlendorf d9365224ea Prep spl-0.6.0-rc5 tag
Create the fifth 0.6.0 release candidate tag (rc5).
2011-07-01 15:23:17 -07:00
Brian Behlendorf 86fd39f354 Linux 2.6.39 compat, mutex owner
Prior to Linux 2.6.39 when CONFIG_DEBUG_MUTEXES was defined
the kernel stored a thread_info pointer as the mutex owner.
From this you could get the pointer of the current task_struct
to compare with get_current().

As of Linux 2.6.39 this behavior has changed and now the mutex
stores a pointer to the task_struct.  This commit detects the
type of pointer stored in the mutex and adjusts the mutex_owner()
and mutex_owned() functions to perform the correct comparison.
2011-06-24 13:00:08 -07:00
Darik Horn 79593b0dec Stabilize the hostid for RPM installations.
ZFS requires a stable hostid to recognize foreign pool imports,
but the hostid of a Linux system can change if the /etc/hostid
file is missing, particularly during DHCP lease updates.

Ensure that the system hostid is stable by creating the
/etc/hostid file from the output of the /usr/bin/hostid utility.
The /sbin/genhostid utility that is provided by the initscripts
package is not used because it creates a random hostid, which
breaks upgrades on systems that already have the SPL module
installed.

The external `printf` is used because the dash builtin lacks
the byte format.  Conveniences like a ${HOSTID:$ii:2} substring
range or a `sed` one-liner are similarly avoided.
2011-06-24 09:58:08 -07:00
Darik Horn 0d54dcb566 Read the /etc/hostid file directly.
Deprecate the /usr/bin/hostid call by reading the /etc/hostid file
directly. Add the spl_hostid_path parameter to override the default
/etc/hostid path.

Rename the set_hostid() function to hostid_exec() to better reflect
actual behavior and complement the new hostid_read() function.

Use HW_INVALID_HOSTID as the spl_hostid sentinel value because
zero seems to be a valid gethostid() result on Linux.
2011-06-24 09:58:03 -07:00
Brian Behlendorf bf0c60c060 Add linux compatibility tests
While the splat tests were originally designed to stress test
the Solaris primitives, I am extending them to include some kernel
compatibility tests.  Certain Linux APIs have changed frequently.
These tests ensure that added compatibility is working properly
and no unnoticed regressions have slipped in.

Test 1 and 2 add basic regression tests for shrink_icache_memory
and shrink_dcache_memory.  These are simply functional tests to
ensure we can call these functions safely.  Checking for correct
behavior is more difficult since other running processes will
influence the behavior.  However, these functions are provided
by the kernel so if we can successfully call them we assume they
are working correctly.

Test 3 checks that shrinker functions are being registered and
called correctly.  As of Linux 3.0 the shrinker API has changed
four different times so I felt the need to add a trivial test
case to ensure each variant works as expected.
2011-06-21 14:02:46 -07:00
Brian Behlendorf a55bcaad18 Linux 3.0: Shrinker compatibility
Update the wrapper macros for the memory shrinker to handle
this 4th API change.  The callback function now takes a
shrink_control structure.  This is certainly a step in the
right direction but it's annoying to have to accommodate yet
another version of the API.
2011-06-21 14:02:39 -07:00
Brian Behlendorf a32661a6c9 Avoid 'rpm -q' bug for 'make pkg'
RPM version 4.9.0 has been observed to generate extra debug
messages in certain cases.  These debug messages prevent us
from cleanly acquiring the architecture.  This is clearly
an upstream RPM bug which will get fixed.  But until then
a safe solution is to pipe the result through 'tail -1'
to just grab the architecture bit we care about.

Example 'rpm -qp spl-0.6.0-rc4.src.rpm --qf %{arch}' output:

Freeing read locks for locker 0x166: 28031/47480843735008
Freeing read locks for locker 0x168: 28031/47480843735008
x86_64
2011-06-16 11:49:38 -07:00
Brian Behlendorf 372c257233 Add TASKQ_NORECLAIM flag
It has become necessary to be able to optionally disable
direct memory reclaim for certain taskqs.  To support
this the TASKQ_NORECLAIM flag has been added which sets
the PF_MEMALLOC bit for all threads in the taskq.
2011-05-06 15:23:58 -07:00
Brian Behlendorf dde6b7b137 Prep spl-0.6.0-rc4 tag
Create the fourth 0.6.0 release candidate tag (rc4).
2011-05-03 10:31:12 -07:00
Brian Behlendorf c1f95c2b94 Correct MAXUID
The uid_t on most systems is in fact an unsigned 32-bit value.
This is almost always correct, however you could compile your
kernel to use an unsigned 16-bit value for uid_t.  In practice
I've never encountered a distribution which does this so I'm
willing to overlook this corner case for now.
2011-04-29 13:58:45 -07:00
Gunnar Beutner 9d4b7c17a0 Renamed 'struct fid' for NFS
Renamed 'struct fid' because its name conflicts with another
struct in the Linux kernel headers.  The fid_t typedef remains
unchanged intentionally.
2011-04-29 12:10:54 -07:00
Brian Behlendorf 4c16d2471a Merged pull request #40 from dajhorn/spl-proc-typos.
Correct typos in the spl proc handler.
2011-04-25 14:51:48 -07:00
Darik Horn c95b308d12 Correct typos in the spl proc handler.
Correct a format typo that causes /proc/sys/kernel/spl/hostid
to return a decimal number instead of a hexadecimal number.
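
A small sketch of how a single conversion specifier changes the reported hostid. The helper names are hypothetical and this is user-space code, not the actual proc handler.

```c
#include <stdio.h>
#include <stddef.h>

/* Formats the hostid as a decimal number (the behavior described as a typo). */
static int demo_format_hostid_dec(char *buf, size_t len, unsigned long hostid)
{
	return snprintf(buf, len, "%lu", hostid);
}

/* Formats the hostid as a hexadecimal number (the intended behavior). */
static int demo_format_hostid_hex(char *buf, size_t len, unsigned long hostid)
{
	return snprintf(buf, len, "%lx", hostid);
}
```

For a hostid of 0x12345678 the first helper yields "305419896" and the second yields "12345678".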
2011-04-24 20:56:07 -05:00
Brian Behlendorf d837ae395b Fix 32-bit MAXOFFSET_T definition
The correct definition of MAXOFFSET_T under Solaris is in reality
tied to the maximum size of a 'long long' type.  With this in mind
MAXOFFSET_T is now defined as LLONG_MAX which ensures the correct
value is used on both 32-bit and 64-bit systems.
2011-04-22 16:17:13 -07:00
Darik Horn 5b8f76ea16 Make the SPL kernel messages consistent with ZFS.
Change the SPL kernel messages for module loading and module
unloading so that they are similar to the ZFS kernel messages.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-04-21 09:41:13 -07:00
Darik Horn ad35b6a6e9 Remove the gawk dependency.
This reverts commit 1814251453.

Demote the gawk call back to awk and ensure that stderr is attached.  GNU gawk
tolerates a missing stderr handle, but many utilities do not, which could be
why a regular awk call was unexplainably failing on some systems.

Use argv[0] instead of sh_path for consistency internally and with other Linux
drivers.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-04-21 09:41:09 -07:00
Darik Horn fa6f7d8f9d Import spl_hostid as a module parameter.
Provide a call_usermodehelper() alternative by letting the hostid be passed as
a module parameter like this:

  $ modprobe spl spl_hostid=0x12345678

Internally change the spl_hostid variable to unsigned long because that is the
type that the coreutils /usr/bin/hostid returns.

Move the hostid command into GET_HOSTID_CMD for consistency with the similar
GET_KALLSYMS_ADDR_CMD invocation.

Use argv[0] instead of sh_path for consistency internally and with other Linux
drivers.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-04-21 09:41:01 -07:00
Brian Behlendorf 3dfc591ac4 Linux 2.6.39 compat, zlib_deflate_workspacesize()
The function zlib_deflate_workspacesize() now takes 2 arguments.
This was done to avoid always having to allocate the maximum size
workspace (268K).  The caller can now specify the windowBits and
memLevel compression parameters to get a smaller workspace.

For our purposes we introduce a spl_zlib_deflate_workspacesize()
wrapper which accepts both arguments.  When the two argument
version of zlib_deflate_workspacesize() is available the arguments
are passed through.  When it's not we assume the worst case and
a maximally sized workspace is used.
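
The wrapper pattern described above might look roughly like the following user-space sketch. Every demo_* name and the DEMO_HAVE_2ARG_WORKSPACESIZE macro are hypothetical stand-ins; only the 268K worst-case figure comes from the message above.

```c
#include <stddef.h>

/* Worst-case workspace size; the 268K figure is taken from the commit text. */
#define DEMO_MAX_WORKSPACE	((size_t)268 * 1024)

#ifdef DEMO_HAVE_2ARG_WORKSPACESIZE
/* Two-argument variant is available: pass the parameters straight through. */
size_t demo_workspacesize(int window_bits, int mem_level);
#define demo_spl_workspacesize(wb, ml)	demo_workspacesize((wb), (ml))
#else
/* Legacy variant: ignore the parameters and assume the worst case. */
static size_t demo_spl_workspacesize(int window_bits, int mem_level)
{
	(void)window_bits;
	(void)mem_level;
	return DEMO_MAX_WORKSPACE;
}
#endif
```

Callers always use the two-argument form, so no call site needs to know which kernel variant was detected at configure time.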
2011-04-20 14:39:15 -07:00
Brian Behlendorf b1cbc4610c Linux 2.6.39 compat, kern_path_parent()
The path_lookup() function has been renamed to kern_path_parent()
and the flags argument has been removed.  The only behavior now
offered is that of LOOKUP_PARENT.  The spl already always passed
this flag so dropping the flag does not impact us.
2011-04-20 12:30:17 -07:00
Brian Behlendorf 83c623aa1a Linux 2.6.39 compat, DEFINE_SPINLOCK()
This is a long overdue compatibility change.  Way, way, way back
in 2007 there was a push to remove all consumers of SPIN_LOCK_UNLOCKED.
Finally, in 2011 with 2.6.39 all the consumers have been updated
and SPIN_LOCK_UNLOCKED was removed.  It's about time we use the
new API as well, this change does exactly that.  DEFINE_SPINLOCK()
was available as far back as 2.6.12 so there doesn't need to be
any additional autoconf-foo for this change.
2011-04-20 12:01:11 -07:00
Brian Behlendorf 98e2afd1c5 Fix unused variable
Flagged by the default -Wunused-but-set-variable gcc option when
running under Fedora 15.  Since the warning is correct and the variable
is entirely unused, this commit removes it.
2011-04-19 09:45:36 -07:00
Brian Behlendorf 03318641af Fix gcc configure warnings
Newer versions of gcc are getting smart enough to detect the sloppy
syntax used for the autoconf tests.  It is now generating warnings
for unused/undeclared variables.  Newer versions of gcc even have
the -Wunused-but-set-variable option set by default.  This isn't a
problem except when -Werror is set and they get promoted to an error.
In this case the autoconf test will return an incorrect result which
will result in a build failure later on.

To handle this I'm tightening up many of the autoconf tests to
explicitly mark variables as unused to suppress the gcc warning.
Remember, none of the autoconf test code is ever actually run; we
just want to get a clean build error to detect which APIs are
available.  Never using a variable is absolutely fine for this.
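
As a rough illustration, the body of such a compile-only probe could be written as below. The structure and function names are hypothetical; the only real feature relied on is gcc's `__attribute__ ((unused))`.

```c
/* Hypothetical structure whose member the probe is checking for. */
struct demo_shrinker {
	int seeks;
};

static int demo_conftest(void)
{
	/*
	 * Declared only so the compiler checks that the type and member
	 * exist.  The attribute keeps -Wunused-variable and
	 * -Wunused-but-set-variable quiet, so -Werror cannot turn a
	 * harmless warning into a false "API missing" result.
	 */
	struct demo_shrinker probe __attribute__ ((unused)) = { .seeks = 0 };

	return 0;
}
```

If the member did not exist the probe would still fail to compile, which is the only signal the configure check needs.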
2011-04-19 09:41:41 -07:00
Brian Behlendorf 9b0f9079d2 Linux 2.6.39 compat, invalidate_inodes()
To resolve a potential filesystem corruption issue a second
argument was added to invalidate_inodes().  This argument controls
whether dirty inodes are dropped or treated as busy when invalidating
a super block.  When only the legacy API is available the second
argument will be dropped for compatibility.
2011-04-19 09:08:08 -07:00
Brian Behlendorf 96cdefab84 Fix rebuildable RPMs for el6/ch5
When rebuilding the source RPM under el5 you need to append the
target_cpu.  However, under el6/ch5 things are packaged correctly
and the arch is already part of kver.  For this reason it also
needs to be stripped from kver when setting kverpkg.
2011-04-08 10:20:08 -07:00
Brian Behlendorf a40c3fca6f Prep spl-0.6.0-rc3 tag
Create the third 0.6.0 release candidate tag (rc3).
2011-04-06 20:10:57 -07:00
Brian Behlendorf e76f4bf11d Add dnlc_reduce_cache() support
Provide the dnlc_reduce_cache() function which attempts to prune
cached entries from the dcache and icache.  After the entries are
pruned any slabs which they may have been using are reaped.

Note the API takes a reclaim percentage but we don't have easy
access to the total number of cache entries to calculate the
reclaim count.  However, in practice this doesn't need to be
exactly correct.  We simply need to reclaim some useful fraction
(but not all) of the cache.  The caller can determine if more
needs to be done.
2011-04-06 20:06:03 -07:00
Brian Behlendorf 83150861e6 Decrease target objects per slab
By decreasing the number of target objects per slab we increase
the likelihood that a slab can be freed.  This reduces the level
of fragmentation in the slab which has been observed to be a
problem for certain workloads.  The penalty for this is that we
also decrease the speed at which new objects can be allocated.
2011-04-06 20:06:03 -07:00
Brian Behlendorf 3336e29cc2 Add slab usage summaries to /proc
One of the most common things you want to know when looking at
the slab is how much memory is being used.  This information was
available in /proc/spl/kmem/slab but only on a per-slab basis.
This commit adds the following /proc/sys/kernel/spl/kmem/slab*
entries to make total slab usage easily available at a glance.

  slab_kmem_total - Total kmem slab size
  slab_kmem_avail - Alloc'd kmem slab size
  slab_kmem_max   - Max observed kmem slab size
  slab_vmem_total - Total vmem slab size
  slab_vmem_avail - Alloc'd vmem slab size
  slab_vmem_max   - Max observed vmem slab size

NOTE: The slab_*_max values are expected to over report because
they show maximum values since boot, not current values.
2011-04-06 20:06:03 -07:00
Brian Behlendorf d0a1038ff3 Update /proc/spl/kmem/slab output
The 'slab_fail', 'slab_create', and 'slab_destroy' columns in the slab
output have been removed because they are virtually always zero and
not very useful.

The much more useful 'size' and 'alloc' columns have been added which
show the total slab size and how much of the total size has been
allocated to objects.

Finally, the formatting has been updated to be much more human
readable while still being friendly for tools like awk to parse.
2011-04-06 20:06:03 -07:00
Brian Behlendorf 495bd532ab Linux shrinker compat
The Linux shrinker has gone through three API changes since 2.6.22.
Rather than force every caller to understand all three APIs this
change consolidates the compatibility code in to the mm-compat.h
header.  The caller can then use a single spl provided
shrinker API which does the right thing for your kernel.

SPL_SHRINKER_CALLBACK_PROTO(shrinker_callback, cb, nr_to_scan, gfp_mask);
SPL_SHRINKER_DECLARE(shrinker_struct, shrinker_callback, seeks);
spl_register_shrinker(&shrinker_struct);
spl_unregister_shrinker(&shrinker_struct);
spl_exec_shrinker(&shrinker_struct, nr_to_scan, gfp_mask);
2011-04-06 20:06:03 -07:00
Brian Behlendorf 91cb1d91a4 Add .va_dentry helper
While this extra structure memory does not exist under Solaris
it is needed under Linux to pass the dentry.  This allows the
dentry to be easily instantiated before the inode is unlocked.
2011-04-06 20:06:03 -07:00
Brian Behlendorf af67391e45 Update CHAOS 5 Packaging
The CHAOS 5 kernels are now packaged identically to the RHEL6 kernels.
Therefore we can simply use the RHEL6 rules in the spec file when
building packages.
2011-03-31 13:49:22 -07:00
Brian Behlendorf 734fcac78d Add crgetfsuid()/crgetfsgid() helpers
Solaris credentials don't have an fsuid/fsgid field but Linux
credentials do.  To handle this case the Solaris API is being
modestly extended to include the crgetfsuid()/crgetfsgid()
helper functions.

Additionally, because the crget*() helpers are implemented
identically regardless of HAVE_CRED_STRUCT they have been
moved outside the #ifdef to common code.  This simplification
means we only have one version of the helper to keep up to date.
2011-03-22 12:18:44 -07:00
Brian Behlendorf 9b0c3b2aa8 Load zlib_inflate.ko
Certain stock kernels (Debian Lenny) are built with zlib_inflate.ko
as a kernel module.  To ensure 'make check' works in-tree load this
module before loading the spl module.  This is now required for the
zlib splat regression test.
2011-03-22 12:18:44 -07:00
Brian Behlendorf 2092cf68d8 Disable vmalloc() direct reclaim
As part of vmalloc() a __pte_alloc_kernel() allocation may occur.  This
internal allocation does not honor the gfp flags passed to vmalloc().
This means even when vmalloc(GFP_NOFS) is called it is possible that a
synchronous reclaim will occur.  This reclaim can trigger file IO which
can result in a deadlock.  This issue can be avoided by explicitly
setting PF_MEMALLOC on the process to subvert synchronous reclaim when
vmalloc() is called with !__GFP_FS.

An example stack of the deadlock can be found here (1), along with the
upstream kernel bug (2), and the original bug discussion on the
linux-mm mailing list (3).  This code can be properly autoconf'ed
when the upstream bug is fixed.

1) http://github.com/behlendorf/zfs/issues/labels/Vmalloc#issue/133
2) http://bugzilla.kernel.org/show_bug.cgi?id=30702
3) http://marc.info/?l=linux-mm&m=128942194520631&w=4
2011-03-20 15:12:08 -07:00
Brian Behlendorf cb255ae572 Remove default GFP_NOFS allocations
As originally described in commit 82b8c8fa64
this was done to prevent certain deadlocks from occurring in the system.
However, as suspected the price for doing this proved to be too high.
The VM is having a hard time effectively reclaiming memory thus we are
reverting this change.

However, we still need to fundamentally handle the issue.  Under
Solaris the KM_PUSHPAGE mask is used commonly in I/O paths to ensure
memory allocations will succeed.  We leverage this fact and redefine
KM_PUSHPAGE to include GFP_NOFS.  This ensures that in these common
I/O paths we don't trigger additional reclaim.  This minimizes the
change to the Solaris code.
2011-03-19 14:50:39 -07:00
Brian Behlendorf 181a9b8998 Prep spl-0.6.0-rc2 tag
Create the second 0.6.0 release candidate tag (rc2).
2011-03-09 15:16:10 -08:00
Brian Behlendorf 6788762766 Linux 2.6.31 compat, include linux/seq_file.h
Explicitly include the linux/seq_file.h header in vfs.h.  This header
is required for the sequence handlers and is included indirectly in
newer kernels.
2011-03-07 13:52:00 -08:00
Brian Behlendorf 912fd84d13 Make Missing Modules.symvers Fatal
Detect early on in configure if the Modules.symvers file is missing.
Without this file there will be build failures later and it's best
to catch this early and provide a useful error.  In this case the
most likely problem is the kernel-devel packages are not installed.
It may also be possible that they are using an unbuilt custom kernel
in which case they must build the kernel first.
2011-03-07 13:09:01 -08:00
Brian Behlendorf 7731d46b69 Make CONFIG_PREEMPT Fatal
Until support is added for preemptible kernels detect this at
configure time and make it fatal.  Otherwise, it is possible to
have a successful build and kernel modules with flakey behavior.
2011-03-07 10:58:07 -08:00
Brian Behlendorf 47995fa691 Remove xvattr support
The xvattr support in the spl has always simply consisted of
defining a couple structures and a few #defines.  This was enough
to enable compilation of code which just passed xvattr types
around but not enough to effectively manipulate them.

This change removes even this minimal support leaving it up
to packages which leverage the spl to provide the full xvattr
support.  By removing it from the spl we ensure no conflict
with the higher level packages.

This just leaves minimal vnode support for basic manipulation
of files.  This code does have the proper support functions
in the spl and a set of regression tests.

Additionally, this change removes the unused 'caller_context_t *'
type and replaces it with a 'void *'.
2011-03-02 11:34:46 -08:00
Brian Behlendorf a4a1e1ecb4 Add TIMESPEC_OVERFLOW helper
Add the TIMESPEC_OVERFLOW helper macro to allow easy checking
of timespec overflow.
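
One plausible shape for such a macro, sketched in user space, checks whether tv_sec fits in a signed 32-bit time value. The DEMO_* names and the demo_timespec structure are hypothetical and only stand in for the real definitions, which are not shown in this log.

```c
#include <stdint.h>

/* Hypothetical timespec with a 64-bit seconds field, for illustration. */
struct demo_timespec {
	int64_t	tv_sec;
	long	tv_nsec;
};

/* Assumed bounds: the range of a signed 32-bit time value. */
#define DEMO_TIME32_MAX	INT32_MAX
#define DEMO_TIME32_MIN	INT32_MIN

/* Evaluates to non-zero when tv_sec cannot be stored in 32 bits. */
#define DEMO_TIMESPEC_OVERFLOW(ts)				\
	((ts)->tv_sec < DEMO_TIME32_MIN || (ts)->tv_sec > DEMO_TIME32_MAX)
```

A caller would typically test the macro before narrowing a timestamp and return an overflow error instead of silently truncating.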
2011-03-02 11:34:43 -08:00
Brian Behlendorf 19c1eb829d Add zlib regression test
A zlib regression test has been added to verify the correct behavior
of z_compress_level() and z_uncompress().  The test case simply takes
a 128k buffer, compresses the buffer, then uncompresses the
buffer, and finally compares the buffers after the transform.
If the buffers match then everything is fine and no data was lost.
It performs this test for all 9 zlib compression levels.
2011-02-25 16:56:46 -08:00
Brian Behlendorf 5c1967ebe2 Fix zlib compression
While portions of the code needed to support z_compress_level() and
z_uncompress() were in place, in reality the current implementation
was non-functional; it was merely compilable.

The critical missing component was to setup a workspace for the
compress/uncompress stream structures to use.  A kmem_cache was
added for the workspace area because we require a large chunk
of memory.  This avoids the need to continually alloc/free this
memory and vmap() the pages which is very slow.  Several objects
will reside in the per-cpu kmem_cache making them quick to acquire
and release.  A further optimization would be to adjust the
implementation to additionally ensure the memory is local to the cpu.
Currently that may not be the case.
2011-02-25 16:56:22 -08:00
Brian Behlendorf 5a52a782a0 Use Linux flock struct
Rather than defining our own structure, which will conflict with
Linux's version when building 32-bit, simply set up a typedef
to always use the correct Linux version for both 32 and 64-bit
builds.
2011-02-23 14:32:15 -08:00
Brian Behlendorf 914b063133 Linux compat 2.6.37, invalidate_inodes()
In the 2.6.37 kernel the function invalidate_inodes() is no longer
exported for use by modules.  This memory management functionality
is needed to invalidate the inodes attached to a super block without
unmounting the filesystem.

Because this function still exists in the kernel and the prototype
is available in a common header all we strictly need is the symbol
address.  The address is obtained using spl_kallsyms_lookup_name()
and assigned to the variable invalidate_inodes_fn.  Then a #define
is used to replace all instances of invalidate_inodes() with a
call to the acquired address.  All the complexity is hidden behind
HAVE_INVALIDATE_INODES and invalidate_inodes() can be used as usual.
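
The lookup-and-redirect technique can be sketched in user space as follows. Here demo_lookup_name() is a hypothetical stand-in for spl_kallsyms_lookup_name(), and demo_hidden_invalidate() stands in for the unexported kernel function; neither reflects the real signatures.

```c
#include <stddef.h>
#include <string.h>

typedef int (*demo_invalidate_fn_t)(void *sb);

/* Stand-in for the function that exists but is no longer exported. */
static int demo_hidden_invalidate(void *sb)
{
	(void)sb;
	return 0;
}

/* Hypothetical symbol lookup: maps a name to a function address. */
static demo_invalidate_fn_t demo_lookup_name(const char *name)
{
	if (strcmp(name, "invalidate_inodes") == 0)
		return demo_hidden_invalidate;
	return NULL;
}

/* Resolved once at initialization time. */
static demo_invalidate_fn_t invalidate_inodes_fn = NULL;

static int demo_init(void)
{
	invalidate_inodes_fn = demo_lookup_name("invalidate_inodes");
	return (invalidate_inodes_fn != NULL) ? 0 : -1;
}

/* Callers keep writing invalidate_inodes(sb) as usual. */
#define invalidate_inodes(sb)	invalidate_inodes_fn(sb)
```

The macro keeps call sites unchanged, so the indirection is visible only in the compatibility layer.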

Long term we should try to get this, or another, interface made
available to modules again.
2011-02-23 12:44:32 -08:00
Brian Behlendorf bf665d4075 Prep spl-0.6.0-rc1 tag
Create the first 0.6.0 release candidate tag (rc1).
2011-02-18 09:35:55 -08:00
Brian Behlendorf 22ccfaa8b5 Prefer /lib/modules/$(uname -r)/ links
Preferentially use the /lib/modules/$(uname -r)/source and
/lib/modules/$(uname -r)/build links.  Only if neither of these
links exists, fall back to alternate methods for deducing which
kernel to build with.  This resolves the need to manually
specify --with-linux= and --with-linux-obj= on Debian systems.
2011-02-10 14:47:08 -08:00
Brian Behlendorf 0d33908cdf Update META to 0.6.0
Roll the version forward to 0.6.0.  While no major changes
really warrant this I want to keep the version in step with
ZFS for now which is the only SPL consumer.
2011-02-07 16:42:52 -08:00
Brian Behlendorf d599e4fa79 Block in cv_destroy() on all waiters
Previously we would ASSERT in cv_destroy() if it was ever called
with active waiters.  However, I've now seen several instances in
OpenSolaris code where they do the following:

  cv_broadcast();
  cv_destroy();

This leaves no time for active waiters to be woken up and scheduled
and we trip the ASSERT.  This has not been observed to be an issue
on OpenSolaris because their cv_destroy() basically does nothing.
They still do run the risk of the memory being free'd after the
cv_destroy() and hitting a bad paging request.  But in practice
this race is so small and unlikely it either doesn't happen, or
is so unlikely when it does happen the root cause has not yet been
identified.

Rather than risk the same issue in our code this change updates
cv_destroy() to block until all waiters have been woken and
scheduled.  This may take some time because each waiter must
acquire the mutex.

This change may have an impact on performance for frequently
created and destroyed condition variables.  That however is a price
worth paying to avoid crashing your system.  If performance issues
are observed they can be addressed by the caller.
2011-02-04 14:09:08 -08:00
Brian Behlendorf 0aff071d18 Minor policy interface
Simply add the policy function wrappers.  They are completely
non-functional and always return that everything is OK, but once
again they simplify compilation of dependent packages for now.
These can/should be removed once the security policy of the
dependent application is completely understood and integrated
as appropriate with Linux.
2011-01-27 16:06:09 -08:00
Brian Behlendorf ef57fb98e4 Add missing headers
Dependent packages require the following missing headers to
simplify compilation.  The headers are basically just stubbed
out with minimal content required.
2011-01-27 16:06:09 -08:00
Brian Behlendorf 3fc97f9335 Add VSA_ACE_* and MAX_ACL_ENTRIES defines
The following flags are used to get the proper mask when getting
and setting ACLs.  I'm hopeful this can all largely go away at
some point.

We also add a define for the maximum number of ACL entries.
MAX_ACL_ENTRIES is used as the maximum number of entries for
each type.
2011-01-27 16:06:09 -08:00
Brian Behlendorf e2b25f698c Add MAXUID define
For Linux the maximum uid can vary depending on how your kernel
is built.  The Linux kernel still can be compiled with 16-bit uids
and gids, although I'm not aware of a major distribution which does
this (maybe an embedded one?).  Given that caveat it is reasonably
safe to define the MAXUID as 2147483647.
2011-01-27 16:06:09 -08:00
Brian Behlendorf 5f46a517f1 Add FIGNORECASE define
The FIGNORECASE define is now needed, place it with the
related flags.
2011-01-27 16:06:09 -08:00
Brian Behlendorf 3e5d3d3285 Add ksid_index_t and ksid_t types
Add the ksid_index_t enum and ksid_t type for use.  These types
are now used by packages which depend on the SPL.
2011-01-27 16:06:09 -08:00
Brian Behlendorf d700637207 Minimal VFS additions
This patch simply removes the place holder vfs_t type and includes
some generic Linux VFS headers.  It also makes some minor fid_t
additions for compatibility.
2011-01-27 16:06:04 -08:00
Brian Behlendorf 647fa73cf3 Remove VN_HOLD/VN_RELE/VOP_PUTPAGE
Previously these were defined to noops but rather than give
the misleading impression that these are actually implemented
I'm removing the type entirely for clarity.
2011-01-12 11:38:05 -08:00
Brian Behlendorf bd6ac72b03 Add a few additional vnode #defines
These additional constants now have users in dependant packages.
2011-01-12 11:38:05 -08:00
Brian Behlendorf a5b40eed17 Make vn_cache|vn_file_cache kmem caches
Both of these caches were previously allowed to be either a
vmem or kmem cache based on the size of the object involved.
Since we know the object won't be too large and performance is
much better for a kmem cache, they are now always kmem backed.
2011-01-12 11:38:05 -08:00
Brian Behlendorf dcd9cb5a17 Clean vattr_t and vsecattr_t types
Minor cleanup for the vattr_t and vsecattr_t types.
2011-01-12 11:38:04 -08:00
Brian Behlendorf 1b439713f1 FRSYNC Should Use O_SYNC
The Solaris FRSYNC maps most logically to the Linux O_SYNC.  There
is no O_RSYNC on Linux but this wasn't noticed until just recently.
2011-01-12 11:38:04 -08:00
Brian Behlendorf 4295b530ee Add vn_mode_to_vtype/vn_vtype_to_mode helpers
Add simple helpers to convert between a vnode->v_type and an inode->i_mode.
These should be used sparingly but they are handy to have.
2011-01-12 11:38:04 -08:00
Neependra Khare 3f688a8c38 Add cv_timedwait_interruptible() function
The cv_timedwait() function by definition must wait unconditionally
for cv_signal()/cv_broadcast() before waking.  This causes processes
to go in the D state which increases the load average.  The load
average is the summation of processes in D state and run queue.

To avoid this it can be desirable to sleep interruptibly.  These
processes do not count against the load average but may be woken by
a signal.  It is up to the caller to determine why the process
was woken; it may be for one of three reasons.

  1) cv_signal()/cv_broadcast()
  2) the timeout expired
  3) a signal was received

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-01-11 12:14:48 -08:00
Brian Behlendorf 6bf4d76f47 Linux Compat: inode->i_mutex/i_sem
Create spl_inode_lock/spl_inode_unlock compatibility macros to simplify
access to the inode mutex/sem.  This avoids the need to have to ugly
up the code with the required #define's at every call site.  At the
moment the SPL only uses this in one place but higher layers can
benefit from the macro.
2011-01-11 12:14:48 -08:00
Brian Behlendorf b7dc313837 Add Thread Specific Data (TSD) Regression Test
To validate the correct behavior of the TSD interfaces it's
important that we add a regression test.  This test is designed
to minimally exercise the fundamental TSD behavior, it does not
attempt to validate all potential corner cases.

The test will first create 32 keys via tsd_create() and register
a common destructor.  Next 16 wait threads will be created each
of which set/verify a random value for all 32 keys, then block
waiting to be released by the control thread.  Meanwhile the
control thread verifies that none of the destructors have been
run prematurely.

The next phase of the test is to create 16 exit threads which
set/verify a random value for all 32 keys.  They then immediately
exit.  This is designed to verify tsd_exit() which will be
called via thread_exit().  This must result in all registered
destructors being run and the memory for the tsd being free'd.

After this tsd_destroy() is verified by destroying all 32 keys.
Once again we must see the expected number of destructors run
and the tsd memory free'd.  At this point the blocked threads
are released and they exit calling tsd_exit() which should do
very little since all the tsd has already been destroyed.

If this all goes off without a hitch the test passes.  To ensure
no memory has been leaked, I have manually verified that after
spl module unload no memory is reported leaked.
2010-12-07 10:02:44 -08:00
Brian Behlendorf 9fe45dc1ac Add Thread Specific Data (TSD) Implementation
Thread specific data has been implemented using a hash table, this avoids
the need to add a member to the task structure and allows maximum
portability between kernels.  This implementation has been optimized
to keep the tsd_set() and tsd_get() times as small as possible.

The majority of the entries in the hash table are for specific tsd
entries.  These entries are hashed by the product of their key and
pid because by design the key and pid are guaranteed to be unique.
Their product also has the desirable property that it will be uniformly
distributed over the hash bins providing neither the pid nor key is zero.
Under linux the zero pid is always the init process and thus won't be
used, and this implementation is careful never to assign a zero key.
By default the hash table is sized to 512 bins which is expected to
be sufficient for light to moderate usage of thread specific data.

The hash table contains two additional types of entries.  The first
type of entry is called a 'key' entry and it is added to the hash during
tsd_create().  It is used to store the address of the destructor function
and it is used as an anchor point.  All tsd entries which use the same
key will be linked to this entry.  This is used during tsd_destroy() to
quickly call the destructor function for all tsd associated with the key.
The 'key' entry may be looked up with tsd_hash_search() by passing the
key you wish to lookup and DTOR_PID constant as the pid.

The second type of entry is called a 'pid' entry and it is added to the
hash the first time a process sets a key.  The 'pid' entry is also used
as an anchor and all tsd for the process will be linked to it.  This
list is used during tsd_exit() to ensure all registered destructors
are run for the process.  The 'pid' entry may be looked up with
tsd_hash_search() by passing the PID_KEY constant as the key, and
the process pid.  Note that tsd_exit() is called by thread_exit()
so if you're using the Solaris thread API you should not need to call
tsd_exit() directly.
2010-12-07 10:02:32 -08:00
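The hashing scheme described in commit 9fe45dc1ac above can be sketched in user space as follows. This is only an illustration under stated assumptions: the names `tsd_hash_bin` and `TSD_HASH_BINS` are hypothetical, and reducing the product with a plain modulo is an assumption; the real implementation may map the product to a bin differently.

```c
#include <stdint.h>

/* Hypothetical constant: the commit message states a default of 512 bins. */
#define TSD_HASH_BINS	512u

/*
 * Illustrative only: map a (key, pid) pair to a hash bin using their
 * product.  Both values are assumed to be non-zero, as the commit
 * message requires; a zero operand would send every entry to bin 0.
 */
static unsigned int
tsd_hash_bin(unsigned int key, unsigned int pid)
{
	/* Widen before multiplying so the product cannot overflow. */
	uint64_t product = (uint64_t)key * (uint64_t)pid;

	return (unsigned int)(product % TSD_HASH_BINS);
}
```

The same function would also serve for the two anchor entry types, since the message says they are looked up by passing a reserved constant in place of either the pid or the key.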
Brian Behlendorf 8beea9ac24 Refresh autogen.sh products
Refresh the autogen.sh products based on the versions which are
installed by default in the GA RHEL6.0 release.

autoconf (GNU Autoconf) 2.63
automake (GNU automake) 1.11.1
ltmain.sh (GNU libtool) 2.2.6b
2010-11-30 10:36:58 -08:00
Ricardo M. Correia c2f997b0b3 Make kmutex_t typesafe in all cases.
When HAVE_MUTEX_OWNER and CONFIG_SMP are defined, kmutex_t is just
a typedef for struct mutex.

This is generally OK but has the downside that it can allow mistakes
such as mutex_lock(&kmutex_var) to pass by unnoticed until someone
compiles the code without HAVE_MUTEX_OWNER or CONFIG_SMP (in which
case kmutex_t is a real struct). Note that the correct API to call
should have been mutex_enter() rather than mutex_lock().

We prevent these kinds of mistakes by making kmutex_t a real structure
with only one field. This makes kmutex_t typesafe and it shouldn't
have any impact on the generated assembly code.

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-11-29 11:25:32 -08:00
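The single-field wrapper described in commit c2f997b0b3 above can be sketched in user space like this. As an assumption made purely so the example builds outside the kernel, `pthread_mutex_t` stands in for `struct mutex`; the field name `m` is hypothetical.

```c
#include <pthread.h>

/*
 * A real structure with exactly one field.  Because kmutex_t is no
 * longer a plain typedef of the underlying lock, passing a kmutex_t *
 * to the native lock function is a type mismatch the compiler reports.
 */
typedef struct {
	pthread_mutex_t m;	/* stand-in for the kernel's struct mutex */
} kmutex_t;

/* The Solaris-style entry points unwrap the single field. */
static void
mutex_enter(kmutex_t *mp)
{
	pthread_mutex_lock(&mp->m);
}

static void
mutex_exit(kmutex_t *mp)
{
	pthread_mutex_unlock(&mp->m);
}
```

With this layout a call such as `pthread_mutex_lock(&kmutex_var)` draws an incompatible pointer type diagnostic, while the wrapper adds no storage beyond the wrapped lock.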
Brian Behlendorf 058de03caa Clear cv->cv_mutex when not in use
For debugging purposes the condition variables keep track of the
mutex used during a wait.  The idea is to validate that all callers
always use the same mutex.  Unfortunately, we have seen cases where
the caller reuses the condition variable with a different mutex but
in a way which is known to be safe.  My reading of the man pages
suggests you should not do this and always cv_destroy()/cv_init()
a new mutex.  However, there is overhead in doing this and it does
appear to be allowed under Solaris.

To accommodate this behavior cv_wait_common() and __cv_timedwait()
have been modified to clear the associated mutex when the last
waiter is dropped.  This ensures that while the condition variable
is in use the incorrect mutex case is detected.  It also allows the
condition variable to be safely recycled without requiring the
overhead of a cv_destroy()/cv_init() as long as it isn't currently
in use.

Finally, spin lock cv->cv_lock was removed because it is not required.
When the condition variable is used properly the caller will always
be holding the mutex so the spin lock is redundant.  The lock was
originally added because I expected to need to protect more than
just the cv->cv_mutex.  It turns out that was not the case.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-11-29 11:02:34 -08:00
Ned Bass 00ba7ef900 Give ENOTSUP a valid user space error value
The ZFS module returns ENOTSUP for several error conditions where an operation
is not (yet) supported.  The SPL defined ENOTSUP in terms of ENOTSUPP, but that
is an internal Linux kernel error code that should not be seen by user
programs.  As a result the zfs utilities print a confusing error message if an
unsupported operation is attempted:

    internal error: Unknown error 524
    Aborted

This change defines ENOTSUP in terms of EOPNOTSUPP which is consistent with
user space.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-11-10 13:25:49 -08:00
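The mapping described in commit 00ba7ef900 above can be illustrated with a small user space sketch. The `SKETCH_` names are hypothetical: the SPL redefines `ENOTSUP` itself, which a user space example cannot do without clashing with `<errno.h>`. The value 524 is taken from the error message quoted in the commit.

```c
#include <errno.h>
#include <string.h>

/* Kernel-internal code that leaked to user space (the "524" above). */
#define SKETCH_KERNEL_ENOTSUPP	524

/* After the change: expressed in terms of a code user space understands. */
#define SKETCH_ENOTSUP		EOPNOTSUPP

/* Return the C library's description of an error code. */
static const char *
sketch_describe(int err)
{
	return strerror(err);
}
```

Calling `sketch_describe(SKETCH_ENOTSUP)` yields a message the C library recognizes, whereas the kernel-internal value has no user space description.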
Brian Behlendorf 8655ce492f Linux 2.6.36 compat, use fops->unlocked_ioctl()
As of linux-2.6.36 the last in-tree consumer of fops->ioctl() has
been removed and thus fops->ioctl() has also been removed.  The
replacement hook is fops->unlocked_ioctl() which has existed in
the kernel since 2.6.12.  Since the SPL only contains support back
to 2.6.18 vintage kernels, I'm not adding an autoconf check for
this and simply moving everything to use fops->unlocked_ioctl().
2010-11-10 13:16:12 -08:00
Brian Behlendorf 9b2048c26b Linux 2.6.36 compat, fs_struct->lock type change
In the linux-2.6.36 kernel the fs_struct lock was changed from a
rwlock_t to a spinlock_t.  If the kernel would export the set_fs_pwd()
symbol by default this would not have caused us any issues, but they
don't.  So we're forced to add a new autoconf check which sets the
HAVE_FS_STRUCT_SPINLOCK define when a spinlock_t is used.  We can
then correctly use either spin_lock or write_lock in our custom
set_fs_pwd() implementation.
2010-11-09 13:29:47 -08:00
Brian Behlendorf a50cede388 Linux 2.6.36 compat, wrap RLIM64_INFINITY
As of linux-2.6.36 RLIM64_INFINITY is defined in linux/resource.h.
This is handled by conditionally defining RLIM64_INFINITY in the
SPL only when the kernel does not provide it.
2010-11-09 13:28:55 -08:00
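The conditional definition described in commit a50cede388 above follows the usual guard pattern, sketched here. The fallback value `(~0ULL)` is an assumption chosen to match the kernel's own definition; the commit message does not state it.

```c
/*
 * Only supply RLIM64_INFINITY when the included headers have not
 * already done so, which avoids a redefinition on 2.6.36 and newer.
 */
#ifndef RLIM64_INFINITY
#define RLIM64_INFINITY	(~0ULL)
#endif
```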
Brian Behlendorf 1e18307b61 Fix incorrect krw_type_t type
Flagged by the default compile options on archlinux 2010.05, we should
be using the krw_t type not the krw_type_t type in the private data.

  module/splat/splat-rwlock.c: In function ‘splat_rwlock_test4_func’:
  module/splat/splat-rwlock.c:432:6: warning: case value ‘1’ not in
  enumerated type ‘krw_type_t’
2010-11-09 10:18:01 -08:00
Brian Behlendorf c11908c75d Prep for 0.5.2 tag
Update META file to prep for 0.5.2 tag.
2010-11-05 11:52:46 -07:00
Brian Behlendorf 8294c69bb7 Clear owner after dropping mutex
It's important to clear mp->owner after calling mutex_unlock()
because when CONFIG_DEBUG_MUTEXES is defined the mutex owner
is verified in mutex_unlock().  If we set it to NULL this check
fails and the lockdep support is immediately disabled.
2010-11-05 11:52:30 -07:00
Brian Behlendorf 23aa63cbf5 Fix 2.6.35 shrinker callback API change
As of linux-2.6.35 the shrinker callback API now takes an additional
argument.  The shrinker struct is passed to the callback so that users
can embed the shrinker structure in private data and use container_of()
to access it.  This removes the need to always use global state for the
shrinker.

To handle this we add the SPL_AC_3ARGS_SHRINKER_CALLBACK autoconf
check to properly detect the API.  Then we simply setup a callback
function with the correct number of arguments.  For now we do not make
use of the new 3rd argument.
2010-10-22 14:51:26 -07:00
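The container_of() usage that the new argument enables, as described in commit 23aa63cbf5 above, can be sketched as follows. Everything here is a user space stand-in: `struct shrinker_sketch`, `struct my_cache` and `my_cache_shrink` are hypothetical, and the three-argument shape (shrinker, count, allocation mask) is an assumption modelled on the commit's description.

```c
#include <stddef.h>

/* User space equivalent of the kernel's container_of() helper. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Stand-in for the kernel's struct shrinker. */
struct shrinker_sketch {
	int seeks;
};

/* Private data with the shrinker embedded, so no global state is needed. */
struct my_cache {
	int nr_objects;
	struct shrinker_sketch shrinker;
};

/* Three-argument callback: the shrinker itself is now the first argument. */
static int
my_cache_shrink(struct shrinker_sketch *shrink, int nr_to_scan,
    unsigned int gfp_mask)
{
	/* Recover the enclosing cache from the embedded member. */
	struct my_cache *cache =
	    container_of(shrink, struct my_cache, shrinker);

	(void)gfp_mask;		/* unused in this sketch */

	if (nr_to_scan > cache->nr_objects)
		nr_to_scan = cache->nr_objects;
	cache->nr_objects -= nr_to_scan;

	return cache->nr_objects;	/* objects remaining */
}
```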
Ricardo M. Correia a68d91d770 atomic_*_*_nv() functions need to return the new value atomically.
A local variable must be used for the return value to avoid a
potential race once the spin lock is dropped.

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-09-17 16:03:25 -07:00
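The race fixed by commit a68d91d770 above, and the local-variable remedy, can be sketched in user space. As an assumption for portability, a pthread mutex stands in for the spin lock, and the lock name `atomic32_lock` is hypothetical.

```c
#include <pthread.h>
#include <stdint.h>

/* Stand-in for the spin lock that serializes the emulated atomics. */
static pthread_mutex_t atomic32_lock = PTHREAD_MUTEX_INITIALIZER;

static uint32_t
atomic_inc_32_nv(volatile uint32_t *target)
{
	uint32_t nv;

	pthread_mutex_lock(&atomic32_lock);
	nv = ++(*target);	/* capture the new value under the lock */
	pthread_mutex_unlock(&atomic32_lock);

	/*
	 * Returning *target here instead would re-read the variable after
	 * the lock is dropped and could observe another thread's update.
	 */
	return nv;
}
```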
Brian Behlendorf d5fcc5f51c Fix markdown rendering
These two lines were being rendered incorrectly on the GitHub
site.  To fix the issue there needs to be leading whitespace
before each line to ensure each command is rendered on its
own line.

$ ./configure
$ make pkg
2010-09-15 09:05:34 -07:00
Brian Behlendorf 4bc4f6d854 Reference new zfsonlinux.org website
The wiki contents have been converted to html and made available
at their new home http://zfsonlinux.org.  The wiki has also been
disabled; the html pages are now the official documentation.
2010-09-14 15:54:15 -07:00
Brian Behlendorf a7958f7eef Support custom build directories
One of the neat tricks an autoconf style project is capable of
is to allow configuring/building in a directory other than the
source directory.  The major advantage to this is that you can
build the project various different ways while making changes
in a single source tree.

For example, this project is designed to work on various different
Linux distributions each of which works slightly differently.  This
means that changes need to be verified on each of those supported
distributions preferably before the change is committed to the
public git repo.

Using nfs and custom build directories makes this much easier.
I now have a single source tree in nfs mounted on several different
systems each running a supported distribution.  When I make a
change to the source base I suspect may break things I can
concurrently build from the same source on all the systems each
in their own subdirectory.

wget -c http://github.com/downloads/behlendorf/spl/spl-x.y.z.tar.gz
tar -xzf spl-x.y.z.tar.gz
cd spl-x-y-z

------------------------- run concurrently ----------------------
<ubuntu system>  <fedora system>  <debian system>  <rhel6 system>
mkdir ubuntu     mkdir fedora     mkdir debian     mkdir rhel6
cd ubuntu        cd fedora        cd debian        cd rhel6
../configure     ../configure     ../configure     ../configure
make             make             make             make
make check       make check       make check       make check

This is something the project has almost supported for a long time
but finishing this support should save me lots of time.
2010-09-05 21:49:05 -07:00
Brian Behlendorf d8a1b73935 Remove spl-x.y.z.zip creation in 'make dist'
Do not create a spl-x.y.z.zip file as part of 'make dist'.  Simply
create the standard spl-x.y.z.tar.gz file.
2010-09-02 16:12:02 -07:00
Brian Behlendorf 73fc084e92 Move vendor check to spl-build.m4
This check was previously done with a hack in config.guess.
However, since a new config.guess is copied in to place when
forcing a full autoreconf this change was easily lost and
never a good idea.  This commit also updates all of the
autoconf style support scripts in config.
2010-09-02 16:12:02 -07:00
Brian Behlendorf 6295556b71 Prep for spl-0.5.1 tag 2010-09-01 10:24:44 -07:00
Brian Behlendorf 53be2266e1 Add quick build instructions
Full up to date build information will stay on the wiki for
now, but there is no harm in adding the bare bones instructions
to the README.  They shouldn't change and are a reasonable
quick start.
2010-09-01 10:23:05 -07:00
Brian Behlendorf 8371f981f1 Add list_link_replace() function
The list_link_replace() function will swap a new item in to the place
of an old item in a list.  It is the caller's responsibility to ensure
all lists involved are locked properly.
2010-08-27 14:23:48 -07:00
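The operation described in commit 8371f981f1 above can be sketched for a circular doubly linked list. The node type and field names are hypothetical, and clearing the old node's links afterwards is an assumption; as the commit notes, locking is left to the caller.

```c
#include <stddef.h>

typedef struct list_node {
	struct list_node *next;
	struct list_node *prev;
} list_node_t;

/*
 * Put new_node in the position currently held by old_node.  old_node
 * must be linked in to a circular list and new_node must not be.
 */
static void
list_link_replace(list_node_t *old_node, list_node_t *new_node)
{
	/* Adopt the old node's neighbours. */
	new_node->next = old_node->next;
	new_node->prev = old_node->prev;

	/* Point the neighbours at the new node. */
	old_node->prev->next = new_node;
	old_node->next->prev = new_node;

	/* Leave the old node unlinked. */
	old_node->next = NULL;
	old_node->prev = NULL;
}
```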
Brian Behlendorf d85e28ad69 Add MUTEX_NOT_HELD() function
Simply implement the missing MUTEX_NOT_HELD() function using
the !MUTEX_HELD construct.
2010-08-27 14:23:48 -07:00
Brian Behlendorf 2b3543025c Stub out kmem cache defrag API
At some point we are going to need to implement the kmem cache
move callbacks to allow for kmem cache defragmentation.  This
commit simply lays a small part of the API ground work, it does
not actually implement any of this feature.  This is safe for
now because the move callbacks are just an optimization.  Even
if they are registered we don't ever really have to call them.
2010-08-27 14:23:42 -07:00
Brian Behlendorf 8dbd3fbd5e Add missing atomic functions
These functions were not previously needed so they were not added.
Now they are so add the full set.

atomic_inc_32_nv()
atomic_dec_32_nv()
atomic_inc_64_nv()
atomic_dec_64_nv()
2010-08-27 13:02:55 -07:00
Brian Behlendorf 1db69544cc Prep for spl-0.5.0 tag 2010-08-13 09:33:50 -07:00
Li Wei 4be55565fe Fix stack overflow in vn_rdwr() due to memory reclaim
Unless __GFP_IO and __GFP_FS are removed from the file mapping gfp
mask we may enter memory reclaim during IO.  In this case shrink_slab()
entered another file system which is notoriously hungry for stack.
This additional stack usage may cause a stack overflow.  This patch
removes __GFP_IO and __GFP_FS from the mapping gfp mask of each file
during vn_open() to avoid any reclaim in the vn_rdwr() IO path.  The
original mask is then restored at vn_close() time.  Hats off to the
loop driver which does something similar for the same reason.

  [...]
  shrink_slab+0xdc/0x153
  try_to_free_pages+0x1da/0x2d7
  __alloc_pages+0x1d7/0x2da
  do_generic_mapping_read+0x2c9/0x36f
  file_read_actor+0x0/0x145
  __generic_file_aio_read+0x14f/0x19b
  generic_file_aio_read+0x34/0x39
  do_sync_read+0xc7/0x104
  vfs_read+0xcb/0x171
  :spl:vn_rdwr+0x2b8/0x402
  :zfs:vdev_file_io_start+0xad/0xe1
  [...]

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-12 09:34:33 -07:00
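The save, mask and restore sequence described in commit 4be55565fe above can be sketched with plain bit operations. All names and flag values here are hypothetical stand-ins; the kernel's real __GFP_* bits and mapping accessors differ.

```c
/* Hypothetical flag values, used only for this illustration. */
#define SK_GFP_WAIT	0x1u
#define SK_GFP_IO	0x2u
#define SK_GFP_FS	0x4u

/* Stand-in for the per-file state kept between open and close. */
struct file_sketch {
	unsigned int mapping_gfp;	/* mask used for page cache allocations */
	unsigned int saved_gfp;		/* original mask, restored on close */
};

/* At open time: remember the mask, then forbid IO and FS reclaim. */
static void
sketch_open(struct file_sketch *fp)
{
	fp->saved_gfp = fp->mapping_gfp;
	fp->mapping_gfp &= ~(SK_GFP_IO | SK_GFP_FS);
}

/* At close time: put the original mask back. */
static void
sketch_close(struct file_sketch *fp)
{
	fp->mapping_gfp = fp->saved_gfp;
}
```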
Ned Bass 46aa7b3939 Correctly handle rwsem_is_locked() behavior
A race condition in rwsem_is_locked() was fixed in Linux 2.6.33 and the fix was
backported to RHEL5 as of kernel 2.6.18-190.el5.  Details can be found here:

https://bugzilla.redhat.com/show_bug.cgi?id=526092

The race condition was fixed in the kernel by acquiring the semaphore's
wait_lock inside rwsem_is_locked().  The SPL worked around the race condition
by acquiring the wait_lock before calling that function, but with the fix in
place it must not do that.

This commit implements an autoconf test to detect whether the fixed version of
rwsem_is_locked() is present.  The previous version of rwsem_is_locked() was an
inline static function while the new version is exported as a symbol which we
can check for in module.symvers.  Depending on the result we correctly
implement the needed compatibility macros for proper spinlock handling.

Finally, we do the right thing with spin locks in RW_*_HELD() by using the
new compatibility macros.  We only acquire the semaphore's wait_lock if
it is calling a rwsem_is_locked() that does not itself try to acquire the lock.

Some new overhead and a small harmless race is introduced by this change.
This is because RW_READ_HELD() and RW_WRITE_HELD() now acquire and release
the wait_lock twice: once for the call to rwsem_is_locked() and once for
the call to rw_owner().  This can't be avoided if calling a rwsem_is_locked()
that takes the wait_lock, as it will in more recent kernels.

The other case which only occurs in legacy kernels could be optimized by
taking the lock only once, as was done prior to this commit.  However, I
decided that the performance gain probably wasn't significant enough to
justify the messy special cases required.

The function spl_rw_get_owner() was only used to enable the afore-mentioned
optimization.  Since it is no longer used, I removed it.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-10 16:43:00 -07:00
Ned Bass 5ec44a37c3 Correctly detect atomic64_cmpxchg support
The RHEL5 2.6.18-194.7.1.el5 kernel added atomic64_cmpxchg to
asm-x86_64/atomic.h.  That macro is defined in terms of cmpxchg which
is provided by asm/system.h. However, asm/system.h is not #included by
atomic.h in this kernel nor by the autoconf test for atomic64_cmpxchg, so
the test failed with "implicit declaration of function 'cmpxchg'". This
leads the build system to erroneously conclude that the kernel does not
define atomic64_cmpxchg and enable the built-in definition.  This in
turn produces a '"atomic64_cmpxchg" redefined' build warning which is fatal
when building with --enable-debug.  This commit fixes this by including
asm/system.h in the autoconf test.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-08 13:48:03 -07:00
Ricardo M. Correia 26f7245c7c Fix taskq code to not drop tasks when TQ_SLEEP is used.
When TQ_SLEEP is used, taskq_dispatch() should always succeed even if the
number of pending tasks is above tq->tq_maxalloc. This semantic is similar
to KM_SLEEP in kmem allocations, which also always succeed.

However, we cannot block forever otherwise there is a risk of deadlock.
Therefore, we still allow the number of pending tasks to go above
tq->tq_maxalloc with TQ_SLEEP, but we may sleep up to 1 second per task
dispatch, thereby throttling the task dispatch rate.

One of the existing splat tests was also augmented to test for this scenario.
The test would fail with the previous implementation but now it succeeds.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-08-02 11:20:31 -07:00
Brian Behlendorf 41f84a8d56 Strfree() should call kfree() not kmem_free()
Using kmem_free() results in deducting X bytes from the memory
accounting when --enable-debug is set.  Unfortunately, currently
the counterpart kmem_asprintf() and friends do not properly
account for memory allocated, so we must do the same on free.
If we don't then we end up with a negative number of lost bytes
reported when the module is unloaded.

A better long term fix would be to add the accounting in to the
allocation side but that's a project for another day.
2010-07-30 22:20:58 -07:00
Brian Behlendorf 099dc9c2d2 Add uninstall Makefile targets
Extend the Makefiles with an uninstall target to cleanly
remove a package which was installed with 'make install'.

Additionally, ensure a 'depmod -a' is run as part of the
install to update the module dependency information.
2010-07-28 14:55:32 -07:00
Brian Behlendorf 287b2fb117 Add Debian and Slackware style packaging via alien
The long term fix for Debian and Slackware style packaging is
to add native support for building these packages.  Unfortunately,
that is a large chunk of work I don't have time for right now.
That said it would be nice to have at least basic packages for
these distributions.

As a quick short/medium term solution I've settled on using alien
to convert the RPM packages to DEB or TGZ style packages.  The
build system has been updated with the following build targets
which will first build RPM packages and then convert them as
needed to the target package type:

  make rpm: Create .rpm packages
  make deb: Create .deb packages
  make tgz: Create .tgz packages
  make pkg: Create the right package type for your distribution

The solution comes with a lot of caveats and your mileage may vary.
But basically the big limitations are that the resulting packages:

  1) Will not have the correct dependency information.
  2) Will not not include the kernel version in the release.
  3) Will not handle all differences between distributions.

But the resulting packages should be easy to install and remove
from your system and take care of running 'depmod -a' and such.
As I said at the top this is not the right long term solution.
If any of the upstream distribution maintainers want to jump in
and help do this right for their distribution I'd love the help.
2010-07-27 15:52:34 -07:00
Brian Behlendorf 10129680f8 Ensure kmem_alloc() and vmem_alloc() never fail
The Solaris semantics for kmem_alloc() and vmem_alloc() are that they
must never fail when called with KM_SLEEP.  They may only fail if
called with KM_NOSLEEP otherwise they must block until memory is
available.  This is quite different from how the Linux memory
allocators work, under Linux a memory allocation failure is always
possible and must be dealt with.

At one point in the past the kmem code did properly implement this
behavior, however as the code evolved this behavior was overlooked
in places.  This patch goes through all three implementations of
the kmem/vmem allocation functions and ensures that they will all
block in the KM_SLEEP case when memory is not available.  They
may still fail in the KM_NOSLEEP case in which case the caller
is responsible for handling the failure.

Special care is taken in vmalloc_nofail() to avoid thrashing the
system on the virtual address space spin lock.  The down side of
course is if you do see a failure here, which is unlikely for
64-bit systems, your allocation will delay for an entire second.
Still this is preferable to locking up your system and it is the
best we can do given the constraints.

Additionally, the code was cleaned up to be much more readable
and comments were added to describe the various kmem-debug-*
configure options.  The default configure options remain:
"--enable-debug-kmem --disable-debug-kmem-tracking"
2010-07-26 15:47:55 -07:00
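The KM_SLEEP and KM_NOSLEEP semantics described in commit 10129680f8 above can be sketched as a retry loop. This is a user space illustration only: `try_alloc`, `fail_budget` and `kmem_alloc_sketch` are hypothetical, the flag values are assumptions, and the real code delays between attempts rather than retrying immediately.

```c
#include <stdlib.h>

#define KM_SLEEP	0x0	/* must not fail; keep trying */
#define KM_NOSLEEP	0x1	/* may fail; caller handles NULL */

/* Test hook: number of upcoming allocations that should fail. */
static int fail_budget;

/* Underlying allocator which, like the Linux allocators, may fail. */
static void *
try_alloc(size_t size)
{
	if (fail_budget > 0) {
		fail_budget--;
		return NULL;
	}
	return malloc(size);
}

static void *
kmem_alloc_sketch(size_t size, int flags)
{
	void *ptr;

	while ((ptr = try_alloc(size)) == NULL) {
		if (flags & KM_NOSLEEP)
			return NULL;	/* failure is allowed here */
		/* KM_SLEEP: a real implementation would back off, then retry. */
	}

	return ptr;
}
```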
Brian Behlendorf 849c50e7f2 Fix two minor compiler warnings
In cmd/splat.c there was a comparison between an __u32 and an int.  To
resolve the issue simply use a __u32 and strtoul() when converting the
provided user string.

In module/spl/spl-vnode.c we should explicitly cast nd->last.name to
a const char * which is what is expected by the prototype.
2010-07-26 10:24:26 -07:00
Brian Behlendorf 8b0eb3f0dc Remove deadcode caused by removal of format1 arg
Commit 55abb0929e removed the never
used format1 argument of spl_debug_msg().  That in turn resulted
in some deadcode which should be removed since it's now useless.
2010-07-21 16:31:42 -07:00
Ricardo M. Correia 15b52c083e Fix max_ncpus definition.
It was being defined as the constant 64 and at first I changed it to be
NR_CPUS instead.

However, NR_CPUS can be a large value on recent kernels (4096), and this
may cause too large kmem allocations to happen.

Therefore, now we use num_possible_cpus(), which should return a (typically)
small value which represents the maximum number of CPUs that can be brought
online in the running hardware (this value is determined at boot time by
arch-specific kernel code).

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-07-20 15:49:25 -07:00
Ricardo M. Correia 81672c0122 Display DEBUG keyword during module load when --enable-debug is used.
Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-07-20 15:31:03 -07:00
Ricardo M. Correia 2c762de830 Fix buggy kmem_{v}asprintf() functions
When the kvasprintf() call fails they should reset the arguments
by calling va_start()/va_copy() and va_end() inside the loop,
otherwise they'll try to read more arguments rather than starting
over and reading them from the beginning.

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-07-20 13:51:46 -07:00
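The va_copy() pattern described in commit 2c762de830 above can be sketched in user space. Note the swapped-in retry condition: the commit retries when kvasprintf() fails, whereas this sketch, as an assumption made so it can run outside the kernel, retries when the buffer was too small. The function names are hypothetical.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

static char *
vasprintf_retry(const char *fmt, va_list ap)
{
	size_t size = 8;	/* deliberately small so the retry path runs */
	char *buf = NULL;

	for (;;) {
		va_list aq;
		char *tmp;
		int len;

		tmp = realloc(buf, size);
		if (tmp == NULL) {
			free(buf);
			return NULL;
		}
		buf = tmp;

		/*
		 * Take a fresh copy on every pass.  Reusing 'ap' directly
		 * would continue reading past the arguments already consumed.
		 */
		va_copy(aq, ap);
		len = vsnprintf(buf, size, fmt, aq);
		va_end(aq);

		if (len < 0) {
			free(buf);
			return NULL;
		}
		if ((size_t)len < size)
			return buf;	/* everything fit */

		size = (size_t)len + 1;	/* grow and start over */
	}
}

static char *
asprintf_retry(const char *fmt, ...)
{
	va_list ap;
	char *str;

	va_start(ap, fmt);
	str = vasprintf_retry(fmt, ap);
	va_end(ap);

	return str;
}
```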
Ricardo M. Correia 9dd5d138b2 Fix bcopy() to allow memory area overlap
Under Solaris bcopy() allows overlapping memory areas so we
must use memmove() instead of memcpy().

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-07-20 13:48:53 -07:00
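The substitution described in commit 9dd5d138b2 above amounts to the following wrapper, shown as a user space sketch. The name `bcopy_sketch` is hypothetical; note that bcopy() takes the source first, the reverse of memmove().

```c
#include <string.h>

/* bcopy() semantics: source first, overlapping regions permitted. */
static void
bcopy_sketch(const void *src, void *dest, size_t size)
{
	/* memmove() is defined for overlapping areas, memcpy() is not. */
	memmove(dest, src, size);
}
```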
Ricardo M. Correia 22cd0f19b1 Fix compilation error due to undefined ACCESS_ONCE macro.
When CONFIG_DEBUG_MUTEXES is turned on in RHEL5's kernel config, the mutexes
store the owner for debugging purposes, therefore the SPL will enable
HAVE_MUTEX_OWNER. However, the SPL code uses ACCESS_ONCE() to access the
owner, and this macro is not defined in the RHEL5 kernel, therefore we define it
ourselves in include/linux/compiler_compat.h.

Signed-off-by: Ricardo M. Correia <ricardo.correia@oracle.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-07-20 13:47:52 -07:00
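A compatibility definition of the kind described in commit 22cd0f19b1 above could look like the sketch below. The exact text in include/linux/compiler_compat.h is not shown in the commit message, so this form, which uses the GNU `__typeof__` extension, is an assumption.

```c
/*
 * Provide ACCESS_ONCE() only when it is missing.  The volatile cast
 * forces the compiler to perform exactly one access to the object.
 */
#ifndef ACCESS_ONCE
#define ACCESS_ONCE(x)	(*(volatile __typeof__(x) *)&(x))
#endif
```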
Brian Behlendorf b17edc10a9 Prefix all SPL debug macros with 'S'
To avoid conflicts with symbols defined by dependent packages
all debugging symbols have been prefixed with a 'S' for SPL.
Any dependent package needing to integrate with the SPL debug
should include the spl-debug.h header and use the 'S' prefixed
macros.  They must also build with DEBUG defined.
2010-07-20 13:30:40 -07:00
Brian Behlendorf 55abb0929e Split <sys/debug.h> header
To avoid symbol conflicts with dependent packages the debug
header must be split in to several parts.  The <sys/debug.h>
header now only contains the Solaris macro's such as ASSERT
and VERIFY.  The spl-debug.h header contains the spl specific
debugging infrastructure and should be included by any package
which needs to use the spl logging.  Finally the spl-trace.h
header contains internal data structures only used for the log
facility and should not be included by anything but spl-debug.c.

This way dependent packages can include the standard Solaris
headers without picking up any SPL debug macros.  However, if
the dependent package wants to integrate with the SPL debugging
subsystem they can then explicitly include spl-debug.h.

Along with this change I have dropped the CHECK_STACK macros
because the upstream Linux kernel now has much better stack
depth checking built in and we don't need this complexity.

Additionally SBUG has been replaced with PANIC and provided as
part of the Solaris macro set.  While the Solaris version is
really panic() that conflicts with the Linux kernel so we'll
just have to make do with PANIC.  It should rarely be called
directly, the preferred usage would be an ASSERT or VERIFY.

There's lots of change here but this cleanup was overdue.
2010-07-20 13:29:35 -07:00
Ned Bass 8f813bb168 Proposed fix for oops on SIGINT in splat atomic:64-bit test.
The threads in the splat atomic:64-bit test share the data structure
atomic_priv_t ap, which lives on the kernel stack of the splat user-space
utility.  If splat terminates before the threads, accesses to that memory
location by the other threads become invalid.  Splat synchronizes with
the threads with the call:

wait_event_interruptible(ap.ap_waitq, splat_atomic_test1_cond(&ap, i));

Apparently, the SIGINT wakes and terminates splat prematurely, so that
GPFs or other bad things happen when the threads subsequently access ap.
This commit prevents this by using the uninterruptible form:

wait_event(ap.ap_waitq, splat_atomic_test1_cond(&ap, i));
2010-07-15 12:50:15 -07:00
Brian Behlendorf d0bd694ca9 Fix -Werror=format-security compiler option
Noticed under Ubuntu kernel builds we should be passing a
format specifier and the string, not just the string.
2010-07-14 11:53:57 -07:00
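The fix described in commit d0bd694ca9 above follows a general pattern, sketched here in user space. The helper name `log_msg` is hypothetical.

```c
#include <stdio.h>

/*
 * Pass the message as an argument to a constant "%s" format rather
 * than as the format itself, so any '%' in it is printed literally.
 */
static int
log_msg(char *buf, size_t len, const char *msg)
{
	return snprintf(buf, len, "%s", msg);
}
```

Writing `snprintf(buf, len, msg)` instead is what -Werror=format-security rejects, because a '%' inside msg would be interpreted as a conversion.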
Brian Behlendorf f0ff89fc86 Linux 2.6.35 compat: filp_fsync() dropped 'struct dentry *'
The prototype for filp_fsync() dropped the unused argument 'struct dentry *'.
I've fixed this by adding the needed autoconf check and moving all of
those filp related functions to file_compat.h.  This will simplify
handling any further API changes in the future.
2010-07-14 11:40:55 -07:00
Brian Behlendorf 82b8c8fa64 Proposed fix for low memory ZFS deadlocks
Deadlocks in the zvol were observed when one of the ZFS threads
performing IO tries to allocate memory while the system is low
on memory.  The low memory condition causes dirty pages to be
synced to the zvol but this can't progress because the original
thread is blocked waiting on a memory allocation.  Thus we end
up deadlocking.

A proper solution proposed by Wizeman is to change KM_SLEEP from
GFP_KERNEL to GFP_NOFS.  This will prevent the memory allocation
which is trying to allocate memory from forcing a sync to the
zvol in shrink_page_list()->pageout().

The down side to all of this is that we are using a pretty big
hammer by changing KM_SLEEP.  This change means ALL of the zfs
memory allocations will be unable to trigger dirty data to be
synced.  The caller still should be able to reclaim memory from
the various slab caches.  We will be totally dependent on other
kernel processes which happen to be running and a small number
of asynchronous reclaim threads to trigger the reclaim of dirty
data pages.  This should be OK but I think we may see some
slightly longer allocation times when under memory pressure.

We shall see.
2010-07-13 21:30:56 -07:00
Brian Behlendorf a4bfd8ea1b Add __divdi3(), remove __udivdi3() kernel dependency
Up until now no SPL consumer attempted to perform signed 64-bit
division so there was no need to support this.  That has now
changed so I am adding 64-bit division support for 32-bit platforms.
The signed implementation is based on the unsigned version.

Since there have been several bug reports in the past concerning
correct 64-bit division on 32-bit platforms I added some long
overdue regression tests.  Much to my surprise the unsigned
64-bit division regression tests failed.

This was surprising because __udivdi3() was implemented by simply
calling div64_u64() which is provided by the kernel.  This meant
that the linux kernel's 64-bit division algorithm on 32-bit platforms
was flawed.  After some investigation this turned out to be exactly
the case.

Because of this I was forced to abandon the kernel helper and
instead to fully implement 64-bit division in the spl.  There are
several published implementations out there on how to do this
properly and I settled on one proposed in the book Hacker's Delight.
Their proposed algorithm is freely available without restriction
and I have just modified it to be linux kernel friendly.

The updated implementation now passes all the unsigned and signed
regression tests.  This should be functional, but not fast, which is
good enough for our purposes.  If you want fast too I'd strongly
suggest you upgrade to a 64-bit platform.  I have also reported the
kernel bug and we'll see if we can't get it fixed up stream.
2010-07-13 16:44:02 -07:00
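How a signed 64-bit division can be layered on an unsigned one, as commit a4bfd8ea1b above describes, is sketched below. The unsigned routine is a plain shift-and-subtract loop used as a stand-in; it is not the Hacker's Delight algorithm the commit adopted. Both function names are hypothetical, and a zero divisor is the caller's responsibility to avoid.

```c
#include <stdint.h>

/* Stand-in unsigned division: one quotient bit per iteration. */
static uint64_t
udiv64_sketch(uint64_t n, uint64_t d)
{
	uint64_t q = 0, r = 0;
	int i;

	for (i = 63; i >= 0; i--) {
		/* Bring down the next bit of the dividend. */
		r = (r << 1) | ((n >> i) & 1);
		if (r >= d) {
			r -= d;
			q |= (uint64_t)1 << i;
		}
	}

	return q;
}

/* Signed division expressed in terms of the unsigned version. */
static int64_t
sdiv64_sketch(int64_t n, int64_t d)
{
	/* The quotient is negative when exactly one operand is negative. */
	int neg = (n < 0) != (d < 0);

	/* Take magnitudes in unsigned arithmetic to avoid signed overflow. */
	uint64_t un = (n < 0) ? 0 - (uint64_t)n : (uint64_t)n;
	uint64_t ud = (d < 0) ? 0 - (uint64_t)d : (uint64_t)d;
	uint64_t q = udiv64_sketch(un, ud);

	return neg ? (int64_t)(0 - q) : (int64_t)q;
}
```

Like C's own division, the result truncates toward zero, so -7 divided by 2 gives -3.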
Brian Behlendorf d466208f1e Update config.guess to recognize additional distros
The following distros were added: redhat, fedora, debian,
ubuntu, sles, slackware, and gentoo.
2010-07-02 14:48:27 -07:00
Lars Johannsen dbe561d8ab Allow config/build to work with autoconf-2.65
As of autoconf-2.65 the AC_LANG_SOURCE macro no longer
includes the confdefs.h results when expanded.  To handle this
simply explicitly include confdefs.h in conftest.c.  This will
cause two copies of confdefs.h to be added to the test for
earlier autoconf versions but this is not harmful.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-07-02 14:00:28 -07:00
Brian Behlendorf 1814251453 Require gawk the usermode helper fails with awk
For some reason when awk is invoked by the usermode helper the command
always fails.  Interestingly gawk does not suffer from this problem
which is why I never observed this failure since the distros I tested
with all had gawk installed instead of awk.  Anyway, the simplest
thing to do here is to just make gawk mandatory.  I've added a
configure check for gawk specifically and have updated the command
to call gawk not awk.
2010-07-01 16:38:08 -07:00
Brian Behlendorf 7119bf7044 Add configure check for user_path_dir()
I didn't notice at the time but user_path_dir() was not introduced
at the same time as the set_fs_pwd() change.  I had lumped the two
together but in fact user_path_dir() was introduced in 2.6.27 and
set_fs_pwd() taking 2 args was introduced in 2.6.25.  This means
builds against 2.6.25-2.6.26 kernels were broken.

To fix this I've added a check for user_path_dir() and no longer
assume that if set_fs_pwd() takes 2 args then user_path_dir() is
also available.
2010-07-01 13:53:26 -07:00
Brian Behlendorf e2d28a3743 Use $target_cpu instead of arch
We should not be using arch for a few reasons.  First off it might
not be installed on their system, and secondly they may be trying
to cross-compile.
2010-07-01 13:52:46 -07:00
Brian Behlendorf 8fd4e3af2e Check sourcelink is set before passing to readlink
When no source was found in any of the expected paths treat
this as fatal and provide the user with a hint as to what
they should do.
2010-07-01 13:52:04 -07:00
Ned Bass 55f10ae5e9 Implementation of a regression test for TQ_FRONT.
Use 3 threads and 8 tasks.  Dispatch the final 3 tasks with TQ_FRONT.
The first three tasks keep the worker threads busy while we stuff the
queues.  Use msleep() to force a known execution order, assuming
TQ_FRONT is properly honored.  Verify that the expected completion
order occurs.

The splat_taskq_test5_order() function may be useful in more than
one test.  This commit generalizes it by renaming the function to
splat_taskq_test_order() and adding a name argument instead of
assuming SPLAT_TASKQ_TEST5_NAME as the test name.

The documentation for splat taskq regression test #5 swaps the two required
completion orders in the diagram.  This commit corrects the error.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-07-01 10:59:52 -07:00
Ned Bass 1a73940d39 Initialize the /dev/splatctl device buffer
On open(), initialize the buffer with the SPL version string.  The
user space splat utility expects to find the SPL version string when
it opens and reads from /dev/splatctl.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-07-01 10:59:46 -07:00
Ned Bass f0d8bb26b4 Implementation of the TQ_FRONT flag.
Adds a task queue to receive tasks dispatched with TQ_FRONT.  Worker
threads pull tasks from this high priority queue before the default
pending queue.

Executing tasks out of FIFO order potentially breaks taskq_lowest_id()
if we do not preserve the ordering of the work list by taskqid.
Therefore, instead of always appending to the work list, we search for
the appropriate place to insert a task.  The common case is to append
to the list, so we make this operation efficient by searching the work
list in reverse order.
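
The reverse-order search described above can be illustrated with a small
user space sketch.  The work_item and work_list types below are
hypothetical stand-ins, not the SPL's actual taskq structures; only the
idea of scanning from the tail to keep the list sorted by id is taken
from the commit message.

```c
#include <stddef.h>

/* Hypothetical stand-in for a work list entry; only the id matters. */
struct work_item {
    unsigned long id;
    struct work_item *prev;
    struct work_item *next;
};

struct work_list {
    struct work_item *head;
    struct work_item *tail;
};

/*
 * Insert 'item' so the list stays sorted by ascending id.  The scan
 * starts at the tail, so the common case (the new id is the largest)
 * stops after a single comparison.
 */
static void work_list_insert(struct work_list *wl, struct work_item *item)
{
    struct work_item *pos = wl->tail;

    /* Walk backwards past every entry with a larger id. */
    while (pos != NULL && pos->id > item->id)
        pos = pos->prev;

    item->prev = pos;
    if (pos == NULL) {
        /* Smallest id so far: becomes the new head. */
        item->next = wl->head;
        wl->head = item;
    } else {
        item->next = pos->next;
        pos->next = item;
    }

    if (item->next != NULL)
        item->next->prev = item;
    else
        wl->tail = item;
}
```

With the list kept sorted this way, the lowest outstanding id is always
at the head of the list.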

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-07-01 10:59:38 -07:00
Brian Behlendorf c2688979a4 Remove AC_DEFINE for DEBUG/NDEBUG
Whoops, I momentarily forgot I had explicitly set these as CC
options so dependent packages which need to include spl_config.h
would not end up having these defined which can result in
accidentally having debug enabled at best, or a build failure
at worst.
2010-07-01 09:40:29 -07:00
Brian Behlendorf c950d1480d Only make compiler warnings fatal with --enable-debug
While in theory I like the idea of compiler warnings always being
fatal, in practice this causes problems when small harmless errors
cause build failures for end users.  To handle this I've updated
the build system such that -Werror is only used when --enable-debug
is passed to configure.  This is how I always build when developing
so I'll catch all build warnings and end users will not get stuck
by minor issues.
2010-06-30 17:05:36 -07:00
Brian Behlendorf 6801b7154c Linux-2.6.33 compat, O_DSYNC flag added
Prior to linux-2.6.33 only O_DSYNC semantics were implemented and
they used the O_SYNC flag.  As of linux-2.6.33 this behavior was
properly split into O_SYNC and O_DSYNC respectively.
2010-06-30 12:49:39 -07:00
Brian Behlendorf 79a3bf130b Linux-2.6.33 compat, .ctl_name removed from struct ctl_table
As of linux-2.6.33 the ctl_name member of the ctl_table struct
has been entirely removed.  The upstream code has been updated
to depend entirely on the procname member.  To handle this
all references to ctl_name are wrapped in a CTL_NAME macro which
simply expands to nothing for newer kernels.  Older kernels are
supported by having it expand to .ctl_name = X just as before.
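
A minimal sketch of the wrapping described above, compilable in user
space.  The HAVE_CTL_NAME symbol, the ctl_table_sketch structure, and
the "hostid" entry are assumptions made for illustration; the real
struct ctl_table has more members, and here the trailing comma is
folded into the macro so it can vanish cleanly on newer kernels.

```c
#include <stddef.h>

/* Stand-in for the kernel's struct ctl_table (assumed, simplified). */
struct ctl_table_sketch {
#ifdef HAVE_CTL_NAME                /* hypothetical configure result */
    int ctl_name;
#endif
    const char *procname;
};

#ifdef HAVE_CTL_NAME
/* Older kernels: emit the designated initializer. */
#define CTL_NAME(cname)     .ctl_name = (cname),
#else
/* Newer kernels: the member is gone, so expand to nothing. */
#define CTL_NAME(cname)
#endif

static struct ctl_table_sketch example_table[] = {
    {
        CTL_NAME(1)
        .procname = "hostid",       /* illustrative entry only */
    },
    { .procname = NULL },           /* terminator */
};
```

The same table definition then builds in both configurations without
any #ifdef at the call site.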
2010-06-30 12:49:12 -07:00
Brian Behlendorf fd921c2e0c Linux-2.6.33 compat, check <generated/utsrelease.h> for UTS_RELEASE
It seems the upstream community moved the definition of UTS_RELEASE
yet again as of linux-2.6.33.  Update the build system to check in
all three possible locations where your kernel version may be defined.

	$kernelbuild/include/linux/version.h
	$kernelbuild/include/linux/utsrelease.h
	$kernelbuild/include/generated/utsrelease.h
2010-06-30 12:48:18 -07:00
Brian Behlendorf 1e48754059 Add basic README
A simple README with a short summary of the project and a link
directing people to the online documentation.
2010-06-29 14:18:18 -07:00
Brian Behlendorf ede0bdffb6 Treat mutex->owner as volatile
When HAVE_MUTEX_OWNER is defined and we are directly accessing
mutex->owner treat it as volatile with the ACCESS_ONCE() helper.
Without this you may get a stale cached value when accessing it
from different cpus.  This can result in incorrect behavior from
mutex_owned() and mutex_owner().  This is not a problem for the
!HAVE_MUTEX_OWNER case because in this case all the accesses
are covered by a spin lock which similarly guarantees we will
not be accessing stale data.

Secondly, check CONFIG_SMP before allowing access to mutex->owner.
I see that for non-SMP setups the kernel does not track the owner
so we cannot rely on it.

Thirdly, check CONFIG_MUTEX_DEBUG.  When this is defined and
HAVE_MUTEX_OWNER is defined, surprisingly the mutex->owner will
not be cleared on mutex_exit().  When this is the case the SPL
needs to make sure to do it to ensure MUTEX_HELD() behaves as
expected or you will certainly assert in mutex_destroy().

Finally, improve the mutex regression tests.  For mutex_owned() we
now minimally check that it behaves correctly when checked from the
owner thread or the non-owner thread.  This subtle behaviour has bit
me before and I'd like to catch it early next time if it reappears.

As for the mutex_owned() regression test, additionally verify that
mutex->owner is always cleared on mutex_exit().
2010-06-28 16:02:57 -07:00
Brian Behlendorf 616df2dd8b Fix subtle race in threads test case
The call to wake_up() must be moved under the spin lock because
once we drop the lock 'tp' may no longer be valid because the
creating thread has exited.  The basic thread implementation
was correct, this was simply a flaw in the test case.
2010-06-28 12:34:20 -07:00
Brian Behlendorf 5be4767ae1 Accept but ignore TASKQ_DC_BATCH and TQ_FRONT
For the moment the SPL accepts the TASKQ_DC_BATCH and TQ_FRONT
flags however they get silently ignored.  This is harmless for
the moment but it does need to be implemented at some point.
2010-06-28 11:39:43 -07:00
Brian Behlendorf e6de04b73c Add kmem_vasprintf function
We might as well have both asprintf() variants.  This allows us
to safely pass a va_list through several levels of the stack
using va_copy() instead of va_start().
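
As a user space illustration of why va_copy() matters here, the sketch
below builds both variants on top of vsnprintf() and malloc().  The
sketch_* names are hypothetical and this is not the SPL implementation;
it only shows the pattern of copying a va_list before each traversal so
the caller's list can be passed down safely.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

/* Format into a freshly allocated buffer from an existing va_list. */
static char *sketch_vasprintf(const char *fmt, va_list ap)
{
    va_list aq;
    char *buf;
    int len;

    /* First pass: measure, using a copy so 'ap' is left untouched. */
    va_copy(aq, ap);
    len = vsnprintf(NULL, 0, fmt, aq);
    va_end(aq);
    if (len < 0)
        return NULL;

    buf = malloc((size_t)len + 1);
    if (buf == NULL)
        return NULL;

    /* Second pass: format, again from a fresh copy. */
    va_copy(aq, ap);
    vsnprintf(buf, (size_t)len + 1, fmt, aq);
    va_end(aq);

    return buf;
}

/* The variadic variant is a thin wrapper around the va_list one. */
static char *sketch_asprintf(const char *fmt, ...)
{
    va_list ap;
    char *str;

    va_start(ap, fmt);
    str = sketch_vasprintf(fmt, ap);
    va_end(ap);

    return str;
}
```

The caller owns the returned string and releases it with free() in this
sketch.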
2010-06-24 09:41:59 -07:00
Brian Behlendorf 438683c0a9 Revert "Support TQ_FRONT flag used by taskq_dispatch()"
This reverts commit eb12b3782c.
2010-06-21 10:19:44 -07:00
Brian Behlendorf 3cb77549d1 Update warnings in kmem debug code
This fix was long overdue.  Most of the ground work was laid long
ago to include the exact function and line number in the error message
when there was an issue with a memory allocation call.  However,
probably due to lack of time at the moment that information never
made it into the error message.  This patch fixes that and tries
to standardize the kmem debug messages as well.
2010-06-16 16:01:16 -07:00
Brian Behlendorf 8ffef449ef Add missing header util/sscanf.h 2010-06-14 14:20:31 -07:00
Brian Behlendorf def465ad4b Include kstat.h from kmem.h
It turns out Solaris incidentally includes kstat.h from kmem.h.  As
a side effect of this certain higher level .c files which should
explicitly include kstat.h don't because they happen to get it
via kmem.h.  To make life easier for everyone I do the same.
2010-06-14 14:18:48 -07:00
Brian Behlendorf eb12b3782c Support TQ_FRONT flag used by taskq_dispatch()
Allow taskq_dispatch() to insert work items at the head of the
queue instead of just the tail by passing the TQ_FRONT flag.
2010-06-11 15:57:25 -07:00
Brian Behlendorf 32c6147dee Minor cleanup and Solaris API additions.
Minor formatting cleanups.

API additions:
* {U}INT8_{MIN,MAX}, {U}INT16_{MIN,MAX} macros.
* id_t typedef
* ddi_get_lbolt(), ddi_get_lbolt64() functions.
2010-06-11 15:57:25 -07:00
Brian Behlendorf b868e22f05 Add kmem_asprintf(), strfree(), strdup(), and minor cleanup.
This patch adds three missing Solaris functions: kmem_asprintf(), strfree(),
and strdup().  They are all implemented as a thin layer which just calls
their Linux counterparts.  As part of this an autoconf check for kvasprintf
was added because it does not appear in older kernels.  If the kernel does
not provide it then spl-generic implements it.

Additionally the dead DEBUG_KMEM_UNIMPLEMENTED code was removed to clean
things up and make the kmem.h a little more readable.
2010-06-11 15:57:25 -07:00
Brian Behlendorf bb1bb2c4c4 Add xuio_* structures and typedefs.
Add the basic xuio structure and typedefs for Solaris style zero copy.
There's a decent chance this will not be the way I handle this on Linux
but providing the basic types simplifies things for now.
2010-06-11 15:57:25 -07:00
Brian Behlendorf 750a7101f8 Stub out additional missing headers 2010-06-11 15:57:25 -07:00
Brian Behlendorf ae4c36adce Cleanly split Linux proc.h (fs) from conflicting Solaris proc.h (process)
Under Linux the proc.h header is for the /proc filesystem, and under
Solaris the proc.h header is for processes.  This patch correctly
moves the Linux proc functionality into a linux/proc_compat.h header
and leaves the sys/proc.h for use by Solaris.  Minor updates were
required to all the call sites where it was included of course.
2010-06-11 15:57:25 -07:00
Brian Behlendorf 71b1242e67 Update META to version 0.5.0 2010-06-11 15:57:25 -07:00
Alex Zhuravlev 1b4ad25e2f Stack overflow on 64-bit modulus operations on 32-bit architectures.
Running 'zpool create' on a 32-bit machine with an SPL compiled with
gcc 4.4.4 led to a stack overflow.  This turned out to be due to some
sort of 'optimization' by gcc:

uint64_t __umoddi3(uint64_t dividend, uint64_t divisor)
{
   return dividend - divisor * (dividend / divisor);
}

This code was supposed to be using __udivdi3 to implement /, but gcc
instead implemented it via __umoddi3 itself.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-06-03 09:06:55 -07:00
Brian Behlendorf 8a1c9a02fb Minor 32-bit fix cast to hrtime_t before the multiply.
It's important to cast to hrtime_t before doing the multiply because
the ts.tv_sec type is only 32-bits and we need to promote it to 64-bits.
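
The effect of the cast can be seen in a small user space sketch.  The
function names are hypothetical, hrtime_t is assumed to be a 64-bit
signed type, and uint32_t stands in for the 32-bit seconds field so the
wrap-around is well defined in C.

```c
#include <stdint.h>

typedef int64_t hrtime_t;               /* assumed 64-bit stand-in */

#define SKETCH_NSEC_PER_SEC 1000000000u

/* Wrong: the multiply is done in 32 bits and wraps before widening. */
static hrtime_t to_ns_narrow(uint32_t sec, uint32_t nsec)
{
    return (hrtime_t)(sec * SKETCH_NSEC_PER_SEC + nsec);
}

/* Right: widen first so the multiply is done in 64 bits. */
static hrtime_t to_ns_wide(uint32_t sec, uint32_t nsec)
{
    return ((hrtime_t)sec * SKETCH_NSEC_PER_SEC) + (hrtime_t)nsec;
}
```

Any seconds value above 4 already exceeds what 32 bits can hold once
multiplied by one billion, so the narrow version goes wrong almost
immediately.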
2010-05-23 09:51:17 -07:00
Brian Behlendorf 49638d8388 Refresh autogen.sh products with automake 1.11.1. 2010-05-21 15:52:06 -07:00
Brian Behlendorf 3cca28a785 Re-Prep for 0.4.9 tag with a few more fixes and updated ChangeLog 2010-05-21 14:17:44 -07:00
Brian Behlendorf edbbb609bd Minor spec file cleanup for RHEL6 package dependency. 2010-05-21 11:53:49 -07:00
Brian Behlendorf 32f5faff69 Simplify rwlock implementation.
Remove RW_COUNT() from the rwlock implementation.  The idea was that it
could be used as a generic wrapper for getting at the internal state
of a rwlock.  While a good idea it's proven problematic to keep it
correct for multiple archs and internal implementation changes.  In
short it hasn't been worth the trouble.

With that and simplicity in mind things have been updated to use the
rwsem_is_locked() function instead of RW_COUNT for the RW_*_HELD()
functions.  As for rw_upgrade() it remains only implemented for
the generic rwsem implementation.  It remains to be determined if it's
worth the effort of adding a custom implementation for each arch.
2010-05-20 14:20:34 -07:00
Brian Behlendorf 23d91792ef Use KM_NODEBUG macro in preference to __GFP_NOWARN. 2010-05-20 14:16:59 -07:00
Brian Behlendorf 3626ae6a70 Disable spl_debug_panic_on_bug by default.
While I may prefer to have the system panic on an SBUG and to get a
crash dump for analysis, I suspect most people's systems are not
configured for crash dumps and the best thing to do is to simply
halt the thread and print an error to the console.  This way they
have a good chance of actually saving the stack trace and debug log.
2010-05-20 10:15:51 -07:00
Brian Behlendorf e0dcb22e4e Adjust 'large' object sizes in kmem:slab_large test.
64K objects are large for a kmem based slab (2M slabs)
1M objects are large for a vmem based slab (32M slabs)
2010-05-20 09:52:37 -07:00
Brian Behlendorf 5198ea0e71 Remove kmem_set_warning() interface replace with __GFP_NOWARN flag.
Replace the kmem_set_warning() hack used by the kmem-splat regression
tests with a per-allocation flag called __GFP_NOWARN.  This matches
the lower level linux flag of similar but slightly different function.
The idea is you can then explicitly set this flag on requests where
you know you're breaking the max 8k rule but you need/want to do it
anyway.

This is currently used by the regression tests where we intentionally
push things to the limit but don't want the log noise.  Additionally,
we are forced to use it in spl_kmem_cache_create() because by default
NR_CPUS is very large and there's no easy way to handle that.

Finally, I've added a stack_dump() call to the warning when it is
triggered to make it clear exactly where the allocation is taking place.
2010-05-19 16:53:13 -07:00
Brian Behlendorf 627a74972c Set default debug log path to /tmp/spl-log.
Using /tmp/ is a preferable default, it can always be overridden
using the module option on a case-by-case basis.

Additionally standardize some log messages based on the same
default log level used by the kernel.
2010-05-19 16:17:06 -07:00
Brian Behlendorf 99879b257c Minor spec file cleanup for srpm case.
Ensure kdevpkg is defined in the srpm case before using it to define
the devel_requires macro.  Interestingly this is not an issue for
rpm-4.7.1-4 but it is for rpm-4.4.2.3-18.
2010-05-18 09:18:20 -07:00
Brian Behlendorf de7cc34821 Prep for 0.4.9 tag, updated META and ChangeLog 2010-05-17 15:47:24 -07:00
Brian Behlendorf 716154c592 Public Release Prep
Updated AUTHORS, COPYING, DISCLAIMER, and INSTALL files.  Added
standardized headers to all source files to clearly indicate the
copyright, license, and to give credit where credit is due.
2010-05-17 15:18:00 -07:00
Brian Behlendorf 8e2140b770 Add 3 missing typedefs.
Add processorid_t, pc_t, index_t.
2010-05-14 09:42:53 -07:00
Brian Behlendorf a76df2dc0f Add console_*printf() functions.
Add support for the missing console_vprintf() and console_printf()
functions.
2010-05-14 09:40:52 -07:00
Brian Behlendorf 6020190e8f Use do_posix_clock_monotonic_gettime() as described by comment.
While this does incur slightly more overhead we should be using
do_posix_clock_monotonic_gettime() for gethrtime() as described
by the existing comment.
2010-05-14 09:31:22 -07:00
Brian Behlendorf f752b46eb3 Add cv_wait_interruptible() function.
This is a minor extension to the condition variable API to allow
for reasonable signal handling on Linux.  The cv_wait() function by
definition must wait unconditionally for cv_signal()/cv_broadcast()
before waking it.  This makes it impossible to be woken by a signal
such as SIGTERM.  The cv_wait_interruptible() function was added
to handle this case.  It behaves identically to cv_wait() with the
exception that it waits interruptibly allowing a signal to wake it
up.  This means you do need to be careful and check issig() after
waking.
2010-05-14 09:24:51 -07:00
Brian Behlendorf 97f8f6d789 Dump log from current process when required
When dumping a debug log first check that it is safe to create
a new thread and block waiting for it.  If we are in an atomic
context or irqs are disabled it is not safe to sleep and we
must write out the debug log from the current process.
2010-04-23 15:55:02 -07:00
Brian Behlendorf d05ec4b45f Assume TQ_SLEEP when not explicitly specified. 2010-04-23 14:39:47 -07:00
Ricardo Correia 663e02a135 Handle the FAPPEND option in vn_rdwr().
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2010-04-23 14:39:42 -07:00
Brian Behlendorf 82a358d9c0 Update vn_set_pwd() to allow user|kernel address for filename
During module init spl_setup()->vn_set_pwd("/") was failing
with -EFAULT because user_path_dir() and __user_walk() both
expect 'filename' to be a user space address and it's not in
this case.  To handle this the data segment size is increased
to ensure strncpy_from_user() does not fail with -EFAULT.

Additionally, I've added a printk() warning to catch this and
log it to the console if it ever reoccurs.  I thought everything
was working properly here because the consequences of this
failing are subtle and usually non-critical.
2010-04-22 12:53:58 -07:00
Brian Behlendorf ef6c136884 Disable rw_tryupgrade() for newer kernels
For kernels using the CONFIG_RWSEM_GENERIC_SPINLOCK implementation
nothing has changed.  But if your kernel is building with arch
specific rwsems rw_tryupgrade() has been disabled until it can
be implemented correctly.  In particular, the x86 implementation
now leverages atomic primitives for serialization rather than
spinlocks.  So to get this working again it will need to be
implemented as a cmpxchg for x86 and likely something similar
for other arches we are interested in.  For now it's safest
to simply disable it.
2010-04-22 12:28:19 -07:00
Brian Behlendorf 8934764e60 Add support for 'make -s' silent builds
The cleanest way to do this is to set AM_LIBTOOLFLAGS = --silent.  However,
AM_LIBTOOLFLAGS is not honored by automake-1.9.6-2.1 which is what I have
been using.  To cleanly handle this I am updating to automake-1.11-3 which
is why it looks like there is a lot of churn in the Makefiles.
2010-03-26 15:41:17 -07:00
Brian Behlendorf 16b719f006 Allow spl_config.h to be included by dependent packages (updated)
We need dependent packages to be able to include spl_config.h to
build properly.  This was partially solved in commit 0cbaeb1 by using
AH_BOTTOM to #undef common #defines (PACKAGE, VERSION, etc) which
autoconf always adds and cannot be easily removed.  This solution
works as long as the spl_config.h is included before your project's
config.h.  That turns out to be easier said than done.  In particular,
this is a problem when your package includes its config.h using the
-include gcc option which ensures the first thing included is your
config.h.

To handle all cases cleanly I have removed the AH_BOTTOM hack and
replaced it with an AC_CONFIG_HEADERS command.  This command runs
immediately after spl_config.h is written and with a little awk-foo
it strips the offending #defines from the file.  This eliminates
the problem entirely and makes the header safe for inclusion.

Also in this change I have removed the few places in the code where
spl_config.h is included.  It is now added to the gcc compile line
to ensure the config results are always available.

Finally, I have also disabled the verbose kernel builds.  If you
want them back you can always build with 'make V=1'.  Since things
are working now they don't need to be on by default.
2010-03-22 14:45:33 -07:00
Brian Behlendorf aa600d8a38 Reduce max kmem based slab size
Allowing MAX_ORDER-1 sized allocations for kmem based slabs has
been observed to result in deadlocks.  To help prevent this limit
max kmem based slab size to MAX_ORDER-3.  Just for the record
callers should not be creating slabs like this, but if they do
we should still handle it as safely as we can.
2010-03-18 13:39:51 -07:00
Brian Behlendorf c5c3d402f7 Prep for 0.4.8 tag, updated META and ChangeLog 2010-03-11 15:39:33 -08:00
Brian Behlendorf 6f088fde27 Ignore unsigned module build products
Along with the addition of signed kernel modules in newer kernels
we have a few new build products we need to ignore.  LKML has the
whole thread for those interested: http://lkml.org/lkml/2007/2/14/164
2010-03-11 14:29:17 -08:00
Brian J. Murrell 3cce0f1365 Fix definitions for the unknown distro/installation
If the distro/installation really is unsupported (i.e. unknown) we should
not make it look like a known distribution (i.e. RHEL) complete with
dependencies on other RPMs and trying to find kernel source in the RH
standard location.

Additionally add 'k' prefix for kernel requires for consistency.
2010-03-08 15:16:55 -08:00
Brian J. Murrell 534c4e38cb When no kernel source has been pointed to, first attempt to use
/lib/modules/$(uname -r)/source.  This will likely fail when building
under a mock (http://fedoraproject.org/wiki/Projects/Mock) chroot
environment since `uname -r` will report the running kernel which
likely is not the kernel in your chroot.  To cleanly handle this
we fallback to using the first kernel in your chroot.

The kernel-devel package which contains all the kernel headers and
a few build products such as Module.symver{s} is all that is required.
Full source is not needed.
2010-03-08 14:19:30 -08:00
Brian Behlendorf 21006d08af Remove Module.markers and Module.symver{s} in clean target
Split 'modules' and 'clean' Makefile targets to allow us to
cleanly remove the Module.* build products with a 'make clean'.
2010-03-08 13:39:57 -08:00
Brian Behlendorf 3977f8370f Linux 2.6.32 compat, proc_handler() API change
As of linux-2.6.32 the 'struct file *filp' argument was dropped from
the proc_handler() prototype.  It was apparently unused _almost_
everywhere in the kernel and this was simply cleanup.

I've added a new SPL_AC_5ARGS_PROC_HANDLER autoconf check for this and
the proper compat macros to correctly define the prototypes and some
helper functions.  It's not pretty but API compat changes rarely are.
2010-03-04 12:14:56 -08:00
Ricardo M. Correia 694921bc49 sun-misc-gitignore
Add .gitignore files.

Signed-off-by: Ricardo M. Correia <Ricardo.M.Correia@Sun.COM>
2010-01-08 09:37:54 -08:00
Ricardo M. Correia f7e8739c94 sun-fix-whitespace
Whitespace fixes.

Signed-off-by: Ricardo M. Correia <Ricardo.M.Correia@Sun.COM>
2010-01-08 09:37:54 -08:00
Ricardo M. Correia b520b14305 sun-fix-panic-str
Fix panic() string, which was being used as a format string, instead of an already-formatted string.
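
The class of bug fixed here can be sketched in user space with
snprintf() standing in for panic().  The log_message() helper is
hypothetical; the point is only that an already-formatted string should
be passed as an argument to "%s" rather than as the format itself, so
any '%' characters it contains are not interpreted.

```c
#include <stdio.h>

/*
 * Copy an already-formatted message into 'out'.  Passing 'msg' as the
 * format argument instead would cause any '%' sequences in it to be
 * parsed as conversion specifiers.
 */
static int log_message(char *out, size_t len, const char *msg)
{
    return snprintf(out, len, "%s", msg);
}
```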

Signed-off-by: Ricardo M. Correia <Ricardo.M.Correia@Sun.COM>
2010-01-08 09:37:54 -08:00
Brian Behlendorf 5562e5d105 Added splat taskq task ordering test case.
This test case verifies the correct behavior of taskq_wait_id().
In particular it ensures the following two cases are handled
properly:

1) Task ids larger than the waited for task id can run and
   complete as long as there is an available worker thread.
2) All task ids lower than the waited one must complete before
   unblocking even if the waited task id itself has completed.
2010-01-05 13:34:09 -08:00
Brian Behlendorf 82387586af Optimize lowest outstanding taskqid calculation in taskq_lowest_id()
In the initial version of taskq_lowest_id() the entire pending and
work list was locked under the tq->tq_lock to determine the lowest
outstanding taskqid.  At the time this was done because I was rushed
and wanted to make sure it was right... fast was secondary.  Well now
fast is important too so I carefully thought through the pending
and work list management and convinced myself it is safe and correct
to simply check the first entry.  I added a large comment to the source
to explain this.  But basically as long as we are careful to ensure the
pending and work list stay sorted this is safe and fast.

The motivation for this change was that I was observing as much as
10% of the total CPU time go to waiting on the tq->tq_lock when the
pending list was long.  This resolves that problem and frees up
that CPU time for something useful.
2010-01-04 15:52:26 -08:00
Brian Behlendorf ef1c7a0691 Strip __GFP_ZERO from kmalloc it is not available for older kernels.
This is needed to avoid a BUG_ON() on RHEL5.4 kernel 2.6.18-164.6.1,
since __GFP_ZERO is not a valid flag for kmalloc().
2009-12-23 12:57:10 -08:00
Brian Behlendorf 641bebe35f Fix kmem:slab_overcommit regression test locking
This regression test could crash in splat_kmem_cache_test_reclaim()
due to a race between the slab reclaim and the normal exiting of
the thread.  Specifically, the kct structure could be free'd by
the thread performing the allocations while the reclaim function
was also working on that thread's kct structure.  The simplest
fix is to extend the kcp->kcp_lock over the reclaim to prevent
the kct from being freed.  A better fix would be to ref count
these structures, but since this is just a regression test this locking
change is enough.  Surprisingly this was only observed commonly
under RHEL5.4 but all platforms could have hit this.
2009-12-23 12:46:11 -08:00
Brian Behlendorf 3a03ce5cbf Check for changed guard macro in 2.6.28+ for rwsem implementation.
As part of the 2.6.28 cleanup which moved all the linux/include/asm/
headers in to linux/arch, the guard macros for many header files
changed.  The i386 rwsem implementation keys off this macro to
ensure the internal members of the rwsem structure are interpreted
correctly.  This change checks for the new guard macro in addition
to the old one.  The implementation of the rwsem has not changed
for i386 so this is safe and correct.
2009-12-17 11:57:44 -08:00
Brian Behlendorf 242f539a2e Add skc_flags and full header to /proc/spl/kmem/slab. 2009-12-11 11:20:08 -08:00
Brian Behlendorf f60a5f5221 Splat vnode tests must return negative error codes.
I must have been in a hurry when I wrote the vnode regression tests
because the error code handling is not correct.  The Solaris vnode
API returns positive errno's, these need to be converted to negative
errno's for Linux before being passed back to user space.  Otherwise
the test harness will report the failure but errno will not be set
with the correct error code.

Additionally tests 3, 4, 6, and 7 may fail if the test file already
exists.  To avoid false positives a user mode helper has been added to
remove the test files in /tmp/ before running the actual test.
2009-12-10 15:06:07 -08:00
Brian Behlendorf d04c8a563c Atomic64 compatibility for 32-bit systems without kernel support.
This patch is another step towards updating the code to handle the
32-bit kernels which I have not been regularly testing.  These changes
do not really impact the common case I'm expecting which is the latest
kernel running on an x86_64 arch.

Until the linux-2.6.31 kernel the x86 arch did not have support for
64-bit atomic operations.  Additionally, the new atomic_compat.h support
for this case was wrong because it embedded a spinlock in the atomic
variable which must always and only be 64-bits total.  To handle these
32-bit issues we now simply fall back to the --enable-atomic-spinlock
implementation if the kernel does not provide the 64-bit atomic funcs.

The second issue this patch addresses is the DEBUG_KMEM assumption that
there will always be atomic64 funcs available.  On 32-bit archs this may
not be true, and actually that's just fine.  In that case the kernel
will never be able to allocate more than 32-bits worth anyway.  So just
check if atomic64 funcs are available, if they are not it means this
is a 32-bit machine and we can safely use atomic_t's instead.
2009-12-04 15:54:12 -08:00
Brian Behlendorf db1aa22297 Correctly handle division on 32-bit RHEL5 systems by returning dividend. 2009-12-01 15:53:28 -08:00
Brian Behlendorf 5652e7b497 When using x86 specific rwsem correctly interpret rwsem->count. 2009-12-01 15:47:27 -08:00
Brian Behlendorf 4e5691faf6 Only run the kmem overcommit test on 64-bit systems. 2009-12-01 11:40:47 -08:00
Brian Behlendorf a5d6f6020a Add missing atomic64 compat helpers for 32-bit systems.
The use of these functions was added with the recent atomic work
and not tested on 32-bit systems.  Add the missing compat functions:
atomic64_inc, atomic64_dec, atomic64_add_return, atomic64_sub_return,
atomic64_inc_return, atomic64_dec_return.
2009-12-01 10:15:27 -08:00
Brian Behlendorf 6ff686c44d Type long expected explicitly cast for 32-bit systems. 2009-12-01 10:14:01 -08:00
Brian Behlendorf f6ea161924 spl-modules-devel package must depend on the exact version of kernel
devel package it was built against.
2009-11-24 15:24:36 -08:00
Brian Behlendorf c1541dfef1 Add 'srpm' --with-config option for creation of spec files. 2009-11-24 14:21:45 -08:00
Brian Behlendorf ea385742db Add chaos5 and rhel6 macro's to the spl-modules.spec.in 2009-11-24 13:15:35 -08:00
Brian Behlendorf 958dc9e737 Prep for 0.4.7 tag, updated META and ChangeLog. 2009-11-20 16:52:29 -08:00
Brian Behlendorf fe883092b9 Ensure *.order and *.markers build products are removed by distclean rule. 2009-11-20 16:01:00 -08:00
Brian Behlendorf 0a6c005959 Ensure spl_config.h is included in spl-generic.c 2009-11-15 15:04:33 -08:00
Brian Behlendorf 1273cf284b Always use the generic mutex_destroy(). 2009-11-15 15:04:02 -08:00
Brian Behlendorf 05b48408fb Add mutex_enter_nested() as wrapper for mutex_lock_nested().
This symbol can be used by GPL modules which use the SPL to handle
cases where a call path takes two different locks by the same
name.  This is needed to avoid a false positive in the lock checker.
2009-11-15 14:27:15 -08:00
Brian Behlendorf 8b45dda2bc Linux 2.6.31 kmem cache alignment fixes and cleanup.
The big fix here is the removal of kmalloc() in kv_alloc().  It used
to be true in previous kernels that kmallocs over PAGE_SIZE would
always be page aligned.  This is no longer true, at least in 2.6.31
there are no longer any alignment expectations.  Since kv_alloc()
requires the resulting address to be page aligned we now only either
directly allocate pages in the KMC_KMEM case, or directly call
__vmalloc() both of which will always return a page aligned address.
Additionally, to avoid wasting memory size is always a power of two.

As for cleanup several helper functions were introduced to calculate
the aligned sizes of various data structures.  This helps ensure no
case is accidentally missed where the alignment needs to be taken in
to account.  The helpers now use P2ROUNDUP_TYPE instead of P2ROUNDUP
which is safer since the type will be explicit and we no longer count
on the compiler to auto promote types hopefully as we expected.

Always enforce minimum (SPL_KMEM_CACHE_ALIGN) and maximum (PAGE_SIZE)
alignment restrictions at cache creation time.

Use SPL_KMEM_CACHE_ALIGN in splat alignment test.
2009-11-13 11:12:43 -08:00
Brian Behlendorf c89fdee4d3 Remove __GFP_NOFAIL in kmem and retry internally.
As of 2.6.31 it's clear __GFP_NOFAIL should no longer be used and it
may disappear from the kernel at any time.  To handle this I have simply
added *_nofail wrappers in the kmem implementation which perform the
retry for non-atomic allocations.

From linux-2.6.31 mm/page_alloc.c:1166
/*
 * __GFP_NOFAIL is not to be used in new code.
 *
 * All __GFP_NOFAIL callers should be fixed so that they
 * properly detect and handle allocation failures.
 *
 * We most definitely don't want callers attempting to
 * allocate greater than order-1 page units with
 * __GFP_NOFAIL.
 */
WARN_ON_ONCE(order > 1);
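
The retry behavior of the *_nofail wrappers can be sketched in user
space as follows.  The alloc_retry() name, the can_sleep argument, and
the flaky_alloc() test allocator are all hypothetical; the sketch only
shows the shape of retrying internally for allocations that are allowed
to block while failing fast for atomic ones.

```c
#include <stddef.h>
#include <stdlib.h>

typedef void *(*alloc_fn)(size_t size);

/*
 * Call 'alloc' until it succeeds when the caller may sleep; return
 * the first result, even NULL, when it may not.
 */
static void *alloc_retry(alloc_fn alloc, size_t size, int can_sleep)
{
    void *ptr;

    do {
        ptr = alloc(size);
    } while (ptr == NULL && can_sleep);

    return ptr;
}

/* Test allocator that fails a fixed number of times, then succeeds. */
static int failures_left = 2;

static void *flaky_alloc(size_t size)
{
    if (failures_left > 0) {
        failures_left--;
        return NULL;
    }
    return malloc(size);
}
```

A real wrapper would likely also want to yield or back off between
attempts; that is omitted here to keep the sketch short.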
2009-11-12 15:11:24 -08:00
Brian Behlendorf baf2979ed3 Linux 2.6.31 Compatibility Updates
SPL_AC_2ARGS_SET_FS_PWD macro updated to explicitly include
linux/fs_struct.h which was dropped from linux/sched.h.

min_wmark_pages, low_wmark_pages, high_wmark_pages macros
introduced in newer kernels.  For older kernels mm_compat.h
was introduced to define them as needed as direct mappings
to per zone min_pages, low_pages, max_pages.
2009-11-10 14:06:57 -08:00
Brian Behlendorf f97cd5fd87 Prep for 0.4.6 tag, updated META and ChangeLog. 2009-11-02 10:24:12 -08:00
Brian Behlendorf 055ffd98cf Autoconf --enable-debug-* cleanup
Cleanup the --enable-debug-* configure options, this has been pending
for quite some time and I am glad I finally got to it.  To summarize:

1) All SPL_AC_DEBUG_* macros were updated to be more autoconf
friendly.  This mainly involved a shift to the GNU approved usage of
AC_ARG_ENABLE and ensuring AS_IF is used rather than directly using
an if [ test ] construct.

2) --enable-debug-kmem=yes by default.  This simply enables keeping
a running tally of total memory allocated and freed and reporting a
memory leak if there was one at module unload.  Additionally, it
ensures /proc/spl/kmem/slab will exist by default which is handy.
The overhead is low for this and it should not impact performance.

3) --enable-debug-kmem-tracking=no by default.  This option was added
to provide a configure option to enable detailed memory allocation
tracking.  This support was always there but you had to know where to
turn it on.  By default this support is disabled because it is known
to badly hurt performance, however it is invaluable when chasing a
memory leak.

4) --enable-debug-kstat removed.  After further reflection I can't see
why you would ever really want to turn this support off.  It is now
always on which had the nice side effect of simplifying the proc handling
code in spl-proc.c.  We can now always assume the top level directory
will be there.

5) --enable-debug-callb removed.  This never really did anything, it was
put in provisionally because it might have been needed.  It turns out
it was not so I am just removing it to prevent confusion.
2009-10-30 13:58:51 -07:00
Brian Behlendorf 302b88e6ab Add autoconf checks for atomic64_cmpxchg + atomic64_xchg
These functions didn't exist for all archs prior to 2.6.24.  This
patch adds an autoconf test to detect this and add them when needed.
The autoconf check is needed instead of just an #ifndef because in
the most modern kernels atomic64_{cmp}xchg are implemented as an
inline function and not a #define.
2009-10-30 13:53:17 -07:00
Brian Behlendorf 5e9b5d832b Use Linux atomic primitives by default.
Previously Solaris style atomic primitives were implemented simply by
wrapping the desired operation in a global spinlock.  This was easy to
implement at the time when I wasn't 100% sure I could safely layer the
Solaris atomic primitives on the Linux counterparts.  It however was
likely not good for performance.

After more investigation however it does appear the Solaris primitives
can be layered on Linux's fairly safely.  The Linux atomic_t type really
just wraps a long so we can simply cast the Solaris unsigned value to
either an atomic_t or atomic64_t.  The only lingering problem for both
implementations is that Solaris provides no atomic read function.  This
means reading a 64-bit value on a 32-bit arch can (and will) result in
word breaking.  I was very concerned about this initially, but upon
further reflection it is a limitation of the Solaris API.  So really
we are just being bug-for-bug compatible here.

With this change the default implementation is layered on top of Linux
atomic types.  However, because we're assuming a lot about the internal
implementation of those types I've made it easy to fall-back to the
generic approach.  Simply build with --enable-atomic_spinlocks if
issues are encountered with the new implementation.
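
The two strategies described above can be sketched in userspace C.  This
is only an illustration under stated assumptions: a pthread mutex stands
in for the global kernel spinlock, the GCC/Clang __atomic builtins stand
in for the Linux atomic64_t operations, and both function names are
hypothetical rather than taken from the SPL.

```c
#include <pthread.h>
#include <stdint.h>

/*
 * Generic approach: every operation serializes on one global lock.
 * (Assumption: a pthread mutex stands in for the kernel spinlock.)
 */
static pthread_mutex_t atomic64_lock = PTHREAD_MUTEX_INITIALIZER;

static uint64_t
locked_add_64_nv(volatile uint64_t *target, int64_t delta)
{
    uint64_t nv;

    pthread_mutex_lock(&atomic64_lock);
    *target += (uint64_t)delta;
    nv = *target;
    pthread_mutex_unlock(&atomic64_lock);

    return (nv);
}

/*
 * Layered approach: hand the caller's plain 64-bit value directly to a
 * native atomic primitive (here a compiler builtin, not atomic64_t).
 */
static uint64_t
native_add_64_nv(volatile uint64_t *target, int64_t delta)
{
    return (__atomic_add_fetch(target, (uint64_t)delta,
        __ATOMIC_SEQ_CST));
}
```

Both variants return the new value.  Neither gives the caller an atomic
read of the 64-bit word, which is the API limitation noted above.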
2009-10-30 10:55:25 -07:00
Brian Behlendorf 2b5adaf18f I should not have removed these, they are important. 2009-10-27 16:17:06 -07:00
Brian Behlendorf 4bd577d069 Rebase cmn_err on vcmn_err and don't warn about missing \n
The cmn_err/vcmn_err functions are layered on top of the debug
system which usually expects a newline at the end.  However, there
really doesn't need to be a newline there and there in fact should
not be for the CE_CONT case so let's just drop the warning.

Also we make a half-hearted attempt to handle a leading ! which
means only send it to the syslog not the console.  In this case
we just send to the debug logs and not the console.
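
A minimal sketch of the leading '!' handling, assuming a hypothetical
helper (msg_route() is not an SPL function) that strips the marker and
reports whether the console should see the message:

```c
#include <stddef.h>

/*
 * Hypothetical helper: decide where a cmn_err()-style message goes.
 * Returns the text to log with any leading '!' stripped, and sets
 * *to_console to 0 when the message must stay out of the console.
 */
static const char *
msg_route(const char *fmt, int *to_console)
{
    if (fmt != NULL && fmt[0] == '!') {
        *to_console = 0;    /* debug log only */
        return (fmt + 1);
    }

    *to_console = 1;        /* debug log and console */
    return (fmt);
}
```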
2009-10-27 16:13:35 -07:00
Brian Behlendorf f44078fad5 Remove usage of the __id_u macro for portability.
This macro was removed from the default RPM macro file.  Interestingly,
some of the arch specific macros add it back based on your distro
but it should not be counted on.  However, __id still exists and its
command line args have historically been fairly stable so we will
directly use %{__id} -un to get the user name.
2009-10-05 12:51:58 -07:00
Brian Behlendorf 39ab544079 Use kobject_set_name() for increased portability.
As of 2.6.25 kobj->k_name was replaced with kobj->name.  Some distros
such as RHEL5 (2.6.18) add a patch to prevent this from being a problem
but other older distros such as SLES10 (2.6.16) have not.  To avoid
the whole issue I'm updating the code to use kobject_set_name() which
does what I want and has existed all the way back to 2.6.11.
2009-10-02 16:21:59 -07:00
Brian Behlendorf 51a727e90f Set cwd to '/' for the process executing insmod.
Ricardo has pointed out that under Solaris the cwd is set to '/'
during module load, while under Linux it is set to the caller's cwd.
To handle this cleanly I've reworked the module *_init()/_exit()
macros so they call a *_setup()/_cleanup() function when any SPL
dependent module is loaded or unloaded.  This gives us a chance to
perform any needed modification of the process, in this case changing
the cwd.  It also handily provides a way to avoid creating wrapper
init()/exit() functions because the Solaris and Linux prototypes
differ slightly.  All dependent modules should now call the spl
helper macros spl_module_{init,exit}() instead of the native linux
versions.

Unfortunately, it appears that under Linux there has been no consistent
API in the kernel to set the cwd in a module.  Because of this I have
had to add more autoconf magic than I'd like.  However, what I have
done is correct and has been tested on RHEL5, SLES11, FC11, and CHAOS
kernels.

In addition, I have changed the rootdir type from a 'void *' to the
correct 'vnode_t *' type.  And I've set rootdir to a non-NULL value.
2009-10-01 16:06:15 -07:00
Brian Behlendorf 0e77fc118e Expand SEM() outside init_rwsem and directly call __init_rwsem().
We need to directly call __init_rwsem() or the name gets expanded
to SEM(lock-name).  This is safe and correct for the supported arches
x86/x86_64/ppc/ppc64.
2009-09-29 03:19:09 -07:00
Brian Behlendorf 4d54fdee1d Reimplement mutexes for Linux lock profiling/analysis
For a generic explanation of why mutexes needed to be reimplemented
to work with the kernel lock profiling see commits:
  e811949a57 and
  d28db80fd0

The specific changes made to the mutex implementation are as follows.
The Linux mutex structure is now directly embedded in the kmutex_t.
This allows a kmutex_t to be directly cast to a mutex struct and
passed directly to the Linux primitive.

Just like with the rwlocks it is critical that these functions be
implemented as #defines to ensure the location information is
preserved.  The preprocessor can then do a direct replacement of
the Solaris primitive with the Linux primitive.

Just as with the rwlocks we need to track the lock owner.  Here
things get a little more interesting because depending on your
kernel version, and how you've built your kernel Linux may already
do this for you.  If you're running a 2.6.29 or newer kernel on an
SMP system the lock owner will be tracked.  This was added to Linux
to support adaptive mutexes, more on that shortly.  Alternately, your
kernel might track the lock owner if you've set CONFIG_DEBUG_MUTEXES
in the kernel build.  If neither of the above things is true for
your kernel the kmutex_t type will include and track the lock owner
to ensure correct behavior.  This is all handled by a new autoconf
check called SPL_AC_MUTEX_OWNER.
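
The embedding and owner tracking described above can be modelled in
userspace.  This is a sketch under assumptions, not the SPL code: a
pthread mutex stands in for the Linux mutex, the m_* field names are
hypothetical, and the owner is always tracked by the wrapper (the case
where the kernel does not do it for you).

```c
#include <pthread.h>
#include <stddef.h>

/*
 * Hypothetical userspace model: the native lock is the first member,
 * so a pointer to the wrapper is also a valid pointer to the lock.
 */
typedef struct kmutex {
    pthread_mutex_t m_lock;     /* must stay first */
    pthread_t       m_owner;    /* valid only while m_held != 0 */
    int             m_held;
} kmutex_t;

/* Macros, not functions, so the expansion happens at the call site. */
#define mutex_init(mp)                                          \
do {                                                            \
    pthread_mutex_init((pthread_mutex_t *)(mp), NULL);          \
    (mp)->m_held = 0;                                           \
} while (0)

#define mutex_enter(mp)                                         \
do {                                                            \
    pthread_mutex_lock((pthread_mutex_t *)(mp));                \
    (mp)->m_owner = pthread_self();                             \
    (mp)->m_held = 1;                                           \
} while (0)

#define mutex_exit(mp)                                          \
do {                                                            \
    (mp)->m_held = 0;                                           \
    pthread_mutex_unlock((pthread_mutex_t *)(mp));              \
} while (0)

/* Non-zero only when the calling thread is the holder. */
#define MUTEX_HELD(mp)                                          \
    ((mp)->m_held && pthread_equal((mp)->m_owner, pthread_self()))
```

The owner fields are only written while the lock is held.  A real
implementation would also need to consider how MUTEX_HELD() reads them
from a thread that does not hold the lock, which this model ignores.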

Concerning adaptive mutexes these are a very recent development and
they did not make it in to either the latest FC11 or SLES11 kernels.
Ideally, I'd love to see this kernel change appear in one of these
distros because it does help performance.  From Linux kernel commit:
  0d66bf6d3514b35eb6897629059443132992dbd7
  "Testing with Ingo's test-mutex application...
  gave a 345% boost for VFS scalability on my testbox"
However, if you don't want to backport this change yourself you
can still simply export the task_curr() symbol.  The kmutex_t
implementation will use this symbol when it's available to
provide its own adaptive mutexes.

Finally, DEBUG_MUTEX support was removed including the proc handlers.
This was done because now that we are cleanly integrated with the
kernel profiling all this information and much much more is available
in debug kernel builds.  This code was now redundant.

Updated mutexes validated on:
    - SLES10   (ppc64)
    - SLES11   (x86_64)
    - CHAOS4.2 (x86_64)
    - RHEL5.3  (x86_64)
    - RHEL6    (x86_64)
    - FC11     (x86_64)
2009-09-25 14:47:01 -07:00
Brian Behlendorf d28db80fd0 Update rwlocks to track owner to ensure correct semantics
The behavior of RW_*_HELD was updated because it was not quite right.
It is not sufficient to return non-zero when the lock is held, we must
only do this when the current task is the holder.

This means we need to track the lock owner which is not something
tracked in a Linux semaphore.  After some experimentation the
solution I settled on was to embed the Linux semaphore at the start
of a larger krwlock_t structure which includes the owner field.
This maintains good performance and allows us to cleanly integrate
with the kernel lock analysis tools.  My reasons:

1) By placing the Linux semaphore at the start of krwlock_t we can
then simply cast krwlock_t to a rw_semaphore and pass that on to
the linux kernel.  This allows us to use #defines so the preprocessor
can do direct replacement of the Solaris primitive with the linux
equivalent.  This is important because it then maintains the location
information for each rw_* call point.

2) Additionally, by adding the owner to krwlock_t we can keep this
needed extra information adjacent to the lock itself.  This removes
the need for a fancy lookup to get the owner which is optimal for
performance.  We can also leverage the existing spin lock in the
semaphore to ensure owner is updated correctly.

3) All helper functions which do not need to strictly be implemented
as a define to preserve location information can be done as a static
inline function.

4) Adding the owner to krwlock_t allows us to remove all memory
allocations done during lock initialization.  This is good for all
the obvious reasons, though we do give up the ability to specify the lock
name.  The Linux profiling tools will stringify the lock name used
in the code via the preprocessor and use that.

Updated rwlocks validated on:
- SLES10   (ppc64)
- SLES11   (x86_64)
- CHAOS4.2 (x86_64)
- RHEL5.3  (x86_64)
- RHEL6    (x86_64)
- FC11     (x86_64)
2009-09-25 14:14:35 -07:00
Brian Behlendorf e811949a57 Reimplement rwlocks for Linux lock profiling/analysis.
It turns out that the previous rwlock implementation worked well but
did not integrate properly with the upstream kernel lock profiling/
analysis tools.  This is a major problem since it would be awfully
nice to be able to use the automatic lock checker and profiler.

The problem is that the upstream lock tools use the pre-processor
to create a lock class for each uniquely named lock.  Since the
rwsem was embedded in a wrapper structure the name was always the
same.  The effect was that we only ended up with one lock class for
the entire SPL which caused the lock dependency checker to flag
nearly everything as a possible deadlock.

The solution was to directly map a krwlock to a Linux rwsem using
a typedef thereby eliminating the wrapper structure.  This was not
done initially because the rwsem implementation is specific to the arch.
To fully implement the Solaris krwlock API using only the provided rwsem
API is not possible.  It can only be done by directly accessing some of
the internal data members of the rwsem structure.

For example, the Linux API provides a different function for dropping
a reader vs writer lock.  Whereas the Solaris API uses the same function
and the caller does not pass in what type of lock it is.  This means to
properly drop the lock we need to determine if the lock is currently a
reader or writer lock.  Then we need to call the proper Linux API function.
Unfortunately, there is no provided API for this so we must extract this
information directly from the arch specific lock implementation.  This is
all doable, and what I did, but it does complicate things considerably.

The good news is that in addition to the profiling benefits of this
change, we may see performance improvements due to slightly reduced
overhead when creating rwlocks and manipulating them.

The only function I was forced to sacrifice was rw_owner() because this
information is simply not stored anywhere in the rwsem.  Luckily this
appears not to be a commonly used function on Solaris, and it is my
understanding it is mainly used for debugging anyway.

In addition to the core rwlock changes, extensive updates were made to
the rwlock regression tests.  Each class of test was extended to provide
more API coverage and to be more rigorous in checking for misbehavior.

This is a pretty significant change and with that in mind I have been
careful to validate it on several platforms before committing.  The full
SPLAT regression test suite was run numerous times on all of the following
platforms.  This includes various kernels ranging from 2.6.16 to 2.6.29.

- SLES10   (ppc64)
- SLES11   (x86_64)
- CHAOS4.2 (x86_64)
- RHEL5.3  (x86_64)
- RHEL6    (x86_64)
- FC11     (x86_64)
2009-09-18 16:09:47 -07:00
Brian Behlendorf 73358d5a1d Various spec file tweaks to handle rpm building of several distros.
Supported and tested distros now include SLES10, SLES11, Chaos 4.x,
RHEL5, and Fedora 11.  This update was mainly to address rebuildable
kernel module rpms, and correct rpm dependencies for each distro.
2009-08-14 14:09:16 -07:00
Brian Behlendorf 26d77c4493 Explicit check for requires_* rpm defines
Due to different distros and/or versions of rpm mishandling the shorthand
syntax simply use the longer version which gets interpreted correctly.
2009-08-13 15:02:34 -07:00
Brian Behlendorf 68ada11e5c Tag spl-0.4.5.
Update the ChangeLog with a summary of the changes since the last release
and update the META file to reflect the new version number.
2009-08-04 12:22:33 -07:00
Brian Behlendorf 16f4a92c10 Required missing symbols for FC11 kernels (2.6.29.4-167.fc11.x86_64) 2009-07-31 12:44:34 -07:00
Brian Behlendorf c65d62d8bf Disable stack overflow checking by default.
The run time stack overflow checking is being disabled by default
because it is not safe for use with 2.6.29 and later kernels.  These
kernels do now have their own stack overflow checking so this support
has become redundant anyway.  It can be re-enabled for older kernels or
arches without stack overflow checking by redefining CHECK_STACK().
2009-07-30 13:52:11 -07:00
Brian Behlendorf 6ae7fef5b9 Update global_page_state() support for 2.6.29 kernels.
Basically everything we need to monitor the global memory state of
the system is now cleanly available via global_page_state().  The
problem is that this interface is still fairly recent, and there
has been one change in the page state enum which we need to handle.
These changes basically boil down to the following:
- If global_page_state() is available we should use it.  Several
  autoconf checks have been added to detect the correct enum names.
- If global_page_state() is not available check to see if
  get_zone_counts() symbol is available and use that.
- If the get_zone_counts() symbol is not exported we have no choice
  but to dynamically acquire it at load time.  This is an absolute
  last resort for old kernels which we don't want to patch to
  cleanly export the symbol.
2009-07-28 15:06:42 -07:00
Brian Behlendorf 6b09f73939 Remove get/put_task_struct as they are not available for SLES11
This interface is going away, and it's not as if most callers actually
use crhold/crfree when working with credentials.  So it's okay that
we're not taking a reference on the task structure; the odds of it
going away while working with a credential are pretty small.
2009-07-28 15:04:21 -07:00
Brian Behlendorf ec7d53e99a Add basic credential support and splat tests.
The previous credential implementation simply provided the needed types and
a couple of needed dummy functions.  This update correctly ties the basic
Solaris credential API in to one of two Linux kernel APIs.

Prior to 2.6.29 the linux kernel embedded all credentials in the task
structure.  For these kernels, we pass around the entire task struct as if
it were the credential, then we use the helper functions to extract the
credential related bits.

As of 2.6.29 a new credential type was added which we can and do fairly
cleanly layer on top of.  Once again the helper functions nicely hide
the implementation details from all callers.

Three tests were added to the splat test framework to verify basic
correctness.  They should be extended as needed when new credential
functions are added.
2009-07-27 17:18:59 -07:00
Brian Behlendorf 3d0cb2d31d Remove LINUXINCLUDE from autoconf wrapper, breaks 2.6.28+ kernels.
Modern kernel build systems at least post 2.6.16 will set this properly
so we should not.  In fact post 2.6.28 the include headers have moved
under arch so the guess we make here is completely wrong.  Letting
the kernel build system set this ensures it will be correct.
2009-07-27 09:52:01 -07:00
Brian Behlendorf 7064b767c2 Positive Solaris ioctl return codes need to be negated for use by libc 2009-07-23 16:14:52 -07:00
Brian Behlendorf 3c9ce2bf69 Allow kmem or vmem based slab for slab_lock and slab_overcommit tests.
The slab_overcommit test case could hang on a system with fragmented
memory because it was creating a kmem based slab with 256K objects.
To avoid this I've removed the KMC_KMEM flag which allows the slab
to decide if it should be kmem or vmem backed based on the object
size.  The slab_lock test shares this code and will also be affected.
But the point of these two tests is to stress cache locking and
memory overcommit, the type of slab is not critical.  In fact, allowing
the slab to do the default smart thing is preferable.
2009-07-23 13:50:53 -07:00
Brian Behlendorf 2141116167 The HAVE_PATH_IN_NAMEIDATA compat macros should have been used here. 2009-07-22 14:28:19 -07:00
Brian Behlendorf 749e5eb1ed Check arch/default/ path when detecting kernel objects on SLES
We still preferentially use arch/arch looking for a native version
but if that fails it is acceptable to use default.
2009-07-22 06:59:28 -07:00
Brian Behlendorf 78d6de97bd Register a basic compat ioctl handler (32 vs 64 bit compat)
Simply pass the ioctl on to the normal handler.  If the ioctl
helper macros are used correctly this should be safe as they
will handle the packing/unpacking of the data encoded in the
ioctl command.  And actually, if the caller does not use the
IO* macros at all, and just passes small values, it will probably
be OK as well.  We only get in to trouble if they try and use
the upper 32-bits.  Endianness is not really a concern here, we
are pretty much assuming the user and kernel will match.
2009-07-21 10:13:58 -07:00
Ricardo M. Correia ac95d0974b Fixed NULL dereference by tcd_for_each() when the kmalloc() call in module/spl/spl-debug.c:1163 returns NULL.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2009-07-14 15:24:59 -07:00
Ricardo M. Correia e004f04c8b Prevent integer overflow after ~164 days of uptime.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2009-07-14 15:23:25 -07:00
Brian Behlendorf b11b08ed64 Add a little paranoia here to ensure endianness is set correctly. 2009-07-14 14:28:04 -07:00
Brian Behlendorf 06dea10380 Add basic groupmember() function, not sup groups. 2009-07-10 10:58:06 -07:00
Brian Behlendorf d3126abe75 Add ddi_copyin/ddi_copyout support for fake kernel originated ioctls. 2009-07-10 10:56:32 -07:00
Brian Behlendorf 2a734e9c26 Define ACE_ALL_PERMS for use by ACLs 2009-07-09 15:00:25 -07:00
Brian Behlendorf c18cbcfe66 Define FKIOCTL which is used on Solaris to mark an in-kernel ioctl. 2009-07-09 14:59:41 -07:00
Brian Behlendorf 3a68dc5374 Add ASSERTV macro to simplify removing variables (the V in ASSERTV)
which are only used in ASSERT().
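
A minimal sketch of how such a macro can work, assuming the standard
assert()/NDEBUG pair as a stand-in for the SPL's ASSERT() and its debug
switch.  The sum_first() function is purely illustrative.

```c
#include <assert.h>

#ifdef NDEBUG
#define ASSERTV(x)          /* declaration vanishes with the asserts */
#else
#define ASSERTV(x)  x
#endif

static int
sum_first(const int *vals, int n)
{
    ASSERTV(int orig_n = n);    /* only referenced by assert() below */
    int total = 0;

    while (n > 0)
        total += vals[--n];

    assert(n <= orig_n);        /* n never grows */
    return (total);
}
```

With NDEBUG defined both the variable and the check disappear, so the
compiler has no unused variable to warn about.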
2009-07-09 12:15:23 -07:00
Brian Behlendorf 915404bd50 Add basic support for TASKQ_THREADS_CPU_PCT taskq flag which is
used to scale the number of threads based on the number of online
CPUs.  As CPUs are added/removed we should rescale the thread
count appropriately, but currently this is only done at create.
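
The scaling can be sketched as below.  The helper name is hypothetical,
and clamping the percentage to 0..100 and keeping at least one thread
are assumptions made for the illustration, not details taken from this
commit.

```c
/*
 * Hypothetical helper: treat 'pct' as a percentage of the online CPUs
 * and derive a thread count from it, evaluated once at create time.
 */
static int
taskq_threads_from_pct(int pct, int online_cpus)
{
    int nthreads;

    if (pct < 0)
        pct = 0;
    if (pct > 100)
        pct = 100;

    nthreads = (online_cpus * pct) / 100;

    return (nthreads < 1 ? 1 : nthreads);   /* always at least one */
}
```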
2009-07-09 10:07:52 -07:00
Brian Behlendorf aaad2f7226 Update ChangeLog 2009-07-02 14:19:11 -07:00
Brian Behlendorf bb339d0670 Cleanly handle --with-linux=NONE option when used to generate source
rpms.  These should not be fatal because we actually don't need them
until we build the source rpm.  When doing mock builds this is
important because these dependent rpms will only be installed if
they are specified in the source rpm's spec file.
2009-07-02 10:47:28 -07:00
Brian Behlendorf 86933a6e51 Simplify rpm build rules, added config/rpm.am.
Distro friendly changes such that the kernel modules are packaged separately.
2009-07-01 14:37:44 -07:00
Brian Behlendorf 5c3c70adec Add spl.release to spl-devel to simplify dependent package version check. 2009-06-29 16:41:21 -07:00
Brian Behlendorf f4f9cd75a1 Install spl-devel products in /usr/src/spl-SPL_VERSION/LINUX_VERSION/
Remove the spl symlink, it's just confusing
2009-06-26 16:30:44 -07:00
Brian Behlendorf c0517c35d2 Use do_div on older kernel where do_div64 doesn't exist. 2009-06-26 13:10:52 -07:00
Brian Behlendorf 155189d4a7 Additional tuning to get the BuildRequires right for all cases.
2009-06-26 12:43:27 -07:00
Brian Behlendorf ac12b26284 Simplify the kernel dependency logic 2009-06-26 11:37:06 -07:00
Brian Behlendorf af971a8594 Spec file update, for some reason the following shorthand syntax
was failing so it was replaced with the longer %if version.

%{!?foo: %define foo bar}

changed to

%if %{undefined foo}
 %define foo bar
%endif
2009-06-26 10:34:40 -07:00
Brian Behlendorf e28bc9160d SRPM build farm / mock integration 2009-06-26 09:40:14 -07:00
Brian Behlendorf 07114bdee9 Build farm integration to ensure BuildRequires are correct 2009-06-25 16:11:13 -07:00
Brian Behlendorf 31b2e0b070 Packaging Fixes
- Kernel modules should be built using the LINUX_OBJ Makefiles and
  not the LINUX Makefiles to ensure the proper install paths are used.
- Install modules in to addon/spl/
- Ensure no additional kernel module build products are packaged.
- Simplified spl.spec.in which supports RHEL, CHAOS, SLES, FEDORA.
2009-06-25 15:31:53 -07:00
Brian Behlendorf 762b96f6c6 Update ChangeLog with a high level summary of the changes from
0.4.3 to 0.4.4 prior to tagging.  Full details can be found in
the git commit history.
2009-06-22 15:31:40 -07:00
Brian Behlendorf 2e0e7e6976 Packaging improvements for RHEL and SLES (part 2)
- Allow checking for exported symbols in both Module.symvers
  and Module.symvers.  My stock SLES kernel ships an objects
  directory with Module.symvers, yet produces a Module.symvers
  in the local build directory.
2009-06-16 11:34:28 -07:00
Brian Behlendorf 39a3d2a421 Packaging improvements for RHEL and SLES
- Properly honor --prefix in build system and rpm spec file.
- Add '--define require_kdir' to spec file to support building
  rpms against kernel sources installed in non-default locations.
- Add '--define require_kobj' to spec file to support building
  rpms against kernel object installed in non-default locations.
- Stop suppressing errors in autogen.sh script.
- Improved logic to detect missing kernel objects when they are
  not located with the source.  This is the common case for SLES
  as well as in-tree chaos kernel builds and is done to simplify
  support for multiple arches.
- Moved spl-devel build products to /usr/src/spl-<version>, a
  spl symlink is created to reference the last installed version.
2009-06-16 10:44:59 -07:00
Brian Behlendorf e554dffa60 SLES10 Fixes (part 9)
- Proper ioctl() 32/64-bit binary compatibility.  We need to ensure the
  ioctl data itself is always packed the same for 32/64-bit binaries.
  Additionally, the correct thing to do is encode this size in bytes
  as part of the command using _IOC_SIZE().
- Minor formatting changes to respect the 80 character limit.
- Move all SPLAT_SUBSYSTEM_* defines in to splat-ctl.h.
- Increase SPLAT_SUBSYSTEM_UNKNOWN because we were getting close
  to accidentally using it for a real registered subsystem.
2009-05-21 10:56:11 -07:00
Brian Behlendorf 9593ef76d9 SLES10 Fixes (part 8)
- Add compat_ioctl() handler, by default 64-bit SLES systems build 32-bit
  ELF binaries.  For the 32-bit binaries to pass ioctl information to a
  64-bit kernel a compatibility handler needs to be registered.  In our
  case no additional conversions are needed to convert 32-bit ioctl()
  commands to 64-bit commands so we can just call the default handler.
2009-05-20 16:33:08 -07:00
Brian Behlendorf 124ca8a5a9 SLES10 Fixes (part 7)
- Initial SLES testing uncovered a long standing bug in the debug
  tracing.  The tcd_for_each() macro expected a NULL to terminate
  the trace_data[i] array but this was only ever true due to luck.
  All trace_data[] iterators are now properly capped by TCD_TYPE_MAX.
- SPLAT_MAJOR 229 conflicted with a 'hvc' device on my SLES system.
  Since this was always an arbitrary choice I picked something else.
- The HAVE_PGDAT_LIST case should set pgdat_list_addr to the value stored
  at the address of the memory location returned by kallsyms_lookup_name().
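
The tcd_for_each() fix in the first item comes down to bounding the
iteration by the array size instead of trusting a NULL terminator.  A
generic sketch with hypothetical names (this is not the SPL macro
itself):

```c
#include <stddef.h>

#define DEMO_TYPE_MAX   3

/*
 * Count the leading non-NULL entries.  The index bound guarantees the
 * loop stops even when every slot is populated and no NULL follows.
 */
static int
demo_count_entries(int *const table[DEMO_TYPE_MAX])
{
    int i, n = 0;

    for (i = 0; i < DEMO_TYPE_MAX && table[i] != NULL; i++)
        n++;

    return (n);
}
```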
2009-05-20 15:30:13 -07:00
Brian Behlendorf 5232d256b4 SLES10 Fixes (part 6)
- Prior to 2.6.17 there were no *_pgdat helper functions in mm/mmzone.c.
  Instead for_each_zone() operated directly on pgdat_list which may or
  may not have been exported depending on how your kernel was compiled.
  Now new configure checks determine if you have the helpers or not, and
  if the needed symbols are exported.  If they are not exported then they
  are dynamically acquired at runtime by kallsyms_lookup_name().
2009-05-20 14:23:13 -07:00
Brian Behlendorf 3731931529 Powerpc Fixes (part 1):
- Enable builds for powerpc ISA type.
- Add DIV_ROUND_UP and roundup macros if unavailable.
- Cast 64-bit values for %lld format string to (long long) to
  quiet compile warning.
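
Fallback definitions of this kind are conventionally written as guarded
macros.  The definitions below are the common textbook forms and are
assumed here, not copied from the patch:

```c
/* Define the helpers only when the environment lacks them. */
#ifndef DIV_ROUND_UP
#define DIV_ROUND_UP(n, d)  (((n) + (d) - 1) / (d))
#endif

#ifndef roundup
#define roundup(x, y)       ((((x) + ((y) - 1)) / (y)) * (y))
#endif
```

For example, DIV_ROUND_UP(10, 4) is 3 and roundup(10, 4) is 12.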
2009-05-20 12:23:24 -07:00
Brian Behlendorf fe4573928f SLES10 Fixes (part 5):
- Fix incorrect mapping for spl_device_create()->class_device_create()
  which is the preferred API for 2.6.13 to 2.6.17 based kernels.
2009-05-20 11:54:40 -07:00
Brian Behlendorf a093c6a499 SLES10 Fixes (part 4):
- Configure check for SLES specific API change to vfs_unlink()
  and vfs_rename() which added a 'struct vfsmount *' argument.
  This was for something called the linux-security-module, but
  it appears that it was never adopted upstream.
2009-05-20 11:31:55 -07:00
Brian Behlendorf 6c9433c150 SLES10 Fixes (part 3):
- Configure check for mutex_lock_nested().  This function was introduced
  as part of the mutex validator in 2.6.18, but if it's unavailable then
  it's safe to fallback to a plain mutex_lock().
2009-05-20 11:00:39 -07:00
Brian Behlendorf 96dded3844 SLES10 Fixes (part 2):
- Configure check, the div64_64() function was renamed to
  div64_u64() as of 2.6.26.
- Configure check, the global_page_state() function was introduced
  in 2.6.18 kernels.  The earlier 2.6.16 based SLES10 must not try
  and use it, thankfully get_zone_counts() is still available.
- To simplify debugging poison all symbols acquired dynamically
  using spl_kallsyms_lookup_name() with SYMBOL_POISON.
- Add console messages when the user mode helpers fail.
- spl_kmem_init_globals() use bit shifts instead of division.
- When the monotonic clock is unavailable __gethrtime() must perform
  the HZ division as an 'unsigned long long' because the SPL only
  implements __udivdi3(), and not __divdi3() for 'long long' division
  on 32-bit arches.
2009-05-20 10:08:37 -07:00
Brian Behlendorf bf338d8d09 SLES10 Fixes (part 1):
- Exclude -obj when detecting installed kernel source.
- Detect -obj directory for out of tree kernel builds.
- Allow kernel build system to set CC to ensure -m64 is set properly.
  This is an issue on 64-bit SLES systems which by default always
  build 32-bit binaries (unlike RHEL/Fedora which default to 64-bit)
2009-05-19 11:42:39 -07:00
Brian Behlendorf f8b2932a43 Prep for spl-0.4.3 tag. 2009-03-20 14:48:30 -07:00
Brian Behlendorf 759dfe7d43 Add list_move_tail() function. 2009-03-19 21:40:07 -07:00
Brian Behlendorf c388a3ab26 Remove useless EOL white space padding from splat -l command. 2009-03-18 11:56:42 -07:00
Brian Behlendorf f250d90b5f Fix vmem leak in kmem_cache_test (missing splat_kmem_cache_test_kcp_free()) 2009-03-18 11:56:00 -07:00
Brian Behlendorf 0cbaeb117a Allow spl_config.h to be included by dependant packages
We need dependent packages to be able to include spl_config.h so they
can leverage the configure checks the SPL has done.  This is important
because several of the spl headers need the results of these checks to
work properly.  Unfortunately, the autoheader build product is always
private to a particular build and defines certain common things
(PACKAGE, VERSION, etc).  This prevents it from being included by
other packages which also use autoheader because the definitions
conflict.  To avoid this problem the SPL build system leverages
AH_BOTTOM to include a spl_unconfig.h at the bottom of the autoheader
build product.  This custom include undefs all known shared symbols to
prevent the conflict.  This does however mean that those definitions
are also not available to the SPL package either.  The SPL package
therefore uses the equivalent SPL_META_* definitions.
2009-03-17 14:55:59 -07:00
Brian Behlendorf e11d6c5f50 FC10/i686 Compatibility Update (2.6.27.19-170.2.35.fc10.i686)
In the interests of portability I have added a FC10/i686 box to
my list of development platforms.  The hope is this will allow me
to keep current with upstream kernel API changes, and at the same
time ensure I don't accidentally break x86 support.  This patch
resolves all remaining issues observed under that environment.

1) SPL_AC_ZONE_STAT_ITEM_FIA autoconf check added.  As of 2.6.21
the kernel added a clean API for modules to get the global count
for free, inactive, and active pages.  The SPL attempts to detect
if this API is available and directly map spl_global_page_state()
to global_page_state().  If the full API is not available then
spl_global_page_state() is implemented as a thin layer to get
these values via get_zone_counts() if that symbol is available.

2) New kmem:vmem_size regression test added to validate correct
vmem_size() functionality.  The test case acquires the current
global vmem state, allocates from the vmem region, then verifies
the allocation is correctly reflected in the vmem_size() stats.

3) Change splat_kmem_cache_thread_test() to always use KMC_KMEM
based memory.  On x86 systems with limited virtual address space
failures resulted due to exhausting the address space.  The tests
really need to exercise exhausting all memory on the system, thus
we need to use the physical address space.

4) Change kmem:slab_lock to cap its memory usage at availrmem
instead of using the native linux nr_free_pages().  This provides
additional test coverage of the SPL Linux VM integration.

5) Change kmem:slab_overcommit to perform allocation of 256K
instead of 1M.  On x86 based systems it is not possible to create
a kmem backed slab with entries of that size.  To compensate for
this the number of allocations performed is increased by 4x.

6) Additional autoconf documentation for proposed upstream API
changes to make additional symbols available to modules.

7) Console error messages added when spl_kallsyms_lookup_name()
fails to locate an expected symbol.  This causes the module to fail
to load and we need to know exactly which symbol was not available.
2009-03-17 12:16:31 -07:00
Brian Behlendorf 7257ec4185 Fix taskq_wait() not waiting bug
I'm very surprised this has not surfaced until now.  But the taskq_wait()
implementation would only wait successfully the first time it was called.
Subsequent usage of taskq_wait() on the taskq would not wait.

The issue was caused by tq->tq_lowest_id being set to MAX_INT after the
first wait completed.  This caused subsequent waits which check that the
waiting id is less than the lowest taskq id to always succeed.  The fix
is to ensure that tq->tq_lowest_id is never set larger than tq->tq_next_id.

Additional fixes which were added to this patch include:
1) Fix a race by placing the taskq_wait_check() in the tq->tq_lock spinlock.
2) taskq_wait() should wait for the largest outstanding id.
3) Multiple spelling corrections.
4) Added taskq wait regression test to validate correct behavior.
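
The core of the fix can be modelled with a small, single-threaded
sketch.  The demo_* names and the flat array of pending ids are
assumptions for illustration.  The real code walks its own lists under
tq->tq_lock.

```c
#include <stddef.h>

typedef struct demo_taskq {
    int tq_next_id;     /* id the next dispatch will receive */
    int tq_lowest_id;   /* lowest id which has not completed */
} demo_taskq_t;

/*
 * Recompute tq_lowest_id from the ids still outstanding.  With nothing
 * outstanding it falls back to tq_next_id, so it can never run ahead
 * of ids which have not been handed out yet.
 */
static void
demo_update_lowest(demo_taskq_t *tq, const int *pending, size_t n)
{
    int lowest = tq->tq_next_id;
    size_t i;

    for (i = 0; i < n; i++) {
        if (pending[i] < lowest)
            lowest = pending[i];
    }

    tq->tq_lowest_id = lowest;
}

/* A waiter on 'id' may proceed once every id up to it has completed. */
static int
demo_wait_done(const demo_taskq_t *tq, int id)
{
    return (id < tq->tq_lowest_id);
}
```

Had the empty case stored the largest possible integer instead, a later
wait on a freshly dispatched id would have passed this check at once,
which is the behavior described above.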
2009-03-15 15:13:49 -07:00
Brian Behlendorf 5b5f568503 Mutex tests updated to use task queues instead of work queues.
Mainly for portability reasons I have rebased the mutex tests on Solaris
taskqs instead of linux work queues.  The linux workqueue API changed post
2.6.18 kernels and using task queues avoids having to conditionally detect
which workqueue API to use.

Additionally, this is basically free additional testing for the task queues.
Much to my surprise after updating these test cases they did expose a long
standing bug in the taskq_wait() implementation.  This patch does not
address that issue but the followup patch does.
2009-03-15 15:05:38 -07:00
Brian Behlendorf 8123ac4f0d Added SPL_AC_5ARGS_DEVICE_CREATE autoconf configure check
As of 2.6.27 kernels the device_create() API changed to include
a private data argument.  This check detects which version of
device_create() function the kernel has and properly defines
spl_device_create() to use the correct prototype.
2009-03-13 13:38:43 -07:00
Ricardo M. Correia a0b5ae8aca Fix off-by-1 truncation of hw_serial when converting from integer to string, when writing to /proc/sys/kernel/spl/spl_hostid.
Fixes hostid mismatch which leads to assertion failure when the hostid/hw_serial is a 10-character decimal number:

$ zpool status
  pool: lustre
 state: ONLINE
lt-zpool: zpool_main.c:3176: status_callback: Assertion `reason == ZPOOL_STATUS_OK' failed.
zsh: 5262 abort      zpool status
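
A minimal user-space sketch of this kind of off-by-1 truncation, assuming the conversion goes through snprintf() into a fixed buffer.  The HOSTID_STR_LEN name and the hostid_to_str() helper are hypothetical and not taken from the patch.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical: 10 decimal digits plus the terminating NUL. */
#define	HOSTID_STR_LEN	11

/*
 * Convert a hostid to its decimal string form.  snprintf() writes at
 * most len - 1 characters, so a buffer one byte too small silently
 * drops the last digit of a 10-digit value.
 */
static size_t
hostid_to_str(unsigned long hostid, char *buf, size_t len)
{
	(void) snprintf(buf, len, "%lu", hostid);
	return (strlen(buf));
}
```

Calling the helper with a length of HOSTID_STR_LEN - 1 for a 10-digit hostid yields only 9 characters, which is the sort of mismatch that would make a later hostid comparison fail.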
2009-03-12 15:47:50 -07:00
Ricardo M. Correia 6c33eb8162 Minor bug fix in XDR code introduced in last minute change before landing.
1) Removed xdr_bytesrec typedef which has no consumers.  If we re-add
   it, it should also probably be xdr_bytesrec_t.
2009-03-11 16:27:35 -07:00
Ricardo M. Correia f48b61938a Add XDR implementation
Added proper XDR implementation (Lustre bug 17662), needed for on-disk
compatibility between platforms of different endianness.
2009-03-11 13:00:26 -07:00
Brian Behlendorf 0c617c9a63 Build system cleanup
1) Undefine non-unique entries in spl_config.h
2) Minor Makefile cleanup
3) Don't use includedir for proper kernel header install
2009-03-11 12:37:34 -07:00
Brian Behlendorf d4326403de Build System Default Kernel
Update the method used for determining which kernel to build against
when not specified.  Previously 'uname -r' was used, but this makes the
assumption that the running kernel is the one you want to use, which is
often not the case.  It is better to examine the usual kernel-devel
install locations and select one of the installed kernels.
2009-03-09 16:57:10 -07:00
Brian Behlendorf c5f704607b Build system and packaging (RPM support)
An update to the build system to properly support all commonly
used Makefile targets these include:

  make all        # Build everything
  make install    # Install everything
  make clean      # Clean up build products
  make distclean  # Clean up everything
  make dist       # Create package tarball
  make srpm       # Create package source RPM
  make rpm        # Create package binary RPMs
  make tags       # Create ctags and etags for everything

Extra care was taken to ensure that the source RPMs are fully
rebuildable against Fedora/RHEL/Chaos kernels.  To build binary
RPMs from the source RPM for your system simply run:

  rpmbuild --rebuild spl-x.y.z-1.src.rpm

This will produce two binary RPMs with correct 'requires'
dependencies for your kernel.  One will contain all spl modules
and support utilities, the other is a devel package for compiling
additional kernel modules which are dependent on the spl.

  spl-x.y.z-1_<kernel version>.x86_64.rpm
  spl-devel-x.y.z-1_<kernel version>.x86_64.rpm
2009-03-09 15:56:55 -07:00
Ricardo M. Correia 32f74c5280 XXX: Temporarily disable vmem_size(). 2009-03-05 10:13:59 -08:00
Brian Behlendorf 04fa349d69 Merge branch 'kallsyms' 2009-03-04 10:19:41 -08:00
Brian Behlendorf d1ff2312b0 Linux VM Integration Cleanup
Remove all instances of functions being reimplemented in the SPL.
When the prototypes are available in the linux headers but the
function address itself is not exported use kallsyms_lookup_name()
to find the address.  The function name itself can then become a
define which calls a function pointer.  This is preferable to
reimplementing the function in the SPL because it ensures we get
the correct version of the function for the running kernel.  This
is actually pretty safe because the prototype is defined in the
headers so we know we are calling the function properly.
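
The pattern described above, a define which calls a function pointer resolved by name, can be sketched in user space as follows.  Everything here is hypothetical: lookup_symbol_sketch() merely stands in for kallsyms_lookup_name(), and get_count() is an invented unexported function.

```c
#include <stddef.h>
#include <string.h>

/* Prototype as the headers would declare it. */
typedef unsigned long (*get_count_fn_t)(int);
typedef void (*generic_fn_t)(void);

/* The "unexported" function we cannot link against directly. */
static unsigned long
hidden_get_count(int zone)
{
	return (100UL + (unsigned long)zone);
}

/* Hypothetical stand-in for a by-name symbol lookup. */
static generic_fn_t
lookup_symbol_sketch(const char *name)
{
	if (strcmp(name, "get_count") == 0)
		return ((generic_fn_t)hidden_get_count);
	return (NULL);
}

/* The function name becomes a define which calls a function pointer. */
static get_count_fn_t get_count_fn;
#define	get_count(zone)	get_count_fn(zone)

/* Resolve once at load time; fail loudly if the symbol is missing. */
static int
resolve_symbols_sketch(void)
{
	get_count_fn = (get_count_fn_t)lookup_symbol_sketch("get_count");
	return (get_count_fn != NULL ? 0 : -1);
}
```

Callers keep using get_count() unchanged, and a failed lookup can be reported at initialization rather than surfacing later as a NULL call.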

This patch also includes a rhel5 kernel patch which exports the needed
symbols so we don't need to use kallsyms_lookup_name().  There are
autoconf checks to detect if the symbol is exported and if so to
use it directly.  We should add patches for stock upstream kernels
as needed if for no other reason than so we can easily track which
additional symbols we needed exported.  Those patches can also be
used by anyone willing to rebuild their kernel, but this should
not be a requirement.  The rhel5 version of the export-symbols
patch has been applied to the chaos kernel.

Additional fixes:
1) Implement vmem_size() function using get_vmalloc_info()
2) SPL_CHECK_SYMBOL_EXPORT macro updated to use $LINUX_OBJ instead
   of $LINUX because Module.symvers is a build product.  When
   $LINUX_OBJ != $LINUX we will not properly detect exported symbols.
3) SPL_LINUX_COMPILE_IFELSE macro updated to add include2 and
   $LINUX/include search paths to allow proper compilation when
   the kernel target build directory is not the source directory.
2009-03-04 10:04:15 -08:00
Ricardo M. Correia eb7c7f44e8 Changed ptob()/btop() mult/div into bit shifts.
Added necessary include for PAGE_SHIFT.
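
A sketch of what shift-based conversions look like, assuming 4 KiB pages.  The *_SKETCH names are hypothetical; the real macros take PAGE_SHIFT from the kernel headers.

```c
#include <stddef.h>

/* Assumption for illustration: 4 KiB pages, i.e. a shift of 12. */
#define	PAGE_SHIFT_SKETCH	12

/* pages -> bytes: multiply by the page size via a left shift */
#define	ptob_sketch(pages)	((size_t)(pages) << PAGE_SHIFT_SKETCH)

/* bytes -> pages: divide by the page size via a right shift (rounds down) */
#define	btop_sketch(bytes)	((size_t)(bytes) >> PAGE_SHIFT_SKETCH)
```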
2009-02-25 15:50:58 -08:00
Ricardo M. Correia 7819a92a9b Added btop() and moved ptob() to include/sys/param.h. 2009-02-25 15:50:50 -08:00
Ricardo M. Correia 4327ac3ff9 Changed z_compress_level() and z_uncompress() prototypes to match the ones in Solaris.
Fixes compilation warning.
2009-02-23 11:45:59 -08:00
Brian Behlendorf a1cf80b493 Matching kmem_free() fix for use after free case.
See commit bb01879ebe for a full
description.  This issue should have been addressed in the same
commit but it slipped my mind.
2009-02-19 12:28:10 -08:00
Brian Behlendorf 99639e4a13 Add zone_get_hostid() function
Minimal support added for the zone_get_hostid() function.  Only
global zones are supported therefore this function must be called
with a NULL argument.  Additionally, I've added the HW_HOSTID_LEN
define and updated all instances where a hard coded magic value
of 11 was used; "A good riddance of bad rubbish!"
2009-02-19 11:26:17 -08:00
Brian Behlendorf 63a93055fb Coverity 9657: Resource Leak
Accidentally leaked list item li in error path.  The fix is to
adjust this error path to ensure the allocated list item which
has not yet been added to the list gets freed.  To do this we
simply add a new goto label slightly earlier to use the existing
cleanup logic and minimize the number of unique return points.
2009-02-18 10:16:26 -08:00
Brian Behlendorf 02c7f16494 Coverity 9656: Forward NULL
This was a false positive; the callpath being walked is impossible
because the splat_kmem_cache_test_kcp_alloc() function will ensure
kcp->kcp_kcd[0] is initialized to NULL.  However, there is no harm
in making this explicit for the test case so I'm adding a line to
clearly set it to correct the analysis.
2009-02-18 10:09:01 -08:00
Brian Behlendorf 1315c88437 Coverity 9649, 9650, 9651: Uninit
This check was originally added to detect double initializations
of mutex types (which it did find).  Unfortunately, Coverity is
right that there is a very small chance we could trigger the
assertion by accident because an uninitialized stack variable
happens to contain the mutex magic.  This is particularly unlikely
since we do poison the mutexes when destroyed but still possible.
Therefore I'm simply removing the assertion.
2009-02-18 09:48:07 -08:00
Brian Behlendorf bb01879ebe Coverity 9654, 9654: Use After Free
Because vmem_free() was implemented as a macro using the ','
operator to evaluate both arguments and we performed the free
before evaluating size we would dereference the freed pointer.
To resolve the problem we just invert the ordering and evaluate
size first just as if it was evaluated by the caller when being
passed to this function.  This ensures that if the caller is
doing something reckless like performing an assignment as
part of the size argument we still perform it and it simply
doesn't get removed by the macro.  Of course nobody should
be doing this sort of thing, but just in case.
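
A user-space sketch of the ordering issue, under the assumption that the macro adjusts an accounting counter by size.  The bytes_tracked counter, buf_sketch_t, and vmem_free_sketch() are hypothetical names and not the SPL macro itself.

```c
#include <stdlib.h>

/* Hypothetical accounting counter for outstanding bytes. */
static size_t bytes_tracked;

typedef struct buf_sketch {
	size_t len;
	char data[32];
} buf_sketch_t;

static buf_sketch_t *
buf_alloc_sketch(void)
{
	buf_sketch_t *b = malloc(sizeof (*b));

	if (b != NULL) {
		b->len = sizeof (*b);
		bytes_tracked += b->len;
	}
	return (b);
}

/*
 * Problematic ordering (shown only as a comment):
 *     (free(ptr), bytes_tracked -= (size))
 * If 'size' is an expression such as ptr->len it is read after the free.
 *
 * Safe ordering: evaluate 'size' first, then free the memory.
 */
#define	vmem_free_sketch(ptr, size) \
	(bytes_tracked -= (size), free(ptr))
```

A call such as vmem_free_sketch(b, b->len) is then well defined because b->len is read before the memory is released.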
2009-02-17 16:51:19 -08:00
Brian Behlendorf 15dc8b072e Coverity 9652, 9653: No Effect
Removed 2 ASSERT()s which had no effect because by definition
size_t is always an unsigned type thus is always >= 0.
2009-02-17 16:30:58 -08:00
Brian Behlendorf 014b1d6f54 Coverity 9641: Buffer Size
When SPLAT_TEST_INIT() initialized SPLAT_KMEM_TEST11_NAME the
short test name overran the static length buffer of SPLAT_NAME_SIZE.
This was fixed by increasing the buffer length from 16 to 20 bytes.
2009-02-17 16:24:26 -08:00
Brian Behlendorf 9b1b8e4c24 kmem slab magazine ageing deadlock
- The previous magazine ageing scheme relied on the on_each_cpu()
  function to call spl_magazine_age() on each cpu.  It turns out
  this could deadlock with do_flush_tlb_all() which also relies
  on the IPI based on_each_cpu().  To avoid this problem a
  per-magazine delayed work item is created and independently
  scheduled to the correct cpu removing the need for on_each_cpu().
- Additionally two unused fields were removed from the type
  spl_kmem_cache_t, they were holdovers from previous cleanup.
    - struct work_struct work
    - struct timer_list timer
2009-02-17 15:52:18 -08:00
Brian Behlendorf 1a944a7d0b kmem slab fixes
- spl_slab_reclaim() 'continue' changed back to 'break' from commit
  37db7d8cf9.  The original was correct,
  I have added a comment to ensure this does not happen again.
- spl_slab_reclaim() further optimized by moving the destructor call
  in spl_slab_free() outside the skc->skc_lock.  This minimizes the
  length of time the spin lock is held, allows the destructors to
  be invoked concurrently for different objects, and as a bonus makes
  it safe (although unwise) to sleep in the destructors.
2009-02-13 10:28:55 -08:00
Brian Behlendorf fce5ef8306 Build system update
- Added default build flags for kernel modules:
  -Wstrict-prototypes -Werror
2009-02-12 15:04:36 -08:00
Brian Behlendorf f6c5d4ff88 Build system update
- Added default build flags:
  -Wall -Wstrict-prototypes -Werror -Wshadow
- Added missing Makefile's for include/ subdirectories.
2009-02-12 14:45:22 -08:00
Brian Behlendorf 37db7d8cf9 kmem slab fixes
- Default SPL_KMEM_CACHE_DELAY changed to 15 to match Solaris.
- Aged out slab checking occurs every SPL_KMEM_CACHE_DELAY / 3.
- skc->skc_reap tunable added which allows callers of
  spl_slab_reclaim() to cap the number of slabs reclaimed.
  On Solaris all eligible slabs are always reclaimed, and this
  is still the default behavior.  However, I suspect that is
  not always wise for reasons such as in the next comment.
- spl_slab_reclaim() added cond_resched() while walking the
  slab/object free lists.  Soft lockups were observed when
  freeing large numbers of vmalloc'd slabs/objects.
- spl_slab_reclaim() 'sks->sks_ref > 0' check changes from
  incorrect 'break' to 'continue' to ensure all slabs are
  checked.
- spl_cache_age() reworked to avoid a deadlock with
  do_flush_tlb_all() which occurred because we slept waiting
  for completion in spl_cache_age().  Waiting for magazine
  reclamation to finish is not required so we no longer wait.
- spl_magazine_create() and spl_magazine_destroy() shifted
  back to using for_each_online_cpu() instead of the
  spl_on_each_cpu() approach which was of course a bad idea
  due to memory allocations which Ricardo pointed out.
2009-02-12 13:32:10 -08:00
Ricardo M. Correia f500ccff35 Minor bug fix due to MAXOFFSET_T constant being too large on 32-bit systems. 2009-02-07 00:53:39 +00:00
Brian Behlendorf e50ad76da5 Prep for 0.4.2 tag 2009-02-05 13:43:45 -08:00
Brian Behlendorf 4ab13d3b5c Additional Linux VM integration
Added support for Solaris swapfs_minfree, and swapfs_reserve tunables.
In addition availrmem is now available and returns a value
which is reasonably analogous to the Solaris meaning.  On linux we
return the sum of free and inactive pages since these are all easily
reclaimable.

All tunables are available in /proc/sys/kernel/spl/vm/* and they may
need a little adjusting once we observe the real behavior.  Some of
the defaults are mapped to similar linux counterparts, others are
straight from the OpenSolaris defaults.
2009-02-05 12:26:34 -08:00
Brian Behlendorf 36b313dacf Linux VM integration / device special files
Support added to provide reasonable values for the global Solaris
VM variables: minfree, desfree, lotsfree, needfree.  These values
are set to the sum of their per-zone linux counterparts which
should be close enough for Solaris consumers.

When a non-GPL app links against the SPL we cannot use the udev
interfaces, which means none of the device special files are created.
Because of this I had added a poor man's udev which causes the SPL
to invoke an upcall and create the basic devices when a minor
is registered.  When a minor is unregistered we use the vnode
interface to unlink the special file.
2009-02-04 15:15:41 -08:00
Brian Behlendorf 31a033ecd4 2.6.27+ portability changes
- Added SPL_AC_3ARGS_ON_EACH_CPU configure check to determine
  if the older 4 argument version of on_each_cpu() should be
  used or the new 3 argument version.  The retry argument was
  dropped in the new API which was never used anyway.
- Updated work queue compatibility wrappers.  The old way this
  worked was to pass a data pointer when initializing the workqueue.
  The new API assumes the work item is embedded in a structure
  and we use container_of() to find that data pointer.
- Updated skc->skc_flags to be an unsigned long which is now
  type checked in the bit operations.  This silences the warnings.
- Updated autogen products and splat tests accordingly
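
The embedded work item approach mentioned in the work queue bullet can be sketched in portable C.  This is an illustration only: container_of_sketch(), struct work_sketch and struct magazine_sketch are hypothetical stand-ins for the kernel's container_of() and work structures.

```c
#include <stddef.h>

/* Recover the enclosing structure from a pointer to one of its members. */
#define	container_of_sketch(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical work item: the handler only receives the item itself. */
struct work_sketch {
	void (*func)(struct work_sketch *);
};

/* The private data lives around the embedded work item. */
struct magazine_sketch {
	int cpu;
	struct work_sketch work;
	int aged;
};

static void
magazine_age_sketch(struct work_sketch *w)
{
	struct magazine_sketch *m =
	    container_of_sketch(w, struct magazine_sketch, work);

	m->aged++;
}
```

No separate data pointer needs to be stored at initialization time; the handler derives it from the address of the work item it was given.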
2009-02-02 15:12:30 -08:00
Brian Behlendorf f220894e1f Make the number of system taskq threads based on the number of cores in the node, as is done for most linux system tasks 2009-02-02 08:53:53 -08:00
Brian Behlendorf 10a4be0f03 Update thread tests to have max_time 2009-01-30 21:24:42 -08:00
Brian Behlendorf 416bae036b Add new workqueue header 2009-01-30 21:11:42 -08:00
Brian Behlendorf ea3e6ca9e5 kmem_cache hardening and performance improvements
- Added slab work queue task which gradually ages and free's slabs
  from the cache which have not been used recently.
- Optimized slab packing algorithm to ensure each slab contains the
  maximum number of objects without creating too large a slab.
- Fix deadlock, we can never call kv_free() under the skc_lock.  We
  now unlink the objects and slabs from the cache itself and attach
  them to a private work list.  The contents of the list are then
  subsequently freed outside the spin lock.
- Move magazine create/destroy operation on to local cpu.
- Further performance optimizations by minimizing the usage of the large
  per-cache skc_lock.  This includes the addition of KMC_BIT_REAPING
  bit mask which is used to prevent concurrent reaping, and to defer
  new slab creation when reaping is occurring.
- Add KMC_BIT_DESTROYING bit mask which is set when the cache is being
  destroyed, this is used to catch any task accessing the cache while
  it is being destroyed.
- Add comments to all the functions and additional comments to try
  and make everything as clear as possible.
- Major cleanup and additions to the SPLAT kmem tests to more
  rigorously stress the cache implementation and look for any problems.
  This includes correctness and performance tests.
- Updated portable work queue interfaces
2009-01-30 20:54:49 -08:00
Brian Behlendorf 34e71c9e97 Remove debug check which was accidentally left in place and prevented the slab cache from working on systems with >4 cores 2009-01-26 20:10:23 -08:00
Brian Behlendorf 0f233eac33 Pull the blkdev header in to the sunldi for some useful structure definitions and helper functions 2009-01-26 16:47:49 -08:00
Brian Behlendorf 48e0606a52 Implement kmem cache alignment argument 2009-01-26 09:02:04 -08:00
Brian Behlendorf e4f3ea278e Remove stray ` from macro 2009-01-23 08:59:11 -08:00
Brian Behlendorf 3f4126739d Sleep uninterruptibly, waking up early may result in a crash 2009-01-22 09:58:48 -08:00
Brian Behlendorf 511176398c Update debug.h to standardize VERIFY3_IMPL error messages in debug and non-debug mode 2009-01-22 09:41:47 -08:00
Brian Behlendorf 064bbffb63 Prep for 0.4.1 tag 2009-01-21 11:46:02 -08:00
Brian Behlendorf b6b2acc66e Minor fix for compiler warning when KMEM_TRACKING is enabled 2009-01-20 13:39:35 -08:00
Brian Behlendorf ae3b87f908 KMEM_TRACKING turned up a missing free in list test 6, fix the leak 2009-01-20 12:47:53 -08:00
Brian Behlendorf 15270e003e Ensure -NDEBUG does not get added to spl_config.h and is only set in the build options. This allows other kernel modules to use spl_config to leverage the rest of the config checks without getting confused with the debug options 2009-01-20 11:59:47 -08:00
Brian Behlendorf 5566ec0959 Refresh libtool 2009-01-15 10:47:24 -08:00
Brian Behlendorf 617d5a673c Rename modules to module and update references 2009-01-15 10:44:54 -08:00
Brian Behlendorf f6a19c0d37 Make the splat load message caps just for consistency 2009-01-13 11:45:02 -08:00
Brian Behlendorf b172b6dfde TASKQ_DYNAMIC not yet supported, do not create the global taskq with that flag or we crash with debug enabled. Also don't bother dumping debug when debugging is disabled, it's pointless 2009-01-13 11:43:05 -08:00
Brian Behlendorf b871b8cdef Rework ddi_strtox calls to a native implementation which actually supports the EINVAL, ERANGE error handling, plus add a regression suite to ensure I got it at least mostly right 2009-01-13 09:30:59 -08:00
Brian Behlendorf 1e4ed6c990 Add missing stub headers 2009-01-09 16:04:44 -08:00
Brian Behlendorf 121d48c97d Add basic ksid_lookupdomain and ksiddomain_rele support, just allocations 2009-01-09 15:30:53 -08:00
Brian Behlendorf f590d7d374 Make sure we export ddi_quiesce_not_needed 2009-01-09 14:30:30 -08:00
Brian Behlendorf 0e41414946 Add two new stub headers 2009-01-09 14:04:13 -08:00
Brian Behlendorf 97735c39e3 Add VOP_SEEK 2009-01-09 13:59:39 -08:00
Brian Behlendorf d83ba26e18 Add missing policy includes, add missing sun ddi bits 2009-01-09 10:49:47 -08:00
Brian Behlendorf 70997fb4b1 Add share.h stub 2009-01-09 10:06:18 -08:00
Brian Behlendorf 71c8ab9c68 Drat fix missing ; 2009-01-09 10:05:03 -08:00
Brian Behlendorf 23f5c4c281 Add missing callback_context_t and fid_t types 2009-01-09 10:03:37 -08:00
Brian Behlendorf 703e7a3cf4 Add stubs for three more includes 2009-01-09 09:47:27 -08:00
Brian Behlendorf 434d1d0f8f Add active test for splat list tests 2009-01-07 13:48:36 -08:00
Brian Behlendorf d702c04ff1 Add 5 splat tests for list handling 2009-01-07 12:54:03 -08:00
Brian Behlendorf 4c18c39ecb Add include/sys/compress.h header 2009-01-06 09:47:00 -08:00
Brian Behlendorf 160c63ab76 Add P2BOUNDARY macro 2009-01-06 09:23:13 -08:00
Brian Behlendorf 7adbea4141 Pull in some default page typedefs 2009-01-05 16:14:38 -08:00
Brian Behlendorf 0f37204417 Add DTRACE_PROBE(a) 2009-01-05 16:09:21 -08:00
Brian Behlendorf b53c565e65 Stub u8_textprep.h for inclusion purposes 2009-01-05 15:37:07 -08:00
Brian Behlendorf e9cb2b4f64 Add system taskq support 2009-01-05 15:08:03 -08:00
Brian Behlendorf 8a2b328b18 Remove u8_textprep, we will not be implementing this nightmare yet 2009-01-05 11:32:08 -08:00
Brian Behlendorf f3fc90c249 Include the header 2008-12-23 16:48:15 -08:00
Brian Behlendorf 925ca8cc01 Add sys/thread.h 2008-12-23 16:27:36 -08:00
Brian Behlendorf bb9cfc6cc3 Define needfree 2008-12-23 15:59:36 -08:00
Brian Behlendorf 2b88beb74f Add timer.h header 2008-12-23 15:40:20 -08:00
Brian Behlendorf bbdec3be06 Add u8 stub 2008-12-23 15:38:15 -08:00
Brian Behlendorf de79fdd3a8 Move sunddi include 2008-12-23 13:32:07 -08:00
Brian Behlendorf 9d457afd1b Add sunddi to uio 2008-12-23 13:30:04 -08:00
Brian Behlendorf dc0f920710 Minor updates 2008-12-23 13:25:52 -08:00
Brian Behlendorf 926e2b6058 Pull in lock types 2008-12-23 13:18:39 -08:00
Brian Behlendorf c1d42c2f1d Add header 2008-12-23 13:05:50 -08:00
Brian Behlendorf f5b92a66ad Add a few more missing header which the upstream stock kernel context expects 2008-12-23 13:03:09 -08:00
Brian Behlendorf 2ee63a549a Add struct ddi_strtox functions 2008-12-05 16:23:57 -08:00
Brian Behlendorf 857c63f413 Refresh 2008-12-05 16:20:09 -08:00
Brian Behlendorf 72e7de6026 Prefix META_ALIAS with SPL_ 2008-11-26 13:26:05 -08:00
Brian Behlendorf abc3ca149d Prefix all META_* #defines with SPL to prevent collisions in packages which include our spl_config.h. Dependent packages may do this to leverage the autoconf checks we have already run against the kernel. 2008-11-26 13:09:37 -08:00
Brian Behlendorf 895ff83c08 Rebase on Git from SVN as of version 0.4.0. 2008-11-26 09:57:29 -08:00
behlendo 7860010a72 Tag spl-0.3.5
git-svn-id: https://outreach.scidac.gov/svn/spl/tags/spl-0.3.5@184 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-26 17:44:40 +00:00
behlendo 02a794ae3f Add libtool script
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@183 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-26 17:43:44 +00:00
behlendo 7212e2cd27 Add missing autogen products
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@182 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-26 17:07:59 +00:00
behlendo dd529a30ac Include META file support.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@181 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-26 17:01:45 +00:00
behlendo bf9f3bac95 * : Add autogen.sh products.
* configure.ac : Use AC_CONFIG_AUX_DIR to put autoconf products
in ./autoconf.

* autogen.sh : Use --copy to avoid symlinks, remove error
redirection, run aclocal before libtoolize.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@180 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-24 23:49:18 +00:00
behlendo 6a1c3d418a * include/sys/sunddi.h, modules/spl/spl-module.c : Removed default
udev support from sunddi implementation because it uses GPL-only
symbols.  This support is optionally available for SPL consumers
if they define HAVE_GPL_ONLY_SYMBOLS and license their module as
GPL using the MODULE_LICENSE("GPL") macro.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@179 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-13 21:43:30 +00:00
behlendo 5457aee1a3 Prep for spl-0.3.4 tag.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@177 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-06 00:51:31 +00:00
behlendo 83e5edb47d Remove 3 instances of unused variables.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@176 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-05 22:06:56 +00:00
behlendo 7ea1cbf5b2 Add proper error handling for the case where a thread can not be
created.  Instead of asserting we simply abort the test, wait for
any tasks we created to finish, and return -ESRCH back to the user
space component.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@175 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-05 21:43:37 +00:00
behlendo 36833ea4e4 Slightly increase SPLAT_NAME_SIZE to ensure string is always
NULL terminated.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@174 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-05 21:27:31 +00:00
behlendo ac569b72a1 Fix a small corner case in the test infrastructure where
we might end up with a non-NULL terminated buffer if the 
test name or desc is too long.  Only copy N-1 bytes.
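
A minimal sketch of the N-1 copy described above, assuming a fixed-size name buffer.  NAME_SIZE_SKETCH and copy_name_sketch() are hypothetical; the explicit terminator is added here so the sketch does not rely on the buffer having been zeroed beforehand.

```c
#include <string.h>

/* Hypothetical fixed buffer size for a test name. */
#define	NAME_SIZE_SKETCH	16

/*
 * Copy at most N-1 bytes so the last byte of the buffer is always
 * available for the terminating NUL, even when 'src' is too long.
 */
static void
copy_name_sketch(char *dst, const char *src)
{
	strncpy(dst, src, NAME_SIZE_SKETCH - 1);
	dst[NAME_SIZE_SKETCH - 1] = '\0';
}
```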


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@173 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-04 23:38:29 +00:00
behlendo 12018327f5 3 minor fixups where sprintf() was used instead of snprintf() with
a known max length.  Additionally the function return value is cast
to void to make it explicit that the value is not needed.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@172 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-04 23:30:15 +00:00
behlendo 0498e6c585 Removed useless check
Fix forward NULL in splat kmem_cache test ctors/dtors



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@171 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-04 23:18:31 +00:00
behlendo 3bc9d50eaa Add missing error handling to this case where a memory allocation fails.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@170 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-04 22:51:31 +00:00
behlendo 8e80a04c04 Simply ignore the return type which was never used here and cast it to void.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@169 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-04 22:42:58 +00:00
behlendo 55c59e6153 Add proper error handling to one of the atomic test cases in the event
that a kernel thread cannot be properly spawned.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@168 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-04 22:39:29 +00:00
behlendo b02e9b2415 Add missing initializer which is needed in an unlikely error case.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@167 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-04 22:24:55 +00:00
behlendo b07335c1a7 Ensure GPL-only symbols are re-exported as GPL-only
Remove GPL-only symbol from __gethrtime()



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@166 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-04 00:00:16 +00:00
behlendo c8e60837b7 * spl-09-fix-kmem-track-oops.patch
This fixes an oops when unloading the modules, in the case where memory
tracking was enabled and there were memory leaks. The comment in the
code explains what was the problem.

* spl-10-fix-assert-verify-ndebug.patch
This fixes ASSERT*() and VERIFY*() macros in non-debug builds. VERIFY*()
macros are supposed to check the condition and panic even in production
builds, and ASSERT*() macros don't need to evaluate the arguments.
Also some 32-bit fixes.
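
The intended split between the two macro families can be sketched as below.  This is a user-space illustration with hypothetical names (VERIFY_SKETCH, ASSERT_SKETCH, DEBUG_SKETCH); the real macros panic rather than call abort().

```c
#include <stdio.h>
#include <stdlib.h>

/* Always checks the condition, in debug and production builds alike. */
#define	VERIFY_SKETCH(cond)						\
	do {								\
		if (!(cond)) {						\
			fprintf(stderr, "VERIFY(%s) failed\n", #cond);	\
			abort();					\
		}							\
	} while (0)

/* Only active in debug builds; otherwise the argument is not evaluated. */
#ifdef DEBUG_SKETCH
#define	ASSERT_SKETCH(cond)	VERIFY_SKETCH(cond)
#else
#define	ASSERT_SKETCH(cond)	((void)0)
#endif
```

One consequence worth noting: an expression with side effects inside ASSERT_SKETCH() disappears entirely in non-debug builds, while the same expression inside VERIFY_SKETCH() always runs.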



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@165 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-03 22:02:15 +00:00
behlendo c22e7a427b Under Solaris KM_SLEEP ensures success (or at least you hang forever).
That said when working with a finite resource like memory failure really
is always a possibility.  It would be far better longer term if the ZFS
code could be weaned off this assumption and properly handle the cases
where an allocation fails.  Still I've applied the patch to spl-0.3.4
since this layer is supposed to emulate Solaris as closely as possible.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@164 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-03 21:51:33 +00:00
behlendo a0f6da3d95 Add a SPL_AC_TYPE_ATOMIC64_T test to configure for systems which
already support atomic64_t types.

* spl-07-kmem-cleanup.patch
This moves all the debugging code from sys/kmem.h to spl-kmem.c, because
the huge macros were hard to debug and were bloating functions that
allocated memory. I also fixed some other minor problems, including
32-bit fixes and a reported memory leak which was just due to using the
wrong free function.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@163 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-03 21:06:04 +00:00
behlendo 550f170525 Apply two nice improvements caught by Ricardo,
spl-05-div64.patch
This is a much less intrusive fix for undefined 64-bit division symbols
when compiling the DMU in 32-bit kernels.

* spl-06-atomic64.patch
This is a workaround for 32-bit kernels that don't have atomic64_t.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@162 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-03 20:34:17 +00:00
behlendo 749045bbfa Apply a nice fix caught by Ricardo,
* spl-04-fix-taskq-spinlock-lockup.patch
Fixes a deadlock in the BIO completion handler, due to the taskq code
prematurely re-enabling interrupts when another spinlock had disabled
them in the IDE IRQ handler.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@161 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-03 20:21:08 +00:00
behlendo f6c81c5ea7 Reviewed and applied spl-01-rm-gpl-symbol-set_cpus_allowed.patch
from Ricardo which removes a dependency on the GPL-only symbol
set_cpus_allowed().  Using this symbol is simpler but in the name
of portability we are adopting a spinlock based solution here
to remove this dependency.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@160 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-03 20:07:20 +00:00
behlendo d50bd9e221 Reviewed and applied spl-00-rm-gpl-symbol-notifier_chain.patch
from Ricardo which removes a dependency on the GPL-only symbol
needed for a panic time notifier.  This functionality was never
used and this improves our portability.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@159 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-11-03 19:53:23 +00:00
behlendo e73187714d Minor tweak to handle systems with restrictive udev rules
or older systems which are not using udev at all.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@158 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-08-12 05:18:41 +00:00
behlendo 25557fd884 Sigh more compat fixes, this is almost right for 2.6.9 - 2.6.26 kernels.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@157 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-08-11 23:47:44 +00:00
behlendo b61a6e8bdc Pull in initial 32-bit support patches.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@156 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-08-11 22:42:04 +00:00
behlendo 3d061e9d10 Commit bulk of remaining 2.6.9 and 2.6.26 compat changes.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@155 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-08-11 22:13:47 +00:00
behlendo 322640b7b5 Include linux/uaccess.h compat changes.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@154 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-08-11 19:10:14 +00:00
behlendo 86de8532a9 More 2.6.26 compat changes
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@153 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-08-11 17:56:40 +00:00
behlendo 6a6cafbe8d Pull in timespec, list, and type compat changes to support
building against a wider range of kernels.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@152 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-08-11 17:20:11 +00:00
behlendo 86149aa255 Resolve incomplete type when building against 2.6.26
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@151 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-08-11 16:11:37 +00:00
behlendo 46c685d0c4 Add class / device portability code. Two autoconf tests
were added to cover the 3 possible APIs from 2.6.9 to
2.6.26.  We attempt to use the newest interfaces and if
not available fallback to the oldest.  This is a rework of
some changes proposed by Ricardo for RHEL4.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@150 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-08-10 03:50:36 +00:00
behlendo 877a32e91e Pull in fls64 compat changes from spl-00-rhel4-compat.patch,
to allow greater compatibility with kernels pre 2.6.16.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@149 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-08-06 04:52:39 +00:00
behlendo 7afde631f6 Start bringing in Ricardo's spl-00-rhel4-compat.patch, a few chunks
at a time as I audit it.  This chunk finishes moving the SPL entirely
off the linux slab on to the SPL implementation.  It differs slightly
from the proposed version in that the spl continues to export to
all the Solaris types and functions.  These do conflict with the
Linux slab so a module using these interfaces must not include the
SPL slab if they also intend to use the linux slab.  Or they must
explicitly #undef the macros which remap the functions to their
spl_* equivalents.

A nice side effect of dropping the entire linux slab is we
don't need the autoconf checks anymore.  They kept messing with
the slab API endlessly!



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@148 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-08-05 04:16:09 +00:00
behlendo 73035a29eb Apply Ricardo's spl-02-condvar-poison.patch
Fix too early memory poisoning on condvars.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@147 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-08-04 23:59:08 +00:00
behlendo 5587df4d8e Trivial commit to remove whitespace
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@146 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-07-09 19:11:29 +00:00
behlendo 97f274d46d Fix race in kmem_locking test
Reduce max memory usage for kmem_locking tests (for low memory machines)



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@145 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-07-07 22:15:04 +00:00
behlendo f78a933f8a Two easy fixes I caught with debug enabled
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@143 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-07-01 04:06:09 +00:00
behlendo 3ba97a6743 Update info to prep for a tag. If all goes well I'll have
something I'm not too embarrassed to distribute tomorrow.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@142 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-07-01 03:42:24 +00:00
behlendo a1502d76ae - Remove hash functionality from slab in favor of direct lookups
  based off the spl_kmem_obj_t tacked on the end of each object.
  This actually isn't so bad because we are now allocing large
  chunks for the slab and partitioning it ourselves.  So there's
  not a ton of wasted space.  We may suffer a performance hit
  however due to alignment issues.

- Remove remaining dependencies on the linux slab implementation.
  We're standing on our own now for better or worse.

- Rework slabs to be either kmem or vmem based.  If neither
  KMC_VMEM or KMC_KMEM are specified we make a decent guess
  about what will work best based on the object
  size.  Additionally we provide a kmem_virt() function callers
  can use to see if they have a virtual or physical address.

- Minor fixups in the test suite.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@141 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-07-01 03:28:54 +00:00
behlendo 1c3832576d Remove stray call to spl_cache_free() and remove all the
cycle counters which were costing me overhead.  It was hurting
performance pretty badly for heavily used caches.  I'm also
thinking the hash may be hurting me as well and it might
be worth sticking a pointer in to a little space after the
alloced object.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@140 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-28 20:03:11 +00:00
behlendo fece7c99bf Victory! I've reworked caches with large objects which are
backed by vmalloc()'ed memory.  I now alloc a slab which is
roughly 32*spl_obj_size and in this block of memory I place
the slab descriptor, slab object descriptors, and objects
themselves.  This greatly reduces vmalloc lock contention.

Still some minor cleanup remains and fine tuning but
it's working pretty well.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@139 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-28 05:04:46 +00:00
behlendo ff449ac406 Further slab improvements, I'm getting close to something which works
well for the expected workloads.  Improvements in this commit include:

- Added DEBUG_KMEM_TRACKING #define which can optionally be set
  when DEBUG_KMEM is defined to do per allocation tracking.  This
  allows us to get all the lightweight kmem debugging enabled by
  default which is pretty light weight, and only when looking 
  for a memory leak we can briefly enable the per alloc tracking.

- Added set_normalized_timespec() in to SPL to simplify using
  the timespec() primitives from within a module.

- Added per-spinlock cycle counters to the slab in an attempt
  to run down a lock contention issue.  The contended lock 
  was in vmalloc() but I'm going to leave the cycle counters
  in place for a little while until I'm convinced there aren't
  other locking improvements possible in the slab.

- Added a proc interface to the slab to export per slab
  cache statistics to /proc/spl/kmem/slab for analysis.

- Reworked spl_slab_alloc() function to allocate from kmem for
  small allocation and vmem for large allocations.  This improved
  things considerably but further work is needed.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@138 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-27 21:40:11 +00:00
behlendo e9d7a2bef5 Fix for memory corruption caused by overrunning the magazine
when repopulating it.  Plus I fixed a few more subtle races in
that part of the code which were catching me.  Finally I fixed
a small race in kmem_test8.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@137 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-26 19:49:42 +00:00
behlendo 4afaaefa05 Implement per-cpu local caches. This seems to have bought me another
factor of 10x improvement on SMP systems due to reduced lock contention.
This may put me in the ballpark of what is needed.  We can still further
improve things on NUMA systems by creating an additional L3 cache per 
memory node instead of the current global pool.  With luck this won't
be needed.  I should also take another look at the locking now that
everything is working.  There's a good chance I can tighten it up a
little bit and improve things a little more.

   kmem_lock: time (sec)        slabs           objs            hash
   kmem_lock:                   tot/max/calc    tot/max/calc    size/depth
   kmem_lock:  0.000999926      6/6/1           192/192/32      32768/0
   kmem_lock:  0.000999926      4/4/2           128/128/64      32768/0
   kmem_lock:  0.000999926      4/4/4           128/128/128     32768/0
   kmem_lock:  0.000999926      4/4/8           128/128/256     32768/0
   kmem_lock:  0.000999926      4/4/16          128/128/512     32768/0
   kmem_lock:  0.000999926      4/4/32          128/128/1024    32768/0
   kmem_lock:  0.000999926      4/4/64          128/128/2048    32768/0
   kmem_lock:  0.000999926      8/8/128         256/256/4096    32768/0
   kmem_lock:  0.003999704      24/23/256       768/736/8192    32768/1
   kmem_lock:  0.012999038      44/41/512       1408/1312/16384 32768/1
   kmem_lock:  0.051996153      96/93/1024      3072/2976/32768 32768/2
   kmem_lock:  0.181986536      187/184/2048    5984/5888/65536 32768/3
   kmem_lock:  0.655951469      342/339/4096    10944/10848/131072 32768/4



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@136 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-25 20:57:45 +00:00
behlendo d46630e0f3 The first locking issue was due to the semaphore I used. I was trying
to be overly clever and the context switch when the semaphore was busy
was destroying performance.  Converting to a simple spin lock bought me
a factor of 50 or so.  That said it's still not good enough.  Tests
show bad performance and we are still CPU bound.  The logical fix is
I need to implement per-cpu hot caches to minimize the SMP contention.
Linux and Solaris both have this, I was hoping to do without but it
looks like that's not to be.

   kmem_lock: time (sec)        slabs           objs            hash
   kmem_lock:                   tot/max/calc    tot/max/calc    size/depth
   kmem_lock:  0.022000000      7/6/64  224/177/2048    32768/1
   kmem_lock:  0.039000000      13/13/128       416/404/4096    32768/1
   kmem_lock:  0.079000000      23/21/256       736/672/8192    32768/1
   kmem_lock:  0.158000000      48/47/512       1536/1504/16384 32768/1
   kmem_lock:  0.345000000      105/105/1024    3360/3358/32768 32768/2
   kmem_lock:  0.760000000      202/200/2048    6464/6400/65536 32768/3



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@135 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-24 17:18:15 +00:00
behlendo 44b8f1769f Add another kmem test to check for lock contention in the slab
allocator.  I have serious contention issues here and I needed
a way to easily measure how much the following batch of changes
will improve things.  Currently things are quite bad when the
allocator is highly contended, and interestingly it seems to
get worse in a non-linear fashion... I'm not sure why yet.
I'll figure it out tomorrow.

        kmem:kmem_lock    Pass

   kmem_lock: time (sec)        slabs           objs
   kmem_lock:                   tot/max/calc    tot/max/calc
   kmem_lock:  0.061000000      75/60/64        2400/1894/2048
   kmem_lock:  0.157000000      134/125/128     4288/3974/4096
   kmem_lock:  0.471000000      263/249/256     8416/7962/8192
   kmem_lock:  2.526000000      518/499/512     16576/15957/16384
   kmem_lock: 14.393000000      990/978/1024    31680/31270/32768



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@134 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-23 23:54:52 +00:00
behlendo 5cbd57fa91 Fix minor chaos/fc9 kernel discrepancies in build
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@133 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-13 23:56:26 +00:00
behlendo 2fb9b26a85 * : modules/sys/kmem-slab.c : Re-implemented the slab to no
longer be based on the linux slab but to be its own complete
implementation.  The new slab behaves much more like the
Solaris slab than the Linux slab.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@132 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-13 23:41:06 +00:00
behlendo cfe5749941 Minor tweak to ensure kstat values are printed correctly on x86_64 at least
Additionally fix a minor typo in the .ul ULONG case.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@131 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-06 23:11:34 +00:00
behlendo c58f753ddb Prep for 0.3.2 tag
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@129 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-04 23:28:29 +00:00
behlendo 41cf38df92 Add missing () to quiet warnings in NDEBUG case
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@128 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-04 22:52:13 +00:00
behlendo 3ce1bc96f9 Fix some bad grammar
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@127 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-04 21:25:57 +00:00
behlendo 475cdc788e Just use CONFIG_SLUB to detect SLUB use
Add ASSERTF to the NDEBUG build
Fix minor issue with various debug build flags



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@126 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-04 21:09:25 +00:00
behlendo a02118a89d Whoops, fix a minor proc issue which slipped through with
the recent changes.  Ensure the top level spl is removed.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@125 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-04 06:09:16 +00:00
behlendo c30df9c863 Fixes:
1) Ensure mutex_init() never fails in the case of ENOMEM by retrying
   forever.  I don't think I've ever seen this happen but it was clear
   after code inspection that if it did we would immediately crash.

2) Enable full debugging in check.sh for sanity tests.  Might as well
   get as much debug as we can in the case of a failure.

3) Reworked list of kmem caches tracked by SPL in to a hash with the
   key based on the address of the kmem_cache_t.  This should speed
   up the constructor/destructor/shrinker lookup needed now for newer
   kernels which removed the destructor support.

4) Updated kmem_cache_create to handle the case where CONFIG_SLUB
   is defined.  The slub would occasionally merge slab caches which
   resulted in non-unique keys for our hash lookup in 3).  To fix this
   we detect if the slub is enabled and then set the needed flag
   to prevent this merging from ever occurring.

5) New kernels removed the proc_dir_entry pointer from items
   registered by sysctl.  This means we can no longer be sneaky and
   manually insert things in to the sysctl tree simply by walking
   the proc tree.  So I'm forced to create a separate tree for
   all the things I can't easily support via sysctl interface.
   I don't like it but it will do for now.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@124 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-04 06:00:46 +00:00
behlendo 691d2bd733 Update utsname to use proper compatible interface to avoid API issues.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@123 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-03 21:20:18 +00:00
behlendo 684f787474 Fix missing return resulting in a double unlock of &files->file_lock
and a hang on subsequent sys_close.  I'm not quite sure why the Fedora
kernel caught this bug and the Chaos kernel did not, but I'm glad!

Convert remaining BUG_ON's to ASSERTs



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@122 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-03 20:58:55 +00:00
behlendo fe81cb1c43 Add the minimal set of kernel patches needed for the SPL. Hopefully
even these will not be needed over the next few weeks.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@121 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-02 19:45:04 +00:00
behlendo f93f7c8dbe This should have been part of the previous autoconf commit.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@120 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-02 18:41:30 +00:00
behlendo 57d862349b Breaking the world for a little bit. If anyone is going to continue
working on this branch for the next few days I suggest you work
off of the 0.3.1 tag.  The following changes are fairly extensive
and are designed to make the SPL compatible with all kernels in
the range of 2.6.18-2.6.25.  There were 13 relevant API changes
between these releases and I have added the needed autoconf tests
to check for them.  However, this has not all been tested extensively.
I'll sort out the breakage on Fedora Core 9 and RHEL5 this week.

SPL_AC_TYPE_UINTPTR_T
SPL_AC_TYPE_KMEM_CACHE_T
SPL_AC_KMEM_CACHE_DESTROY_INT
SPL_AC_ATOMIC_PANIC_NOTIFIER
SPL_AC_3ARGS_INIT_WORK
SPL_AC_2ARGS_REGISTER_SYSCTL
SPL_AC_KMEM_CACHE_T
SPL_AC_KMEM_CACHE_CREATE_DTOR
SPL_AC_3ARG_KMEM_CACHE_CREATE_CTOR
SPL_AC_SET_SHRINKER
SPL_AC_PATH_IN_NAMEIDATA
SPL_AC_TASK_CURR
SPL_AC_CTL_UNNUMBERED



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@119 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-06-02 17:28:49 +00:00
behlendo 65a045dace Make a tag just for release to CEA, this has the UCRL
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@115 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-26 05:01:15 +00:00
behlendo 715f625146 Go through and add a header with the proper UCRL number.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@114 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-26 04:38:26 +00:00
behlendo b2585b36d3 Prep for 0.3.0 tag, this is the tag which was used for all
the performance results to date.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@112 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-21 21:11:47 +00:00
behlendo cc7449ccd6 - Properly fix the debug support so all the ASSERTs, VERIFYs, etc. can be
  compiled out when doing performance runs.
- Bite the bullet and fully autoconfize the debug options in the configure
  time parameters.  By default all the debug support is disabled in the core
  SPL build, but available to modules which enable it when building against
  the SPL.  To enable particular SPL debug support use the following configure
  options:

  --enable-debug		Internal ASSERTs
  --enable-debug-kmem		Detailed memory accounting
  --enable-debug-mutex		Detailed mutex tracking
  --enable-debug_kstat          Kstat info exported to /proc
  --enable-debug-callb		Additional callb debug



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@111 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-19 02:49:12 +00:00
behlendo 6ab69573ff SPL additions to increase support for updated ZFS build
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@110 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-15 23:39:19 +00:00
behlendo 56f9245330 Disable adaptive mutexes by default (always sleep), and while
I'm at it add a module option for easy tuning.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@109 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-15 17:32:41 +00:00
behlendo 4efd41189a Rework condition variable implementation to be consistent with
other primitive implementations.  Additionally ensure that GFP_ATOMIC
is used for allocations when in interrupt context.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@108 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-15 17:10:30 +00:00
behlendo a97df54e83 Enhance the thread interface to do something quasi intelligent
with the function name passed to be used as a thread name.  Leaving
the trailing _thread is just redundant so just strip it; this
makes the thread names far more readable.

Use a strncpy in spl-mutex  just to be safe.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@107 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-12 18:54:08 +00:00
behlendo 8464443f8d Add a comment so I remember to fix this.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@106 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-12 16:53:41 +00:00
behlendo c6dc93d6a8 By default disable extra KMEM and MUTEX debugging to aid performance.
They can easily be re-enabled when new stability issues are uncovered.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@105 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-09 22:53:20 +00:00
behlendo 5c2bb9b2c3 Stability hack. Under Solaris when KM_SLEEP is set kmem_cache_alloc()
may not fail.  To get this behavior I'd added a retry to the shim layer
even though it is abusive to the VM, at least it should prevent the crash.
Additionally I added a proc counter so I can easily check how often this
is happening.  It should be fairly rare, but likely will get worse and
worse the longer the machine has been up.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@104 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-09 21:21:33 +00:00
behlendo 04a479f706 Add an almost feature complete implemenation of kstat. I chose
not to support a few flags (we assert if they are used), and I
did not add the libkstat interface and instead exported everything
to proc for easy access.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@103 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-08 23:21:47 +00:00
behlendo d4c540de38 Same deal as ZFS, we're quite stable now so tag it.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@101 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-07 20:12:44 +00:00
behlendo 427a782d7d Decrease the kmem warning threshold back to 2 pages, no worse than a stack.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@100 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-07 19:33:01 +00:00
behlendo 13cdca65ec Add vmem memory accounting
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@99 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-07 18:54:32 +00:00
behlendo 404992e31a - Relocate 'stats_per' in to proper /proc/sys/spl/mutex/ directory
- Shift to spinlock for mutex list addition and removal



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@98 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-07 17:58:22 +00:00
behlendo 4f86a887d8 Remaining issues fixed after re-enabling mutex debugging.
- Ensure the mutex_stats_sem and mutex_stats_list are initialized
- Only spin if you have to in mutex_init



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@97 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-06 23:19:27 +00:00
behlendo e8b31e8482 - Updated rwlock's to reside in a .c file instead of a static inline
- Updated rwlock's so they can be safely initialized in ctors.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@96 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-06 23:00:49 +00:00
behlendo d6a26c6a32 Lots of fixes here:
- Detailed kmem memory allocation tracking.  We can now get on
  spl module unload a list of all memory allocations which were
  not free'd and where the original alloc was.  E.g.

SPL: 15554:632:(spl-kmem.c:442:kmem_fini()) kmem leaked 90/319332 bytes
SPL: 15554:648:(spl-kmem.c:451:kmem_fini()) address          size  data             func:line
SPL: 15554:648:(spl-kmem.c:457:kmem_fini()) ffff8100734b68b8 32    0100000001005a5a __spl_mutex_init:70
SPL: 15554:648:(spl-kmem.c:457:kmem_fini()) ffff8100734b6148 13    &tl->tl_lock     __spl_mutex_init:74
SPL: 15554:648:(spl-kmem.c:457:kmem_fini()) ffff81007ac43730 32    0100000001005a5a __spl_mutex_init:70
SPL: 15554:648:(spl-kmem.c:457:kmem_fini()) ffff81007ac437d8 13    &tl->tl_lock     __spl_mutex_init:74

- Shift to using rwsems in kmem implementation, to simplify locking and
  improve concurrency.

- Shift to using rwsems in mutex implementation, additionally ensure we
  never sleep in the init function if non-zero preempt_count or 
  interrupts are disabled as can happen in a slab cache ctor/dtor.

- Other minor formatting fixes and such.

TODO:

- Finish the vmem memory allocation tracking

- Vet all other SPL primitives for potential sleeping during *_init.  I
suspect the rwlock implementation does this and should be fixed just
like the mutex implementation.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@95 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-06 20:38:28 +00:00
behlendo 9ab1ac14ad Commit adaptive mutexes. This seems to have introduced some new
crashes but it's not clear to me yet if these are a problem with
the mutex implementation or ZFS's usage of it.

Minor taskq fixes to add new tasks to the end of the pending list.

Minor enhancements to the debug infrastructure.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@94 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-05-05 20:18:49 +00:00
behlendo bcd68186d8 New and improved taskq implementation for the SPL. It allows a
configurable number of threads like the Solaris version and almost
all of the options are supported.  Unfortunately, it appears to have
made absolutely no difference to our performance numbers.  I need
to keep looking for where we are bottlenecking.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@93 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-25 22:10:47 +00:00
behlendo 839d8b438e Update kmem.h to properly use new debug subsystem.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@92 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-24 20:21:07 +00:00
behlendo 3561541c24 Prep for 0.2.1 tag
Minor fixes to headers to use debug macros
Added /proc/sys/spl/version



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@90 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-24 17:41:23 +00:00
wartens2 1bac409fa3 Forgot to update the ChangeLog.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@89 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-24 17:24:02 +00:00
wartens2 8100fe56f1 Make sure that when calling __vmem_alloc that we
do not have __GFP_ZERO set.  Once the memory is allocated
then zero out the memory if __GFP_ZERO is passed to
__vmem_alloc.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@88 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-24 17:07:56 +00:00
behlendo 6e605b6e58 Minor improvement to taskq handling. This is a small step towards
dynamic taskqs which still need to be fully implemented.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@87 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-23 21:19:47 +00:00
behlendo 18c9eadf97 Be careful to never use any of the debug infrastructure either
before the debug subsystem is fully set up, or after the debug
subsystem has been torn down.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@86 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-22 22:22:02 +00:00
behlendo 7e4e211333 Give it a real version for a tag
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@84 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-22 20:10:55 +00:00
behlendo b831734a43 Stack usage is my enemy. Trade cpu cycles in the debug code to
ensure I never add anything to the stack I don't absolutely need.
All this debug code could be removed from a production build
anyway so I'm not so worried about the performance impact.  We
may also consider revisiting the mutex and condvar implementation
to ensure no additional stack is used there.

Initial indications are I have reduced the worst case stack
usage to 9080 bytes.  Still too large for the default 8k stacks
so I have been forced to run with 16k stacks until I can
reduce the worst offenders.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@83 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-22 16:55:26 +00:00
behlendo 7fea96c04f More fixes to ensure we get good debug logs even if we're in the
process of destroying the stacks.  Threshold set fairly aggressively
to 80% of stack usage.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@82 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-21 22:44:11 +00:00
behlendo e5bbd245e3 Added 4 missing subsystem flags
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@81 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-21 18:43:02 +00:00
behlendo a8ac0b8966 Whoops, missed an instance where we could recursively stack check... bad.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@80 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-21 18:16:04 +00:00
behlendo 892d51061e Handful of minor stack checking fixes
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@79 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-21 18:08:33 +00:00
behlendo 937879f11d Update SPL to use new debug infrastructure. This means:
- Replacing all BUG_ON()'s with proper ASSERT()'s
- Using ENTRY, EXIT, GOTO, and RETURN macros to instrument call paths



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@78 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-21 17:29:47 +00:00
behlendo 2fae1b3d0a First minor batch of fixes. Catch a dropped ;, and use SBUG instead of BUG.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@77 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-19 00:02:11 +00:00
behlendo ce86265693 Whoops need this!
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@76 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-18 23:42:45 +00:00
behlendo 57d1b18858 First commit of lustre style internal debug support. These
changes bring over everything lustre had for debugging with
two exceptions.  I dropped the debug daemon and upcalls
just because it made things a little easier.  They can be
re-added easily enough if we feel they are needed.

Everything compiles and seems to work on first inspection
but I suspect there are a handful of issues still lingering
which I'll be sorting out right away.  I just wanted to get
all these changes committed and safe.  I'm getting a little
paranoid about losing them.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@75 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-18 23:39:58 +00:00
wartens2 55152ebbb4 * modules/spl/spl-kmem.c : Make sure to disable interrupts
when necessary to avoid deadlocks.  We were seeing the deadlock
when calling kmem_cache_generic_constructor() and then an interrupt
forced us to end up calling kmem_cache_generic_destructor()
which caused our deadlock.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@74 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-16 16:37:51 +00:00
behlendo d61e12af5a - Add some spinlocks to cover all the private data in the mutex. I don't
think this should fix anything but it's a good idea regardless.

- Drop the lock before calling the constructor/destructor for the slab
otherwise we can't sleep in a constructor/destructor and for long running
functions we may NMI.

- Do something braindead, but safe for the console debug logs for now.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@73 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-15 20:53:36 +00:00
behlendo c5fd77fcbf Just clean up an error case to avoid overspamming the console.
We get the stack once from the BUG() no reason to dump it twice.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@72 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-14 18:37:20 +00:00
behlendo f23e92fabf Add hw_serial support based on a usermodehelper which runs
at spl module load time and calls hostid.  The resolved hostid
is then fed back in to a proc entry for later use.  It's
not a pretty thing, but it will work for now.  The hw_serial
is required for things such as 'zpool status' to work.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@71 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-12 04:27:59 +00:00
behlendo 12ea923056 Adjust the condition variables to simply sleep uninterruptibly.
This way we don't have to contend with spurious wakeups which
it appears ZFS is not so careful to handle anyway.  So this is
probably for the best.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@70 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-11 22:49:48 +00:00
behlendo 115aed0dd8 - Add more strict in_atomic() checking to the mutex entry
function just to be extra safe and paranoid.

- Rewrite the thread shim to take full advantage of the
new kernel kthread API.  This greatly simplifies things.

- Add a new regression test for thread_exit() to ensure
it properly terminates a thread immediately without
allowing further execution of the thread.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@69 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-11 17:03:57 +00:00
behlendo 79f92663e3 Fix race in rwlock implementation which can occur when
your task is rescheduled to a different cpu after you've
taken the lock but before RW_LOCK_HELD is called.
We need the spinlock to ensure there is a wmb() there.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@68 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-07 23:54:34 +00:00
behlendo 728b9dd800 - Fix write-only behavior in vn_open()
- Ensure we have at least 1 write-only splat test
- Fix return codes for vn_*.  Solaris does not use negative return
  codes in the kernel.  So linux errno's must be inverted.




git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@67 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-04 17:08:12 +00:00
behlendo 968eccd1d1 Update the thread shim to use the current kernel threading API.
We need to use kthread_create() here for a few reasons.  First
off the old kernel_thread() API function will be going away.
Secondly, and more importantly if I use kthread_create() we can
then properly implement a thread_exit() function which terminates
the kernel thread at any point with do_exit().  This fixes our
cleanup bug which was caused by dropping a mutex twice after
thread_exit() didn't really exit.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@66 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-04 04:44:16 +00:00
behlendo 996faa6869 Correctly implement atomic_cas_ptr() function. Ideally all of these
atomic operations will be rewritten anyway with the correct arch
specific assembly.  But not today.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@65 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-03 21:48:57 +00:00
behlendo 0a6fd143fd - Remapped ldi_handle_t to struct block_device * which is much more useful
- Added linux block device headers to sunldi.h
- Made __taskq_dispatch safe for interrupt context where it turns out we
  need to be using it.
- Fixed NULL const/dest bug for kmem slab caches
- Placed __dprintf debugging messages under a spin_lock_irqsave
  so it's safe to use them in interrupt handlers.  For debugging only!



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@64 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-03 16:33:31 +00:00
behlendo 0998fdd6db Apparently it's OK for done to be NULL, which was not clear in the
Solaris man page.  Anyway, since apparently this usage is acceptable
I've updated the function to handle it.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@63 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-01 17:00:06 +00:00
behlendo 6a585c61de Double large kmalloc warning size to 4 pages. It was 2 pages, and ideally
it should be dropped to one page but in the short term we should be able
to easily live with 4 page allocations.

Fix the nvlist bug.  It turns out the user space side was packing the
nvlists correctly as little endian, while the kernel space side, due to
a missing #define, was unpacking them as big endian.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@62 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-01 16:09:18 +00:00
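
To illustrate why the mismatch described above corrupts data, here is a minimal sketch that decodes the same four packed bytes both ways. The helper names `sketch_decode_le32` and `sketch_decode_be32` are hypothetical and are not part of the nvpair code; they only show the effect of reading a little endian value with the wrong byte order.

```c
#include <stdint.h>

/* Interpret four bytes as a little endian 32-bit value. */
static uint32_t
sketch_decode_le32(const uint8_t b[4])
{
	return ((uint32_t)b[0] | ((uint32_t)b[1] << 8) |
	    ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24));
}

/* Interpret the same four bytes as a big endian 32-bit value. */
static uint32_t
sketch_decode_be32(const uint8_t b[4])
{
	return (((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
	    ((uint32_t)b[2] << 8) | (uint32_t)b[3]);
}
```

A value of 1 packed as little endian is the byte sequence `01 00 00 00`; decoding it as big endian yields `0x01000000` instead.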
behlendo e966e04fd5 Ensure all file ops pointers are NULL or we may end up
calling garbage pointers on open/close etc and get
what look like random crashes.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@61 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-04-01 03:24:17 +00:00
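
The general idea behind this fix can be sketched as follows, assuming a hypothetical `struct sketch_file_ops` table (the real structure and its members are not shown in this log). Zero-initializing the whole table before filling in the supported callbacks guarantees that every unset entry is NULL, so a dispatcher can test for it instead of jumping to a garbage address.

```c
#include <stddef.h>

/* Hypothetical callback table; member names are illustrative only. */
struct sketch_file_ops {
	int (*open)(void *ctx);
	int (*close)(void *ctx);
	int (*ioctl)(void *ctx, unsigned long cmd);
};

static int
sketch_open(void *ctx)
{
	(void) ctx;
	return (0);
}

/* Build a table in which every callback not set explicitly is NULL. */
static struct sketch_file_ops
sketch_make_ops(void)
{
	struct sketch_file_ops ops = { 0 };	/* zero every member */

	ops.open = sketch_open;
	return (ops);
}

/* Dispatch close only when a callback was registered. */
static int
sketch_call_close(const struct sketch_file_ops *ops, void *ctx)
{
	if (ops->close == NULL)
		return (-1);
	return (ops->close(ctx));
}
```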
behlendo 4fd2f7eea5 Add vmem_zalloc support.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@60 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-31 23:04:07 +00:00
behlendo 8d0f1ee907 Add some crude debugging support. It leaves a lot to be
desired, but it should allow easier kernel debugging for now.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@59 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-31 20:42:36 +00:00
behlendo e487ee08fb Fixed that.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@58 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-28 18:22:29 +00:00
behlendo 9f4c835a0e Correctly functioning 64-bit atomic shim layer. It's not
what I would call efficient but it does have the advantage
of being correct which is all I need right now.  I added
a regression test as well.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@57 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-28 18:21:09 +00:00
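
A "correct but not efficient" 64-bit atomic of the kind described above might look like the sketch below. It is an assumption-laden illustration rather than the committed shim: the function name `sketch_atomic_add_64_nv` and the global spinlock are hypothetical. Serializing every update through one lock keeps the full 64-bit read-modify-write indivisible even where the hardware only offers 32-bit atomics.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical lock that serializes all 64-bit updates. */
static atomic_flag sketch_atomic64_lock = ATOMIC_FLAG_INIT;

/* Add delta to *target under the lock and return the new value. */
static uint64_t
sketch_atomic_add_64_nv(uint64_t *target, int64_t delta)
{
	uint64_t nv;

	while (atomic_flag_test_and_set(&sketch_atomic64_lock))
		;	/* spin until the lock is free */

	*target += (uint64_t)delta;	/* unsigned wraparound is defined */
	nv = *target;

	atomic_flag_clear(&sketch_atomic64_lock);

	return (nv);
}
```

A regression test for such a shim would at minimum check that a carry across the 32-bit boundary is preserved.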
behlendo 4a4295b267 Remove minor lingering CDDL taint of copied headers. Required
headers rewritten to include minimally what we need.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@56 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-27 23:40:09 +00:00
behlendo d429b03d85 - Thinko fix to the SPL module interface
- Enhance the VERIFY() support to output the values which
  failed to compare as expected before crashing.  This makes
  debugging much much much easier.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@55 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-27 22:06:59 +00:00
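
The benefit of printing both operands can be shown with a small userland sketch. The macro `SKETCH_VERIFY3` and its helper are hypothetical and differ from the real VERIFY() support in one deliberate way: instead of crashing, the sketch reports the failure on stderr and returns 0, which keeps it easy to exercise.

```c
#include <stdio.h>

/* Report a failed comparison together with both operand values. */
static int
sketch_verify3_report(int ok, const char *expr, long long left,
    long long right, const char *file, int line)
{
	if (!ok) {
		fprintf(stderr, "VERIFY3(%s) failed (%lld vs %lld) at %s:%d\n",
		    expr, left, right, file, line);
	}
	return (ok);
}

/*
 * Compare two integer expressions with the given operator.
 * Note: each operand is evaluated twice, so avoid side effects.
 */
#define	SKETCH_VERIFY3(l, op, r)					\
	sketch_verify3_report((long long)(l) op (long long)(r),	\
	    #l " " #op " " #r, (long long)(l), (long long)(r),	\
	    __FILE__, __LINE__)
```

With the operand values in the message, a failed check shows at a glance which side was wrong rather than only that the expression was false.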
behlendo 8ac547ec4c Relocated to zfs repo
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@54 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-27 17:58:10 +00:00
behlendo 336bb0c0c1 Two fixes to the module interface. Could be worse!
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@53 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-21 19:16:25 +00:00
behlendo 4e62fd4104 OK, a first reasonable attempt at a solaris module/chdev shim layer.
This should handle the absolute minimum I need for ZFS.  It will 
register the chdev with the right callbacks.  Then the generic 
registered linux callback will find the right registered solaris
callback for the function and munge the args just right before
passing it on.  Should work, but untested (just compiled), so I
expect bugs.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@52 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-20 23:30:15 +00:00
behlendo e4f1d29f89 OK, some pretty substantial rework here. I've merged the spl-file
stuff which only included the getf()/releasef() in to the vnode area
where it will only really be used.  These calls allow a user to
grab an open file struct given only the known open fd for a particular
user context.  ZFS makes use of these, but they're a bit tricky to
test from within the kernel since you already need the file open
and know the fd.  So we basically spoof the system calls to set up
the environment we need for the splat test case and verify given
just the known fd we can get the file, create the needed vnode, and
then use the vnode interface as usual to read and write from it.

While I was hacking away I also noticed a NULL termination issue
in the second kobj test case so I fixed that too.  In fact, I fixed
a few other things as well but all for the best!



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@51 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-18 23:20:30 +00:00
behlendo 5d86345d37 Initial pass at a file API getf/releasef hooks
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@50 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-18 04:56:43 +00:00
behlendo 1ec74a114c Minimal signal handling interface.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@49 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-17 18:29:57 +00:00
behlendo 2bdb28fbe0 Missing headers, more minor fixes
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@48 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-15 00:05:38 +00:00
behlendo c19c06f3b0 Fix kmem memory accounting
Adjust kmem slab interface to make a copy of the slab name before
passing it on to the linux slab (we free it later too)



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@47 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-14 20:56:26 +00:00
behlendo 79b31f3601 Fix KMEM_DEBUG support (enable by default)
Add vmem_alloc/vmem_free support (and test case)
Add missing time functions



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@46 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-14 19:04:41 +00:00
behlendo af828292e5 Add missing headers
Rework vnodes to be based on the slab cache, just like on Solaris.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@45 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-14 00:04:01 +00:00
behlendo ea19fbed05 Add missing headers
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@44 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-13 22:52:23 +00:00
behlendo 8ddd0ee415 Add two more missing headers
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@43 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-13 20:41:29 +00:00
behlendo 73e540a0d1 Drop unicode support, provided in ZFS tree libport
Update uio support


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@42 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-13 19:49:09 +00:00
behlendo 36e6f86146 - Add some more missing headers
- Map the LE/BE_* byteorder macros to the linux versions
- More minor vnodes fixes


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@41 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-12 23:48:28 +00:00
behlendo 2f5d55aac5 Add copyin/copyout mapping
Fix some vnode types



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@40 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-12 21:33:28 +00:00
behlendo 4b17158506 - Implemented vnode interfaces and 6 test cases to the test suite.
- Re-implemented kobj support based on the vnode support.
- Add TESTS option to check.sh, and removed delay after module load.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@39 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-12 20:52:46 +00:00
behlendo 9490c14835 Apply fix from bug239 for rwlock deadlock.
Update check.sh script to take V=1 env var so you can run it verbosely as
follows if you're chasing something: sudo make check V=1

Add new kobj api and needed regression tests to allow reading of files from
within the kernel.  Normally that's not something I support but the spa layer
needs the support for its config file.

Add some more missing stub headers



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@38 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-11 20:54:40 +00:00
behlendo b123971fc2 Two more GPL only symbols moved to helper functions in the spl module.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@37 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-11 02:08:57 +00:00
behlendo ee4766827a Remap gethrestime() with #define to new symbol and export that new
symbol to avoid direct use of GPL only symbol.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@36 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-10 21:38:39 +00:00
behlendo 51f443a074 Add some typedefs to make it clearer when we're passing a function,
Add fm_panic define
Add another bad atomic hack (need to do this right)


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@35 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-10 19:25:20 +00:00
behlendo 4098c921b6 Fix systemic naming mistake
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@34 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-10 19:04:14 +00:00
behlendo 6adf99e7d6 Add missing headers
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@33 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-10 17:05:34 +00:00
behlendo 12472b242d Just filling in more of the env.
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@32 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-08 00:58:32 +00:00
behlendo 05ae387b50 Add some debugging support
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@31 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-08 00:18:21 +00:00
behlendo 0b3cf046cb Add the initial vestiges of vnode support
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@30 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-07 23:07:02 +00:00
behlendo 3b3ba48fe9 Add missing cred.h functions
Resolve compiler warning with kmem_free (unused len)
Add stub for byteorder.h
Add zlib shim for compress2 and uncompress functions



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@29 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-07 20:48:44 +00:00
behlendo b0dd3380aa Minor atomic cleanup, this needs to be done right.
Fixed a bug in the timer code
Added missing macros



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@28 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-07 00:28:32 +00:00
behlendo ed61a7d05e Add some missing rw_lock symbols
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@27 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-06 23:42:37 +00:00
behlendo 77b1fe8fa8 Add highbit func,
Add sloppy atomic declaration which will need to be fixed (eventually)
Fill out more of the Solaris VM hooks
Adjust the create_thread function



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@26 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-06 23:12:55 +00:00
behlendo a713518f5d Checkpoint for the night,
added a few more stub headers,
fleshed out a few stub headers,
added a FIXME file,
added various compatibility macros


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@25 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-05 00:58:54 +00:00
behlendo 23f28c4f75 Remove spl.h, just include the headers you need.
Add a few more stubs.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@24 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-04 20:06:29 +00:00
behlendo 48f940b943 Fix type
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@23 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-04 19:38:27 +00:00
behlendo 14c5326ccd More stub headers,
moved generic to sysmacros,
added some more macros for kernel compatibility


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@22 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-04 18:22:31 +00:00
behlendo dbb484ec60 Stub out some missing headers which are expected. I'll fill
in what the contents need to be as I encounter the warnings
about missing prototypes, symbols, and such.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@21 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-01 18:30:12 +00:00
behlendo d5f087adef Remaining loose ends
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@20 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-01 00:51:41 +00:00
behlendo ea70970ff5 Almost dropped this!
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@19 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-01 00:47:23 +00:00
behlendo f4b377415b Reorganize /include/ to add a /sys/, this way we don't need to
muck with #includes in existing Solaris style source to get it
to find the right stuff.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@18 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-03-01 00:45:59 +00:00
behlendo 09b414e880 Minor nit, SOLARIS should be SPL
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@17 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-02-28 00:52:31 +00:00
behlendo 596e65b4e8 OK, I think this is the last of major cleanup and restructuring.
We've dropped all the linux- prefixes on the files in favor of spl-
which makes more sense.  And we've cleaned up some of the includes
so everybody should be including their own dependencies properly.
All a module which wants to use the spl support needs to do is
include spl.h and ensure it has access to Module.symvers.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@16 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-02-28 00:48:31 +00:00
behlendo 07d339d467 Add top level make check target which runs the validation
suite.  Careful with this, right now one of the tests still
causes a lockup on the node.  This happened before the move
from the ZFS repo so it's not a new issue.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@15 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-02-28 00:16:24 +00:00
behlendo 7c50328b40 More cleanup.
- Removed all references to kzt and replaced with splat
- Moved portions of include files which do not need to be
  available to all source files in to local.h files in 
  proper source subdirs.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@14 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-02-27 23:42:31 +00:00
behlendo 70eadc1958 OK, it builds... and the modules load... now for some more
cleanup to remove the remaining vestiges of the time it lived
with the ZFS code.


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@13 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-02-27 21:56:51 +00:00
behlendo e4009e98c7 Quiet libtool
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@12 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-02-27 20:55:24 +00:00
behlendo a0aadf5666 OK, everything builds now. My initial intent was to place all of
the directories at the top level but that proved troublesome.  The
kernel buildsystem and autoconf were conflicting too much.  To 
resolve the issue I moved the kernel bits in to a modules directory
which can then only use the kernel build system.  We just pass 
along the likely make targets to the kernel build system.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@11 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-02-27 20:52:44 +00:00
behlendo 1735fa73f4 New approach
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@10 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-02-27 20:28:52 +00:00
behlendo 032d12a900 Move dir
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@9 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-02-27 19:36:31 +00:00
behlendo d01858a1ca Move dir
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@8 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-02-27 19:36:20 +00:00
behlendo ce58df9226 Move dir
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@7 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-02-27 19:36:07 +00:00
behlendo 15821fd660 Move dir
git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@6 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-02-27 19:35:54 +00:00
behlendo f1b59d2620 Lots of build fixes. This is turning out to be a very good
idea since it is forcefully codifying the ABI.  Since the shim
layer is no longer linked at build time in to the test suite
we can't cut any corners and get away with it.

Everything is working now with the exception of sorting out
setting Module.symvers properly.  This may take a little
Makefile reorg.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@5 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-02-27 19:09:51 +00:00
behlendo 3d4ea0ced6 More build fixes, I have the kernel module almost building and it's
feeling a lot more sane, cleaner, and linuxy.  I may finish this tonight.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@4 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-02-27 00:59:48 +00:00
behlendo 564f6d1509 User space build fixes:
- Add list handling compatibility library
- Drop uu_* list handling in favor of local list implementation
- libtoolize
- generic makefile cleanup



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@3 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-02-26 23:20:41 +00:00
behlendo 8f48c2c853 Whoops, I knew I'd miss something small in the build system. Fix
it


git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@2 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-02-26 20:37:43 +00:00
behlendo f1ca4da6f7 Initial commit. All spl source written up to this point wrapped
in an initial reasonable autoconf style build system.  This does
not yet build but the configure system does appear to work properly
and integrate with the kernel.  Hopefully the next commit gets
us back to a buildable version we can run the test suite against.



git-svn-id: https://outreach.scidac.gov/svn/spl/trunk@1 7e1ea52c-4ff2-0310-8f11-9dd32ca42a1c
2008-02-26 20:36:04 +00:00
1739 changed files with 171156 additions and 35534 deletions
+93 -26
@@ -27,11 +27,14 @@ started?](#what-should-i-know-before-i-get-started)
* [Commit Message Formats](#commit-message-formats)
* [New Changes](#new-changes)
* [OpenZFS Patch Ports](#openzfs-patch-ports)
* [Coverity Defect Fixes](#coverity-defect-fixes)
* [Signed Off By](#signed-off-by)
Helpful resources
* [ZFS on Linux wiki](https://github.com/zfsonlinux/zfs/wiki)
* [OpenZFS Documentation](http://open-zfs.org/wiki/Developer_resources)
* [Git and GitHub for beginners](https://github.com/zfsonlinux/zfs/wiki/Git-and-GitHub-for-beginners)
## What should I know before I get started?
@@ -53,14 +56,15 @@ of these tools are discussed in detail on the [debugging ZFS wiki
page](https://github.com/zfsonlinux/zfs/wiki/Debugging).
### Where can I ask for help?
The [mailing list](https://github.com/zfsonlinux/zfs/wiki/Mailing-Lists)
is the best place to ask for help.
[The zfs-discuss mailing list or IRC](http://list.zfsonlinux.org)
are the best places to ask for help. Please do not file support requests
on the GitHub issue tracker.
## How Can I Contribute?
### Reporting Bugs
*Please* contact us via the [mailing
list](https://github.com/zfsonlinux/zfs/wiki/Mailing-Lists) if you aren't
*Please* contact us via the [zfs-discuss mailing
list or IRC](http://list.zfsonlinux.org) if you aren't
certain that you are experiencing a bug.
If you run into an issue, please search our [issue
@@ -167,18 +171,10 @@ first line in the commit message.
please summarize important information such as why the proposed
approach was chosen or a brief description of the bug you are resolving.
Each line of the body must be 72 characters or less.
* The last line must be a `Signed-off-by:` tag with the developer's
name followed by their email. This is the developer's certification
that they have the right to submit the patch for inclusion into
the code base and indicates agreement to the [Developer's Certificate
of Origin](https://www.kernel.org/doc/html/latest/process/submitting-patches.html#sign-your-work-the-developer-s-certificate-of-origin).
Code without a proper signoff cannot be merged.
* The last line must be a `Signed-off-by:` tag. See the
[Signed Off By](#signed-off-by) section for more information.
Git can append the `Signed-off-by` line to your commit messages. Simply
provide the `-s` or `--signoff` option when performing a `git commit`.
For more information about writing commit messages, visit [How to Write
a Git Commit Message](https://chris.beams.io/posts/git-commit/).
An example commit message is provided below.
An example commit message for new changes is provided below.
```
This line is a brief summary of your change
@@ -192,23 +188,23 @@ Signed-off-by: Contributor <contributor@email.com>
```
#### OpenZFS Patch Ports
If you are porting an OpenZFS patch, the commit message must meet
If you are porting OpenZFS patches, the commit message must meet
the following guidelines:
* The first line must be the summary line from the OpenZFS commit.
It must begin with `OpenZFS dddd - ` where `dddd` is the OpenZFS issue number.
* Provides an `Authored by:` line to attribute the patch to the original author.
* Provides the `Reviewed by:` and `Approved by:` lines from the original
* The first line must be the summary line from the most important OpenZFS commit being ported.
It must begin with `OpenZFS dddd, dddd - ` where `dddd` are OpenZFS issue numbers.
* Provides an `Authored by:` line attributing each patch to its original author.
* Provides the `Reviewed by:` and `Approved by:` lines from each original
OpenZFS commit.
* Provides a `Ported-by:` line with the developer's name followed by
their email.
* Provides an `OpenZFS-issue:` line which is a link to the original illumos
their email for each OpenZFS commit.
* Provides an `OpenZFS-issue:` line with a link for each original illumos
issue.
* Provides an `OpenZFS-commit:` line which links back to the original OpenZFS
commit.
* Provides an `OpenZFS-commit:` line with a link for each original OpenZFS commit.
* If necessary, provide some porting notes to describe any deviations from
the original OpenZFS commit.
the original OpenZFS commits.
An example OpenZFS patch port commit message is provided below.
An example OpenZFS patch port commit message for a single patch is provided
below.
```
OpenZFS 1234 - Summary from the original OpenZFS commit
@@ -223,3 +219,74 @@ Provide some porting notes here if necessary.
OpenZFS-issue: https://www.illumos.org/issues/1234
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/abcd1234
```
If necessary, multiple OpenZFS patches can be combined in a single port.
This is useful when you are porting a new patch and its subsequent bug
fixes. An example commit message is provided below.
```
OpenZFS 1234, 5678 - Summary of most important OpenZFS commit
1234 Summary from original OpenZFS commit for 1234
Authored by: Original Author <original@email.com>
Reviewed by: Reviewer Two <reviewer2@email.com>
Approved by: Approver One <approver1@email.com>
Ported-by: ZFS Contributor <contributor@email.com>
Provide some porting notes here for 1234 if necessary.
OpenZFS-issue: https://www.illumos.org/issues/1234
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/abcd1234
5678 Summary from original OpenZFS commit for 5678
Authored by: Original Author2 <original2@email.com>
Reviewed by: Reviewer One <reviewer1@email.com>
Approved by: Approver Two <approver2@email.com>
Ported-by: ZFS Contributor <contributor@email.com>
Provide some porting notes here for 5678 if necessary.
OpenZFS-issue: https://www.illumos.org/issues/5678
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/efgh5678
```
#### Coverity Defect Fixes
If you are submitting a fix to a
[Coverity defect](https://scan.coverity.com/projects/zfsonlinux-zfs),
the commit message should meet the following guidelines:
* Provides a subject line in the format of
`Fix coverity defects: CID dddd, dddd...` where `dddd` represents
each CID fixed by the commit.
* Provides a body which lists each Coverity defect and how it was corrected.
* The last line must be a `Signed-off-by:` tag. See the
[Signed Off By](#signed-off-by) section for more information.
An example Coverity defect fix commit message is provided below.
```
Fix coverity defects: CID 12345, 67890
CID 12345: Logically dead code (DEADCODE)
Removed the if(var != 0) block because the condition could never be
satisfied.
CID 67890: Resource Leak (RESOURCE_LEAK)
Ensure free is called after allocating memory in function().
Signed-off-by: Contributor <contributor@email.com>
```
#### Signed Off By
A line tagged as `Signed-off-by:` must contain the developer's
name followed by their email. This is the developer's certification
that they have the right to submit the patch for inclusion into
the code base and indicates agreement to the [Developer's Certificate
of Origin](https://www.kernel.org/doc/html/latest/process/submitting-patches.html#sign-your-work-the-developer-s-certificate-of-origin).
Code without a proper signoff cannot be merged.
Git can append the `Signed-off-by` line to your commit messages. Simply
provide the `-s` or `--signoff` option when performing a `git commit`.
For more information about writing commit messages, visit [How to Write
a Git Commit Message](https://chris.beams.io/posts/git-commit/).
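
The two mechanical rules above (a 72-character line limit and a trailing `Signed-off-by:` tag) lend themselves to an automated check. The following is a minimal sketch, not a tool shipped with the project: the function name `sketch_check_commit_msg` is hypothetical, and as a simplifying assumption it applies the 72-character limit to every line of the message rather than to the body only.

```c
#include <string.h>

#define	SKETCH_MAX_LINE	72
#define	SKETCH_TAG	"Signed-off-by:"

/*
 * Return 1 if no line exceeds SKETCH_MAX_LINE characters and the
 * last non-empty line starts with the Signed-off-by tag, else 0.
 */
static int
sketch_check_commit_msg(const char *msg)
{
	const char *line = msg;
	const char *last = NULL;	/* start of last non-empty line */
	size_t last_len = 0;

	while (*line != '\0') {
		size_t len = strcspn(line, "\n");

		if (len > SKETCH_MAX_LINE)
			return (0);
		if (len > 0) {
			last = line;
			last_len = len;
		}
		line += len;
		if (*line == '\n')
			line++;
	}

	if (last == NULL)
		return (0);

	return (last_len >= strlen(SKETCH_TAG) &&
	    strncmp(last, SKETCH_TAG, strlen(SKETCH_TAG)) == 0);
}
```

A check like this cannot confirm that the name and email are valid or that the contributor agrees to the Developer's Certificate of Origin; it only catches a missing tag or an overlong line before review.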
+11 -9
@@ -1,3 +1,5 @@
<!-- Please fill out the following template, which will help other contributors address your issue. -->
<!--
Thank you for reporting an issue.
@@ -14,15 +16,15 @@ Please fill in as much of the template as possible.
-->
### System information
<!-- add version after "|" character -->
Type | Version/Name
--- | ---
Distribution Name |
Distribution Version |
Linux Kernel |
Architecture |
ZFS Version |
SPL Version |
<!-- add version after "|" character -->
Type | Version/Name
--- | ---
Distribution Name |
Distribution Version |
Linux Kernel |
Architecture |
ZFS Version |
SPL Version |
<!--
Commands to find ZFS/SPL versions:
modinfo zfs | grep -iw version
+9 -8
@@ -1,3 +1,5 @@
<!--- Please fill out the following template, which will help other contributors review your Pull Request. -->
<!--- Provide a general summary of your changes in the Title above -->
<!---
@@ -5,13 +7,13 @@ Documentation on ZFS Buildbot options can be found at
https://github.com/zfsonlinux/zfs/wiki/Buildbot-Options
-->
### Description
<!--- Describe your changes in detail -->
### Motivation and Context
<!--- Why is this change required? What problem does it solve? -->
<!--- If it fixes an open issue, please link to the issue here. -->
### Description
<!--- Describe your changes in detail -->
### How Has This Been Tested?
<!--- Please describe in detail how you tested your changes. -->
<!--- Include details of your testing environment, and the tests you ran to -->
@@ -30,10 +32,9 @@ https://github.com/zfsonlinux/zfs/wiki/Buildbot-Options
### Checklist:
<!--- Go over all the following points, and put an `x` in all the boxes that apply. -->
<!--- If you're unsure about any of these, don't hesitate to ask. We're here to help! -->
- [ ] My code follows the ZFS on Linux code style requirements.
- [ ] My code follows the ZFS on Linux [code style requirements](https://github.com/zfsonlinux/zfs/blob/master/.github/CONTRIBUTING.md#coding-conventions).
- [ ] I have updated the documentation accordingly.
- [ ] I have read the **CONTRIBUTING** document.
- [ ] I have added tests to cover my changes.
- [ ] I have read the [**contributing** document](https://github.com/zfsonlinux/zfs/blob/master/.github/CONTRIBUTING.md).
- [ ] I have added [tests](https://github.com/zfsonlinux/zfs/tree/master/tests) to cover my changes.
- [ ] All new and existing tests passed.
- [ ] All commit messages are properly formatted and contain `Signed-off-by`.
- [ ] Change has been approved by a ZFS on Linux member.
- [ ] All commit messages are properly formatted and contain [`Signed-off-by`](https://github.com/zfsonlinux/zfs/blob/master/.github/CONTRIBUTING.md#signed-off-by).
+9 -17
@@ -1,30 +1,22 @@
codecov:
notify:
require_ci_to_pass: no
require_ci_to_pass: false # always post
after_n_builds: 2 # user and kernel
coverage:
precision: 2
round: down
range: "50...100"
precision: 2 # 2 digits of precision
range: "50...90" # red -> yellow -> green
status:
project:
default:
threshold: 1%
threshold: 1% # allow 1% coverage variance
patch:
default:
threshold: 1%
parsers:
gcov:
branch_detection:
conditional: yes
loop: yes
method: no
macro: no
threshold: 1% # allow 1% coverage variance
comment:
layout: "header, sunburst, diff"
behavior: default
require_changes: no
layout: "reach, diff, flags, footer"
behavior: once # update if exists; post new; skip if deleted
require_changes: yes # only post when coverage changes
+17 -2
@@ -14,6 +14,7 @@
# Normal rules
#
*.[oa]
*.o.ur-safe
*.lo
*.la
*.mod.c
@@ -21,6 +22,8 @@
*.swp
*.gcno
*.gcda
*.pyc
*.pyo
.deps
.libs
.dirstamp
@@ -41,8 +44,6 @@ Makefile.in
/zfs_config.h.in
/zfs.release
/stamp-h1
/.script-config
/zfs-script-config.sh
/aclocal.m4
/autom4te.cache
@@ -59,3 +60,17 @@ cscope.*
*.tar.gz
*.patch
*.orig
*.log
venv
#
# Module leftovers
#
/module/avl/zavl.mod
/module/icp/icp.mod
/module/lua/zlua.mod
/module/nvpair/znvpair.mod
/module/spl/spl.mod
/module/unicode/zunicode.mod
/module/zcommon/zcommon.mod
/module/zfs/zfs.mod
+38
@@ -0,0 +1,38 @@
language: c
sudo: required
env:
  global:
    # Travis limits maximum log size, we have to cut tests output
    - ZFS_TEST_TRAVIS_LOG_MAX_LENGTH=800
  matrix:
    # tags are mainly in ascending order
    - ZFS_TEST_TAGS='acl,atime,bootfs,cachefile,casenorm,chattr,checksum,clean_mirror,compression,ctime,delegate,devices,events,exec,fault,features,grow_pool,zdb,zfs,zfs_bookmark,zfs_change-key,zfs_clone,zfs_copies,zfs_create,zfs_diff,zfs_get,zfs_inherit,zfs_load-key,zfs_rename'
    - ZFS_TEST_TAGS='cache,history,hkdf,inuse,zfs_property,zfs_receive,zfs_reservation,zfs_send,zfs_set,zfs_share,zfs_snapshot,zfs_unload-key,zfs_unmount,zfs_unshare,zfs_upgrade,zpool,zpool_add,zpool_attach,zpool_clear,zpool_create,zpool_destroy,zpool_detach'
    - ZFS_TEST_TAGS='grow_replicas,mv_files,cli_user,zfs_mount,zfs_promote,zfs_rollback,zpool_events,zpool_expand,zpool_export,zpool_get,zpool_history,zpool_import,zpool_labelclear,zpool_offline,zpool_online,zpool_remove,zpool_reopen,zpool_replace,zpool_scrub,zpool_set,zpool_status,zpool_sync,zpool_upgrade'
    - ZFS_TEST_TAGS='zfs_destroy,large_files,largest_pool,link_count,migration,mmap,mmp,mount,nestedfs,no_space,nopwrite,online_offline,pool_names,poolversion,privilege,quota,raidz,redundancy,rsend'
    - ZFS_TEST_TAGS='inheritance,refquota,refreserv,rename_dirs,replacement,reservation,rootpool,scrub_mirror,slog,snapshot,snapused,sparse,threadsappend,tmpfile,truncate,upgrade,userquota,vdev_zaps,write_dirs,xattr,zvol,libzfs'
before_install:
- sudo apt-get -qq update
- sudo apt-get install --yes -qq build-essential autoconf libtool gawk alien fakeroot linux-headers-$(uname -r)
- sudo apt-get install --yes -qq zlib1g-dev uuid-dev libattr1-dev libblkid-dev libselinux-dev libudev-dev libssl-dev
# packages for tests
- sudo apt-get install --yes -qq parted lsscsi ksh attr acl nfs-kernel-server fio
install:
- git clone --depth=1 https://github.com/zfsonlinux/spl
- cd spl
- git checkout master
- sh autogen.sh
- ./configure
- make --no-print-directory -s pkg-utils pkg-kmod
- sudo dpkg -i *.deb
- cd ..
- sh autogen.sh
- ./configure
- make --no-print-directory -s pkg-utils pkg-kmod
- sudo dpkg -i *.deb
script:
- travis_wait 50 /usr/share/zfs/zfs-tests.sh -v -T $ZFS_TEST_TAGS
after_failure:
- find /var/tmp/test_results/current/log -type f -name '*' -printf "%f\n" -exec cut -c -$ZFS_TEST_TRAVIS_LOG_MAX_LENGTH {} \;
after_success:
- find /var/tmp/test_results/current/log -type f -name '*' -printf "%f\n" -exec cut -c -$ZFS_TEST_TRAVIS_LOG_MAX_LENGTH {} \;
+299 -90
@@ -1,95 +1,304 @@
Brian Behlendorf is the principal developer of the ZFS on Linux port.
He works full time as a computer scientist at Lawrence Livermore
National Laboratory on the ZFS and Lustre filesystems. However,
this port would not have been possible without the help of many
others who have contributed their time, effort, and insight.
MAINTAINERS:
Brian Behlendorf <behlendorf1@llnl.gov>
Brian Behlendorf <behlendorf1@llnl.gov>
Tony Hutter <hutter2@llnl.gov>
First and foremost the hard working ZFS developers at Sun/Oracle.
They are responsible for the bulk of the code in this project and
without their efforts there never would have been a ZFS filesystem.
PAST MAINTAINERS:
The ZFS Development Team at Sun/Oracle
Ned Bass <bass6@llnl.gov>
Next all the developers at KQ Infotech who implemented a prototype
ZFS Posix Layer (ZPL). Their implementation provided an excellent
reference for adding the ZPL functionality.
CONTRIBUTORS:
Anand Mitra <mitra@kqinfotech.com>
Anurag Agarwal <anurag@kqinfotech.com>
Neependra Khare <neependra@kqinfotech.com>
Prasad Joshi <prasad@kqinfotech.com>
Rohan Puri <rohan@kqinfotech.com>
Sandip Divekar <sandipd@kqinfotech.com>
Shoaib <shoaib@kqinfotech.com>
Shrirang <shrirang@kqinfotech.com>
Additionally the following individuals have all made contributions
to the project and deserve to be acknowledged.
Albert Lee <trisk@nexenta.com>
Alejandro R. Sedeño <asedeno@mit.edu>
Alex Zhuravlev <bzzz@whamcloud.com>
Alexander Eremin <a.eremin@nexenta.com>
Alexander Stetsenko <ams@nexenta.com>
Alexey Shvetsov <alexxy@gentoo.org>
Andreas Dilger <adilger@whamcloud.com>
Andrew Reid <ColdCanuck@nailedtotheperch.com>
Andrew Stormont <andrew.stormont@nexenta.com>
Andrew Tselischev <andrewtselischev@gmail.com>
Andriy Gapon <avg@FreeBSD.org>
Aniruddha Shankar <k@191a.net>
Bill Pijewski <wdp@joyent.com>
Chris Dunlap <cdunlap@llnl.gov>
Chris Dunlop <chris@onthe.net.au>
Chris Siden <chris.siden@delphix.com>
Chris Wedgwood <cw@f00f.org>
Christian Kohlschütter <christian@kohlschutter.com>
Christopher Siden <chris.siden@delphix.com>
Craig Sanders <github@taz.net.au>
Cyril Plisko <cyril.plisko@mountall.com>
Dan McDonald <danmcd@nexenta.com>
Daniel Verite <daniel@verite.pro>
Darik Horn <dajhorn@vanadac.com>
Eric Schrock <Eric.Schrock@delphix.com>
Etienne Dechamps <etienne.dechamps@ovh.net>
Fajar A. Nugraha <github@fajar.net>
Frederik Wessels <wessels147@gmail.com>
Garrett D'Amore <garrett@nexenta.com>
George Wilson <george.wilson@delphix.com>
Gordon Ross <gwr@nexenta.com>
Gregor Kopka <mailfrom-github.com@kopka.net>
Gunnar Beutner <gunnar@beutner.name>
James H <james@kagisoft.co.uk>
Javen Wu <wu.javen@gmail.com>
Jeremy Gill <jgill@parallax-innovations.com>
Jorgen Lundman <lundman@lundman.net>
KORN Andras <korn@elan.rulez.org>
Kyle Fuller <inbox@kylefuller.co.uk>
Manuel Amador (Rudd-O) <rudd-o@rudd-o.com>
Martin Matuska <mm@FreeBSD.org>
Massimo Maggi <massimo@mmmm.it>
Matthew Ahrens <mahrens@delphix.com>
Michael Martin <mgmartin.mgm@gmail.com>
Mike Harsch <mike@harschsystems.com>
Ned Bass <bass6@llnl.gov>
Oleg Stepura <oleg@stepura.com>
P.SCH <p88@yahoo.com>
Pawel Jakub Dawidek <pawel@dawidek.net>
Prakash Surya <surya1@llnl.gov>
Prasad Joshi <pjoshi@stec-inc.com>
Ricardo M. Correia <Ricardo.M.Correia@Sun.COM>
Richard Laager <rlaager@wiktel.com>
Richard Lowe <richlowe@richlowe.net>
Richard Yao <ryao@cs.stonybrook.edu>
Rohan Puri <rohan.puri15@gmail.com>
Shampavman <sham.pavman@nexenta.com>
Simon Klinkert <klinkert@webgods.de>
Suman Chakravartula <suman@gogrid.com>
Tim Haley <Tim.Haley@Sun.COM>
Turbo Fredriksson <turbo@bayour.com>
Xin Li <delphij@FreeBSD.org>
Yuxuan Shui <yshuiv7@gmail.com>
Zachary Bedell <zac@thebedells.org>
nordaux <nordaux@gmail.com>
Aaron Fineman <abyxcos@gmail.com>
Adam Leventhal <ahl@delphix.com>
Adam Stevko <adam.stevko@gmail.com>
Ahmed G <ahmedg@delphix.com>
Akash Ayare <aayare@delphix.com>
Alan Somers <asomers@gmail.com>
Alar Aun <spamtoaun@gmail.com>
Albert Lee <trisk@nexenta.com>
Alec Salazar <alec.j.salazar@gmail.com>
Alejandro R. Sedeño <asedeno@mit.edu>
Alek Pinchuk <alek@nexenta.com>
Alex Braunegg <alex.braunegg@gmail.com>
Alex McWhirter <alexmcwhirter@triadic.us>
Alex Reece <alex@delphix.com>
Alex Wilson <alex.wilson@joyent.com>
Alex Zhuravlev <alexey.zhuravlev@intel.com>
Alexander Eremin <a.eremin@nexenta.com>
Alexander Motin <mav@freebsd.org>
Alexander Pyhalov <apyhalov@gmail.com>
Alexander Stetsenko <ams@nexenta.com>
Alexey Shvetsov <alexxy@gentoo.org>
Alexey Smirnoff <fling@member.fsf.org>
Allan Jude <allanjude@freebsd.org>
AndCycle <andcycle@andcycle.idv.tw>
Andreas Buschmann <andreas.buschmann@tech.net.de>
Andreas Dilger <adilger@intel.com>
Andrew Barnes <barnes333@gmail.com>
Andrew Hamilton <ahamilto@tjhsst.edu>
Andrew Reid <ColdCanuck@nailedtotheperch.com>
Andrew Stormont <andrew.stormont@nexenta.com>
Andrew Tselischev <andrewtselischev@gmail.com>
Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Andriy Gapon <avg@freebsd.org>
Andy Bakun <github@thwartedefforts.org>
Aniruddha Shankar <k@191a.net>
Antonio Russo <antonio.e.russo@gmail.com>
Arkadiusz Bubała <arkadiusz.bubala@open-e.com>
Arne Jansen <arne@die-jansens.de>
Aron Xu <happyaron.xu@gmail.com>
Bart Coddens <bart.coddens@gmail.com>
Basil Crow <basil.crow@delphix.com>
Huang Liu <liu.huang@zte.com.cn>
Ben Allen <bsallen@alcf.anl.gov>
Ben Rubson <ben.rubson@gmail.com>
Benjamin Albrecht <git@albrecht.io>
Bill McGonigle <bill-github.com-public1@bfccomputing.com>
Bill Pijewski <wdp@joyent.com>
Boris Protopopov <boris.protopopov@nexenta.com>
Brad Lewis <brad.lewis@delphix.com>
Brian Behlendorf <behlendorf1@llnl.gov>
Brian J. Murrell <brian@sun.com>
Caleb James DeLisle <calebdelisle@lavabit.com>
Cao Xuewen <cao.xuewen@zte.com.cn>
Carlo Landmeter <clandmeter@gmail.com>
Carlos Alberto Lopez Perez <clopez@igalia.com>
Chaoyu Zhang <zhang.chaoyu@zte.com.cn>
Chen Can <chen.can2@zte.com.cn>
Chen Haiquan <oc@yunify.com>
Chip Parker <aparker@enthought.com>
Chris Burroughs <chris.burroughs@gmail.com>
Chris Dunlap <cdunlap@llnl.gov>
Chris Dunlop <chris@onthe.net.au>
Chris Siden <chris.siden@delphix.com>
Chris Wedgwood <cw@f00f.org>
Chris Williamson <chris.williamson@delphix.com>
Chris Zubrzycki <github@mid-earth.net>
Christ Schlacta <aarcane@aarcane.info>
Christer Ekholm <che@chrekh.se>
Christian Kohlschütter <christian@kohlschutter.com>
Christian Neukirchen <chneukirchen@gmail.com>
Christian Schwarz <me@cschwarz.com>
Christopher Voltz <cjunk@voltz.ws>
Chunwei Chen <david.chen@nutanix.com>
Clemens Fruhwirth <clemens@endorphin.org>
Colin Ian King <colin.king@canonical.com>
Craig Loomis <cloomis@astro.princeton.edu>
Craig Sanders <github@taz.net.au>
Cyril Plisko <cyril.plisko@infinidat.com>
DHE <git@dehacked.net>
Damian Wojsław <damian@wojslaw.pl>
Dan Kimmel <dan.kimmel@delphix.com>
Dan McDonald <danmcd@nexenta.com>
Dan Swartzendruber <dswartz@druber.com>
Dan Vatca <dan.vatca@gmail.com>
Daniel Hoffman <dj.hoffman@delphix.com>
Daniel Verite <daniel@verite.pro>
Daniil Lunev <d.lunev.mail@gmail.com>
Darik Horn <dajhorn@vanadac.com>
Dave Eddy <dave@daveeddy.com>
David Lamparter <equinox@diac24.net>
David Qian <david.qian@intel.com>
David Quigley <david.quigley@intel.com>
Debabrata Banerjee <dbanerje@akamai.com>
Denys Rtveliashvili <denys@rtveliashvili.name>
Derek Dai <daiderek@gmail.com>
Dimitri John Ledkov <xnox@ubuntu.com>
Dmitry Khasanov <pik4ez@gmail.com>
Dominik Hassler <hadfl@omniosce.org>
Dominik Honnef <dominikh@fork-bomb.org>
Don Brady <don.brady@delphix.com>
Dr. András Korn <korn-github.com@elan.rulez.org>
Eli Rosenthal <eli.rosenthal@delphix.com>
Eric Desrochers <eric.desrochers@canonical.com>
Eric Dillmann <eric@jave.fr>
Eric Schrock <Eric.Schrock@delphix.com>
Etienne Dechamps <etienne@edechamps.fr>
Evan Susarret <evansus@gmail.com>
Fabian Grünbichler <f.gruenbichler@proxmox.com>
Fajar A. Nugraha <github@fajar.net>
Fan Yong <fan.yong@intel.com>
Feng Sun <loyou85@gmail.com>
Frederik Wessels <wessels147@gmail.com>
Frédéric Vanniere <f.vanniere@planet-work.com>
Garrett D'Amore <garrett@nexenta.com>
Garrison Jensen <garrison.jensen@gmail.com>
Gary Mills <gary_mills@fastmail.fm>
Gaurav Kumar <gauravk.18@gmail.com>
GeLiXin <ge.lixin@zte.com.cn>
George Amanakis <g_amanakis@yahoo.com>
George Melikov <mail@gmelikov.ru>
George Wilson <gwilson@delphix.com>
Georgy Yakovlev <ya@sysdump.net>
Giuseppe Di Natale <guss80@gmail.com>
Gordan Bobic <gordan@redsleeve.org>
Gordon Ross <gwr@nexenta.com>
Gregor Kopka <gregor@kopka.net>
Grischa Zengel <github.zfsonlinux@zengel.info>
Gunnar Beutner <gunnar@beutner.name>
Gvozden Neskovic <neskovic@gmail.com>
Hajo Möller <dasjoe@gmail.com>
Hans Rosenfeld <hans.rosenfeld@nexenta.com>
Håkan Johansson <f96hajo@chalmers.se>
Igor Kozhukhov <ikozhukhov@gmail.com>
Igor Lvovsky <ilvovsky@gmail.com>
Isaac Huang <he.huang@intel.com>
JK Dingwall <james@dingwall.me.uk>
Jacek Fefliński <feflik@gmail.com>
James Cowgill <james.cowgill@mips.com>
James Lee <jlee@thestaticvoid.com>
James Pan <jiaming.pan@yahoo.com>
Jan Engelhardt <jengelh@inai.de>
Jan Kryl <jan.kryl@nexenta.com>
Jan Sanislo <oystr@cs.washington.edu>
Jason King <jason.brian.king@gmail.com>
Jason Zaman <jasonzaman@gmail.com>
Javen Wu <wu.javen@gmail.com>
Jeremy Gill <jgill@parallax-innovations.com>
Jeremy Jones <jeremy@delphix.com>
Jerry Jelinek <jerry.jelinek@joyent.com>
Jinshan Xiong <jinshan.xiong@intel.com>
Joe Stein <joe.stein@delphix.com>
John Albietz <inthecloud247@gmail.com>
John Eismeier <john.eismeier@gmail.com>
John L. Hammond <john.hammond@intel.com>
John Layman <jlayman@sagecloud.com>
John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
John Wren Kennedy <john.kennedy@delphix.com>
Johnny Stenback <github@jstenback.com>
Jorgen Lundman <lundman@lundman.net>
Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Joshua M. Clulow <josh@sysmgr.org>
Justin Bedő <cu@cua0.org>
Justin Lecher <jlec@gentoo.org>
Justin T. Gibbs <gibbs@FreeBSD.org>
Jörg Thalheim <joerg@higgsboson.tk>
KORN Andras <korn@elan.rulez.org>
Kamil Domański <kamil@domanski.co>
Karsten Kretschmer <kkretschmer@gmail.com>
Kash Pande <kash@tripleback.net>
Keith M Wesolowski <wesolows@foobazco.org>
Kevin Tanguy <kevin.tanguy@ovh.net>
KireinaHoro <i@jsteward.moe>
Kohsuke Kawaguchi <kk@kohsuke.org>
Kyle Blatter <kyleblatter@llnl.gov>
Kyle Fuller <inbox@kylefuller.co.uk>
Loli <ezomori.nozomu@gmail.com>
Lars Johannsen <laj@it.dk>
Li Dongyang <dongyang.li@anu.edu.au>
Li Wei <W.Li@Sun.COM>
Lukas Wunner <lukas@wunner.de>
Madhav Suresh <madhav.suresh@delphix.com>
Manoj Joseph <manoj.joseph@delphix.com>
Manuel Amador (Rudd-O) <rudd-o@rudd-o.com>
Marcel Huber <marcelhuberfoo@gmail.com>
Marcel Telka <marcel.telka@nexenta.com>
Marcel Wysocki <maci.stgn@gmail.com>
Mark Shellenbaum <Mark.Shellenbaum@Oracle.COM>
Mark Wright <markwright@internode.on.net>
Martin Matuska <mm@FreeBSD.org>
Massimo Maggi <me@massimo-maggi.eu>
Matt Johnston <matt@fugro-fsi.com.au>
Matt Kemp <matt@mattikus.com>
Matthew Ahrens <matt@delphix.com>
Matthew Thode <mthode@mthode.org>
Matus Kral <matuskral@me.com>
Max Grossman <max.grossman@delphix.com>
Maximilian Mehnert <maximilian.mehnert@gmx.de>
Michael Gebetsroither <michael@mgeb.org>
Michael Kjorling <michael@kjorling.se>
Michael Martin <mgmartin.mgm@gmail.com>
Mike Gerdts <mike.gerdts@joyent.com>
Mike Harsch <mike@harschsystems.com>
Mike Leddy <mike.leddy@gmail.com>
Mike Swanson <mikeonthecomputer@gmail.com>
Milan Jurik <milan.jurik@xylab.cz>
Morgan Jones <mjones@rice.edu>
Moritz Maxeiner <moritz@ucworks.org>
Nathaniel Clark <Nathaniel.Clark@misrule.us>
Nathaniel Wesley Filardo <nwf@cs.jhu.edu>
Nav Ravindranath <nav@delphix.com>
Neal Gompa (ニール・ゴンパ) <ngompa13@gmail.com>
Ned Bass <bass6@llnl.gov>
Neependra Khare <neependra@kqinfotech.com>
Neil Stockbridge <neil@dist.ro>
Nick Garvey <garvey.nick@gmail.com>
Nikolay Borisov <n.borisov.lkml@gmail.com>
Olaf Faaland <faaland1@llnl.gov>
Oleg Drokin <green@linuxhacker.ru>
Oleg Stepura <oleg@stepura.com>
Patrik Greco <sikevux@sikevux.se>
Paul B. Henson <henson@acm.org>
Paul Dagnelie <pcd@delphix.com>
Paul Zuchowski <pzuchowski@datto.com>
Pavel Boldin <boldin.pavel@gmail.com>
Pavel Zakharov <pavel.zakharov@delphix.com>
Pawel Jakub Dawidek <pjd@FreeBSD.org>
Pedro Giffuni <pfg@freebsd.org>
Peng <peng.hse@xtaotech.com>
Peter Ashford <ashford@accs.com>
Prakash Surya <prakash.surya@delphix.com>
Prasad Joshi <prasadjoshi124@gmail.com>
Ralf Ertzinger <ralf@skytale.net>
Randall Mason <ClashTheBunny@gmail.com>
Remy Blank <remy.blank@pobox.com>
Ricardo M. Correia <ricardo.correia@oracle.com>
Rich Ercolani <rincebrain@gmail.com>
Richard Elling <Richard.Elling@RichardElling.com>
Richard Laager <rlaager@wiktel.com>
Richard Lowe <richlowe@richlowe.net>
Richard Sharpe <rsharpe@samba.org>
Richard Yao <ryao@gentoo.org>
Rohan Puri <rohan.puri15@gmail.com>
Romain Dolbeau <romain.dolbeau@atos.net>
Roman Strashkin <roman.strashkin@nexenta.com>
Ruben Kerkhof <ruben@rubenkerkhof.com>
Saso Kiselkov <saso.kiselkov@nexenta.com>
Scot W. Stevenson <scot.stevenson@gmail.com>
Sean Eric Fagan <sef@ixsystems.com>
Sen Haerens <sen@senhaerens.be>
Serapheim Dimitropoulos <serapheim@delphix.com>
Seth Forshee <seth.forshee@canonical.com>
Shampavman <sham.pavman@nexenta.com>
Shen Yan <shenyanxxxy@qq.com>
Simon Guest <simon.guest@tesujimath.org>
Simon Klinkert <simon.klinkert@gmail.com>
Sowrabha Gopal <sowrabha.gopal@delphix.com>
Stanislav Seletskiy <s.seletskiy@gmail.com>
Steffen Müthing <steffen.muething@iwr.uni-heidelberg.de>
Stephen Blinick <stephen.blinick@delphix.com>
Steve Dougherty <sdougherty@barracuda.com>
Steven Burgess <sburgess@dattobackup.com>
Steven Hartland <smh@freebsd.org>
Steven Johnson <sjohnson@sakuraindustries.com>
Stian Ellingsen <stian@plaimi.net>
Suman Chakravartula <schakrava@gmail.com>
Sydney Vanda <sydney.m.vanda@intel.com>
Sören Tempel <soeren+git@soeren-tempel.net>
Thijs Cramer <thijs.cramer@gmail.com>
Tim Chase <tim@chase2k.com>
Tim Connors <tconnors@rather.puzzling.org>
Tim Crawford <tcrawford@datto.com>
Tim Haley <Tim.Haley@Sun.COM>
Tobin Harding <me@tobin.cc>
Tom Caputi <tcaputi@datto.com>
Tom Matthews <tom@axiom-partners.com>
Tom Prince <tom.prince@ualberta.net>
Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Tony Hutter <hutter2@llnl.gov>
Toomas Soome <tsoome@me.com>
Trey Dockendorf <treydock@gmail.com>
Turbo Fredriksson <turbo@bayour.com>
Tyler J. Stachecki <stachecki.tyler@gmail.com>
Vitaut Bajaryn <vitaut.bayaryn@gmail.com>
Weigang Li <weigang.li@intel.com>
Will Andrews <will@freebsd.org>
Will Rouesnel <w.rouesnel@gmail.com>
Wolfgang Bumiller <w.bumiller@proxmox.com>
Xin Li <delphij@FreeBSD.org>
Ying Zhu <casualfisher@gmail.com>
YunQiang Su <syq@debian.org>
Yuri Pankov <yuri.pankov@gmail.com>
Yuxuan Shui <yshuiv7@gmail.com>
Zachary Bedell <zac@thebedells.org>
+2
@@ -0,0 +1,2 @@
The [OpenZFS Code of Conduct](http://www.open-zfs.org/wiki/Code_of_Conduct)
applies to spaces associated with the ZFS on Linux project, including GitHub.
+21 -27
@@ -1,33 +1,27 @@
The majority of the code in the ZFS on Linux port comes from OpenSolaris
which has been released under the terms of the CDDL open source license.
This includes the core ZFS code, libavl, libnvpair, libefi, libunicode,
and libutil. The original OpenSolaris source can be downloaded from:
Refer to the git commit log for authoritative copyright attribution.
http://dlc.sun.com/osol/on/downloads/b121/on-src.tar.bz2
The original ZFS source code was obtained from OpenSolaris which was
released under the terms of the CDDL open source license. Additional
changes have been included from OpenZFS and the Illumos project which
are similarly licensed. These projects can be found on GitHub at:
Files which do not originate from OpenSolaris are noted in the file header
and attributed properly. These exceptions include, but are not limited
to, the vdev_disk.c and zvol.c implementation which are licensed under
the CDDL.
The zpios test code is originally derived from the Lustre pios test code
which is licensed under the GPLv2. As such the heavily modified zpios
kernel test code also remains licensed under the GPLv2.
The latest stable and development versions of this port can be downloaded
from the official ZFS on Linux site located at:
http://zfsonlinux.org/
This ZFS on Linux port was produced at the Lawrence Livermore National
Laboratory (LLNL) under Contract No. DE-AC52-07NA27344 (Contract 44)
between the U.S. Department of Energy (DOE) and Lawrence Livermore
National Security, LLC (LLNS) for the operation of LLNL. It has been
approved for release under LLNL-CODE-403049.
* https://github.com/illumos/illumos-gate
* https://github.com/openzfs/openzfs
Unless otherwise noted, all files in this distribution are released
under the Common Development and Distribution License (CDDL).
Exceptions are noted within the associated source files. See the file
OPENSOLARIS.LICENSE for more information.
Refer to the git commit log for authoritative copyright attribution.
Exceptions are noted within the associated source file headers and
by including a THIRDPARTYLICENSE file with the license terms. A few
notable exceptions and their respective licenses include:
* Skein Checksum Implementation: module/icp/algs/skein/THIRDPARTYLICENSE
* AES Implementation: module/icp/asm-x86_64/aes/THIRDPARTYLICENSE.gladman
* AES Implementation: module/icp/asm-x86_64/aes/THIRDPARTYLICENSE.openssl
* PBKDF2 Implementation: lib/libzfs/THIRDPARTYLICENSE.openssl
* SPL Implementation: module/spl/THIRDPARTYLICENSE.gplv2
This product includes software developed by the OpenSSL Project for use
in the OpenSSL Toolkit (http://www.openssl.org/)
See the LICENSE and NOTICE for more information.
-24
@@ -1,24 +0,0 @@
This work was produced at the Lawrence Livermore National Laboratory
(LLNL) under Contract No. DE-AC52-07NA27344 (Contract 44) between
the U.S. Department of Energy (DOE) and Lawrence Livermore National
Security, LLC (LLNS) for the operation of LLNL.
This work was prepared as an account of work sponsored by an agency of
the United States Government. Neither the United States Government nor
Lawrence Livermore National Security, LLC nor any of their employees,
makes any warranty, express or implied, or assumes any liability or
responsibility for the accuracy, completeness, or usefulness of any
information, apparatus, product, or process disclosed, or represents
that its use would not infringe privately-owned rights.
Reference herein to any specific commercial products, process, or
services by trade name, trademark, manufacturer or otherwise does
not necessarily constitute or imply its endorsement, recommendation,
or favoring by the United States Government or Lawrence Livermore
National Security, LLC. The views and opinions of authors expressed
herein do not necessarily state or reflect those of the United States
Government or Lawrence Livermore National Security, LLC, and shall
not be used for advertising or product endorsement purposes.
The precise terms and conditions for copying, distribution, and
modification are specified in the file OPENSOLARIS.LICENSE.
+10 -8
@@ -1,8 +1,10 @@
Meta: 1
Name: zfs
Branch: 1.0
Version: 0.7.13
Release: 1
Release-Tags: relext
License: CDDL
Author: OpenZFS on Linux
Meta: 1
Name: zfs
Branch: 1.0
Version: 0.8.2
Release: 1
Release-Tags: relext
License: CDDL
Author: OpenZFS on Linux
Linux-Maximum: 5.3
Linux-Minimum: 2.6.32
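The new `Linux-Maximum` and `Linux-Minimum` fields can be read like any other `Key: value` line in META. The following is a minimal sketch, not the project's own tooling: the helper names (`parse_meta`, `version_tuple`, `kernel_in_range`) are hypothetical, and the assumption that a maximum of `5.3` covers every `5.3.x` point release is an interpretation, not something this diff states.

```python
# Sample text copied from the updated META file in this diff.
META_TEXT = """\
Meta: 1
Name: zfs
Branch: 1.0
Version: 0.8.2
Release: 1
Release-Tags: relext
License: CDDL
Author: OpenZFS on Linux
Linux-Maximum: 5.3
Linux-Minimum: 2.6.32
"""


def parse_meta(text):
    """Parse 'Key: value' lines into a dict (hypothetical helper)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def version_tuple(version):
    """Turn '2.6.32' into (2, 6, 32) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def kernel_in_range(kernel, fields):
    """Check a kernel version against Linux-Minimum / Linux-Maximum.

    Assumption: the maximum is compared only to the precision it is
    written with, so a maximum of 5.3 accepts 5.3.7 but rejects 5.4.0.
    """
    current = version_tuple(kernel)
    minimum = version_tuple(fields["Linux-Minimum"])
    maximum = version_tuple(fields["Linux-Maximum"])
    return current >= minimum and current[:len(maximum)] <= maximum


fields = parse_meta(META_TEXT)
```

With the values above, `kernel_in_range("4.19.0", fields)` is true, while `"5.4.0"` and `"2.6.31"` both fall outside the range.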
+85 -16
@@ -11,20 +11,34 @@ endif
if CONFIG_KERNEL
SUBDIRS += module
extradir = @prefix@/src/zfs-$(VERSION)
extradir = $(prefix)/src/zfs-$(VERSION)
extra_HEADERS = zfs.release.in zfs_config.h.in
kerneldir = @prefix@/src/zfs-$(VERSION)/$(LINUX_VERSION)
kerneldir = $(prefix)/src/zfs-$(VERSION)/$(LINUX_VERSION)
nodist_kernel_HEADERS = zfs.release zfs_config.h module/$(LINUX_SYMBOLS)
endif
AUTOMAKE_OPTIONS = foreign
EXTRA_DIST = autogen.sh copy-builtin
EXTRA_DIST += config/config.awk config/rpm.am config/deb.am config/tgz.am
EXTRA_DIST += META DISCLAIMER COPYRIGHT README.markdown OPENSOLARIS.LICENSE
EXTRA_DIST += META AUTHORS COPYRIGHT LICENSE NEWS NOTICE README.md
EXTRA_DIST += CODE_OF_CONDUCT.md
# Include all the extra licensing information for modules
EXTRA_DIST += module/icp/algs/skein/THIRDPARTYLICENSE module/icp/algs/skein/THIRDPARTYLICENSE.descrip
EXTRA_DIST += module/icp/asm-x86_64/aes/THIRDPARTYLICENSE.gladman module/icp/asm-x86_64/aes/THIRDPARTYLICENSE.gladman.descrip
EXTRA_DIST += module/icp/asm-x86_64/aes/THIRDPARTYLICENSE.openssl module/icp/asm-x86_64/aes/THIRDPARTYLICENSE.openssl.descrip
EXTRA_DIST += module/spl/THIRDPARTYLICENSE.gplv2 module/spl/THIRDPARTYLICENSE.gplv2.descrip
EXTRA_DIST += module/zfs/THIRDPARTYLICENSE.cityhash module/zfs/THIRDPARTYLICENSE.cityhash.descrip
@CODE_COVERAGE_RULES@
.PHONY: gitrev
gitrev:
-${top_srcdir}/scripts/make_gitrev.sh
BUILT_SOURCES = gitrev
distclean-local::
-$(RM) -R autom4te*.cache
-find . \( -name SCCS -o -name BitKeeper -o -name .svn -o -name CVS \
@@ -37,30 +51,79 @@ distclean-local::
-o -name '*.gcno' \) \
-type f -print | xargs $(RM)
dist-hook:
all-local:
-[ -x ${top_builddir}/scripts/zfs-tests.sh ] && \
${top_builddir}/scripts/zfs-tests.sh -c
dist-hook: gitrev
cp ${top_srcdir}/include/zfs_gitrev.h $(distdir)/include; \
sed -i 's/Release:[[:print:]]*/Release: $(RELEASE)/' \
$(distdir)/META
checkstyle: cstyle shellcheck flake8 commitcheck
# For compatibility, create a matching spl-x.y.z directory which contains
# symlinks to the updated header and object file locations. These
# compatibility links will be removed in the next major release.
if CONFIG_KERNEL
install-data-hook:
rm -rf $(DESTDIR)$(prefix)/src/spl-$(VERSION) && \
mkdir $(DESTDIR)$(prefix)/src/spl-$(VERSION) && \
cd $(DESTDIR)$(prefix)/src/spl-$(VERSION) && \
ln -s ../zfs-$(VERSION)/include/spl include && \
ln -s ../zfs-$(VERSION)/$(LINUX_VERSION) $(LINUX_VERSION) && \
ln -s ../zfs-$(VERSION)/zfs_config.h.in spl_config.h.in && \
ln -s ../zfs-$(VERSION)/zfs.release.in spl.release.in && \
cd $(DESTDIR)$(prefix)/src/zfs-$(VERSION)/$(LINUX_VERSION) && \
ln -fs zfs_config.h spl_config.h && \
ln -fs zfs.release spl.release
endif
codecheck: cstyle shellcheck flake8 mancheck testscheck vcscheck
checkstyle: codecheck commitcheck
commitcheck:
@if git rev-parse --git-dir > /dev/null 2>&1; then \
scripts/commitcheck.sh; \
${top_srcdir}/scripts/commitcheck.sh; \
fi
cstyle:
@find ${top_srcdir} -name '*.[hc]' ! -name 'zfs_config.*' \
! -name '*.mod.c' -type f -exec scripts/cstyle.pl -cpP {} \+
! -name '*.mod.c' -type f \
-exec ${top_srcdir}/scripts/cstyle.pl -cpP {} \+
shellcheck:
@if type shellcheck > /dev/null 2>&1; then \
shellcheck --exclude=SC1090 --format=gcc scripts/paxcheck.sh \
scripts/zloop.sh \
scripts/zfs-tests.sh \
scripts/zfs.sh \
scripts/commitcheck.sh \
$$(find cmd/zed/zed.d/*.sh -type f) \
$$(find cmd/zpool/zpool.d/* -executable); \
shellcheck --exclude=SC1090 --format=gcc \
$$(find ${top_srcdir}/scripts/*.sh -type f) \
$$(find ${top_srcdir}/cmd/zed/zed.d/*.sh -type f) \
$$(find ${top_srcdir}/cmd/zpool/zpool.d/* -executable); \
else \
echo "skipping shellcheck because shellcheck is not installed"; \
fi
mancheck:
@if type mandoc > /dev/null 2>&1; then \
find ${top_srcdir}/man/man8 -type f -name 'zfs.8' \
-o -name 'zpool.8' -o -name 'zdb.8' \
-o -name 'zgenhostid.8' | \
xargs mandoc -Tlint -Werror; \
else \
echo "skipping mancheck because mandoc is not installed"; \
fi
testscheck:
@find ${top_srcdir}/tests/zfs-tests -type f \
\( -name '*.ksh' -not -executable \) -o \
\( -name '*.kshlib' -executable \) -o \
\( -name '*.shlib' -executable \) -o \
\( -name '*.cfg' -executable \) | \
xargs -r stat -c '%A %n' | \
awk '{c++; print} END {if(c>0) exit 1}'
vcscheck:
@if git rev-parse --git-dir > /dev/null 2>&1; then \
git ls-files . --exclude-standard --others | \
awk '{c++; print} END {if(c>0) exit 1}' ; \
fi
lint: cppcheck paxcheck
@@ -70,17 +133,23 @@ cppcheck:
cppcheck --quiet --force --error-exitcode=2 --inline-suppr \
--suppressions-list=.github/suppressions.txt \
-UHAVE_SSE2 -UHAVE_AVX512F -UHAVE_UIO_ZEROCOPY \
-UHAVE_DNLC ${top_srcdir}; \
${top_srcdir}; \
else \
echo "skipping cppcheck because cppcheck is not installed"; \
fi
paxcheck:
@if type scanelf > /dev/null 2>&1; then \
scripts/paxcheck.sh ${top_srcdir}; \
${top_srcdir}/scripts/paxcheck.sh ${top_srcdir}; \
else \
echo "skipping paxcheck because scanelf is not installed"; \
fi
flake8:
@if type flake8 > /dev/null 2>&1; then \
flake8 ${top_srcdir}; \
else \
echo "skipping flake8 because flake8 is not installed"; \
fi
ctags:
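The `testscheck` target added above enforces a file-mode convention in the test suite: `*.ksh` scripts must be executable, while `*.kshlib`, `*.shlib` and `*.cfg` files must not be. A rough Python equivalent is sketched below for illustration. The function names are hypothetical, and the sketch differs from the `find` rule in two ways: it applies the regular-file test to all four patterns, and it treats any execute bit as "executable", whereas `find -executable` asks whether the current user may execute the file.

```python
import os
import stat

# Suffixes taken from the testscheck rule: scripts must be executable,
# libraries and config files must not be.
MUST_BE_EXECUTABLE = (".ksh",)
MUST_NOT_BE_EXECUTABLE = (".kshlib", ".shlib", ".cfg")


def is_violation(name, executable):
    """Return True when a file's executable bit contradicts its suffix."""
    if name.endswith(MUST_BE_EXECUTABLE):
        return not executable
    if name.endswith(MUST_NOT_BE_EXECUTABLE):
        return executable
    return False


def find_violations(root):
    """Walk root and list regular files whose mode breaks the convention."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mode = os.lstat(path).st_mode
            if not stat.S_ISREG(mode):
                continue
            # Any execute bit counts here (see the note in the lead-in).
            executable = bool(mode & 0o111)
            if is_violation(name, executable):
                bad.append(path)
    return sorted(bad)
```

A caller that exits with status 1 whenever `find_violations("tests/zfs-tests")` returns a non-empty list would mirror the `awk '{c++; print} END {if(c>0) exit 1}'` idiom, which prints every offending file and then fails the target.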
+3
@@ -0,0 +1,3 @@
Descriptions of all releases can be found on GitHub:
https://github.com/zfsonlinux/zfs/releases
+16
@@ -0,0 +1,16 @@
This work was produced under the auspices of the U.S. Department of Energy by
Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
This work was prepared as an account of work sponsored by an agency of the
United States Government. Neither the United States Government nor Lawrence
Livermore National Security, LLC, nor any of their employees makes any warranty,
expressed or implied, or assumes any legal liability or responsibility for the
accuracy, completeness, or usefulness of any information, apparatus, product, or
process disclosed, or represents that its use would not infringe privately owned
rights. Reference herein to any specific commercial product, process, or service
by trade name, trademark, manufacturer, or otherwise does not necessarily
constitute or imply its endorsement, recommendation, or favoring by the United
States Government or Lawrence Livermore National Security, LLC. The views and
opinions of authors expressed herein do not necessarily state or reflect those
of the United States Government or Lawrence Livermore National Security, LLC,
and shall not be used for advertising or product endorsement purposes.
+12
@@ -4,16 +4,28 @@ ZFS on Linux is an advanced file system and volume manager which was originally
developed for Solaris and is now maintained by the OpenZFS community.
[![codecov](https://codecov.io/gh/zfsonlinux/zfs/branch/master/graph/badge.svg)](https://codecov.io/gh/zfsonlinux/zfs)
[![coverity](https://scan.coverity.com/projects/1973/badge.svg)](https://scan.coverity.com/projects/zfsonlinux-zfs)
# Official Resources
* [Site](http://zfsonlinux.org)
* [Wiki](https://github.com/zfsonlinux/zfs/wiki)
* [Mailing lists](https://github.com/zfsonlinux/zfs/wiki/Mailing-Lists)
* [OpenZFS site](http://open-zfs.org/)
# Installation
Full documentation for installing ZoL on your favorite Linux distribution can
be found at [our site](http://zfsonlinux.org/).
# Contribute & Develop
We have a separate document with [contribution guidelines](./.github/CONTRIBUTING.md).
# Release
ZFS on Linux is released under a CDDL license.
For more details see the NOTICE, LICENSE and COPYRIGHT files; `UCRL-CODE-235197`
# Supported Kernels
* The `META` file contains the officially recognized supported kernel versions.
+22 -5
@@ -1,17 +1,15 @@
#!/bin/sh
### prepare
#TEST_PREPARE_WATCHDOG="no"
### SPLAT
#TEST_SPLAT_SKIP="yes"
#TEST_SPLAT_OPTIONS="-acvx"
#TEST_PREPARE_WATCHDOG="yes"
#TEST_PREPARE_SHARES="yes"
### ztest
#TEST_ZTEST_SKIP="yes"
#TEST_ZTEST_TIMEOUT=1800
#TEST_ZTEST_DIR="/var/tmp/"
#TEST_ZTEST_OPTIONS="-V"
#TEST_ZTEST_CORE_DIR="/mnt/zloop"
### zimport
#TEST_ZIMPORT_SKIP="yes"
@@ -31,9 +29,13 @@
### zfs-tests.sh
#TEST_ZFSTESTS_SKIP="yes"
#TEST_ZFSTESTS_DIR="/mnt/"
#TEST_ZFSTESTS_DISKS="vdb vdc vdd"
#TEST_ZFSTESTS_DISKSIZE="8G"
#TEST_ZFSTESTS_ITERS="1"
#TEST_ZFSTESTS_OPTIONS="-vx"
#TEST_ZFSTESTS_RUNFILE="linux.run"
#TEST_ZFSTESTS_TAGS="functional"
### zfsstress
#TEST_ZFSSTRESS_SKIP="yes"
@@ -42,6 +44,7 @@
#TEST_ZFSSTRESS_RUNTIME=300
#TEST_ZFSSTRESS_POOL="tank"
#TEST_ZFSSTRESS_FS="fish"
#TEST_ZFSSTRESS_FSOPT="-o overlay=on"
#TEST_ZFSSTRESS_VDEV="/var/tmp/vdev"
#TEST_ZFSSTRESS_DIR="/$TEST_ZFSSTRESS_POOL/$TEST_ZFSSTRESS_FS"
#TEST_ZFSSTRESS_OPTIONS=""
@@ -83,6 +86,20 @@ Ubuntu*)
;;
esac
###
#
# Run ztest longer on the "coverage" builders to gain more code coverage
# data out of ztest, libzpool, etc.
#
case "$BB_NAME" in
*coverage*)
TEST_ZTEST_TIMEOUT=3600
;;
*)
TEST_ZTEST_TIMEOUT=900
;;
esac
###
#
# Disable the following test suites on 32-bit systems.
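The `case "$BB_NAME"` block added above picks the ztest timeout by glob-matching the builder name. The same selection can be sketched in Python with `fnmatch.fnmatchcase`, which, like the shell's `case`, matches case-sensitively. The function name and the builder names in the usage note are made up for illustration.

```python
from fnmatch import fnmatchcase


def ztest_timeout(builder_name):
    """Return the ztest timeout in seconds for a builder (hypothetical helper).

    Mirrors the shell case statement: names matching *coverage* get the
    longer run, everything else falls through to the default branch.
    """
    if fnmatchcase(builder_name, "*coverage*"):
        return 3600
    return 900
```

For example, `ztest_timeout("ubuntu-coverage-builder")` selects the one-hour run, while a name without the substring, such as `"centos-7"`, gets 900 seconds.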
+1 -1
@@ -1,4 +1,4 @@
#!/bin/sh
autoreconf -fiv
autoreconf -fiv || exit 1
rm -Rf autom4te.cache
+8 -3
@@ -1,3 +1,8 @@
SUBDIRS = zfs zpool zdb zhack zinject zstreamdump ztest zpios
SUBDIRS += mount_zfs fsck_zfs zvol_id vdev_id arcstat dbufstat zed
SUBDIRS += arc_summary raidz_test zgenhostid
SUBDIRS = zfs zpool zdb zhack zinject zstreamdump ztest
SUBDIRS += fsck_zfs vdev_id raidz_test zgenhostid
if USING_PYTHON
SUBDIRS += arcstat arc_summary dbufstat
endif
SUBDIRS += mount_zfs zed zvol_id zvol_wait
+11 -1
@@ -1 +1,11 @@
dist_bin_SCRIPTS = arc_summary.py
EXTRA_DIST = arc_summary2 arc_summary3
if USING_PYTHON_2
dist_bin_SCRIPTS = arc_summary2
install-exec-hook:
mv $(DESTDIR)$(bindir)/arc_summary2 $(DESTDIR)$(bindir)/arc_summary
else
dist_bin_SCRIPTS = arc_summary3
install-exec-hook:
mv $(DESTDIR)$(bindir)/arc_summary3 $(DESTDIR)$(bindir)/arc_summary
endif
@@ -1,4 +1,4 @@
#!/usr/bin/python
#!/usr/bin/python2
#
# $Id: arc_summary.pl,v 388:e27800740aa2 2011-07-08 02:53:29Z jhell $
#
@@ -35,6 +35,8 @@
# Note some of this code uses older code (eg getopt instead of argparse,
# subprocess.Popen() instead of subprocess.run()) because we need to support
# some very old versions of Python.
#
"""Print statistics on the ZFS Adaptive Replacement Cache (ARC)
Provides basic information on the ARC, its efficiency, the L2ARC (if present),
@@ -204,6 +206,10 @@ def get_arc_summary(Kstat):
arc_size = Kstat["kstat.zfs.misc.arcstats.size"]
mru_size = Kstat["kstat.zfs.misc.arcstats.mru_size"]
mfu_size = Kstat["kstat.zfs.misc.arcstats.mfu_size"]
meta_limit = Kstat["kstat.zfs.misc.arcstats.arc_meta_limit"]
meta_size = Kstat["kstat.zfs.misc.arcstats.arc_meta_used"]
dnode_limit = Kstat["kstat.zfs.misc.arcstats.arc_dnode_limit"]
dnode_size = Kstat["kstat.zfs.misc.arcstats.dnode_size"]
target_max_size = Kstat["kstat.zfs.misc.arcstats.c_max"]
target_min_size = Kstat["kstat.zfs.misc.arcstats.c_min"]
target_size = Kstat["kstat.zfs.misc.arcstats.c"]
@@ -228,6 +234,22 @@ def get_arc_summary(Kstat):
'per': fPerc(target_size, target_max_size),
'num': fBytes(target_size),
}
output['arc_sizing']['meta_limit'] = {
'per': fPerc(meta_limit, target_max_size),
'num': fBytes(meta_limit),
}
output['arc_sizing']['meta_size'] = {
'per': fPerc(meta_size, meta_limit),
'num': fBytes(meta_size),
}
output['arc_sizing']['dnode_limit'] = {
'per': fPerc(dnode_limit, meta_limit),
'num': fBytes(dnode_limit),
}
output['arc_sizing']['dnode_size'] = {
'per': fPerc(dnode_size, dnode_limit),
'num': fBytes(dnode_size),
}
# ARC Hash Breakdown
output['arc_hash_break'] = {}
@@ -333,6 +355,26 @@ def _arc_summary(Kstat):
arc['arc_size_break']['frequently_used_cache_size']['num'],
)
)
sys.stdout.write("\tMetadata Size (Hard Limit):\t%s\t%s\n" % (
arc['arc_sizing']['meta_limit']['per'],
arc['arc_sizing']['meta_limit']['num'],
)
)
sys.stdout.write("\tMetadata Size:\t\t\t%s\t%s\n" % (
arc['arc_sizing']['meta_size']['per'],
arc['arc_sizing']['meta_size']['num'],
)
)
sys.stdout.write("\tDnode Size (Hard Limit):\t%s\t%s\n" % (
arc['arc_sizing']['dnode_limit']['per'],
arc['arc_sizing']['dnode_limit']['num'],
)
)
sys.stdout.write("\tDnode Size:\t\t\t%s\t%s\n" % (
arc['arc_sizing']['dnode_size']['per'],
arc['arc_sizing']['dnode_size']['num'],
)
)
sys.stdout.write("\n")
@@ -965,7 +1007,7 @@ def zfs_header():
def usage():
"""Print usage information"""
sys.stdout.write("Usage: arc_summary.py [-h] [-a] [-d] [-p PAGE]\n\n")
sys.stdout.write("Usage: arc_summary [-h] [-a] [-d] [-p PAGE]\n\n")
sys.stdout.write("\t -h, --help : "
"Print this help message and exit\n")
sys.stdout.write("\t -a, --alternate : "
@@ -978,10 +1020,10 @@ def usage():
"should be an integer between 1 and " +
str(len(unSub)) + "\n\n")
sys.stdout.write("Examples:\n")
sys.stdout.write("\tarc_summary.py -a\n")
sys.stdout.write("\tarc_summary.py -p 4\n")
sys.stdout.write("\tarc_summary.py -ad\n")
sys.stdout.write("\tarc_summary.py --page=2\n")
sys.stdout.write("\tarc_summary -a\n")
sys.stdout.write("\tarc_summary -p 4\n")
sys.stdout.write("\tarc_summary -ad\n")
sys.stdout.write("\tarc_summary --page=2\n")
def main():
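The `arc_sizing` entries added in the hunks above report each figure relative to its enclosing limit: the metadata limit as a share of `c_max`, metadata in use as a share of the metadata limit, the dnode limit as a share of the metadata limit, and dnode size as a share of the dnode limit. The sketch below illustrates that nesting. `percent` is a hypothetical stand-in for `fPerc`, whose implementation is not part of this diff, and the sample byte counts are invented.

```python
MIB = 2 ** 20


def percent(part, whole, decimals=2):
    """Format part/whole as a percentage string (stand-in for fPerc)."""
    if whole <= 0:
        return f"{0:.{decimals}f}%"
    return f"{100.0 * part / whole:.{decimals}f}%"


def sizing_ratios(stats):
    """Pair each value with the limit the diff compares it against."""
    return {
        "meta_limit": percent(stats["arc_meta_limit"], stats["c_max"]),
        "meta_size": percent(stats["arc_meta_used"], stats["arc_meta_limit"]),
        "dnode_limit": percent(stats["arc_dnode_limit"],
                               stats["arc_meta_limit"]),
        "dnode_size": percent(stats["dnode_size"], stats["arc_dnode_limit"]),
    }


# Invented sample values, not taken from a real system.
SAMPLE = {
    "c_max": 8192 * MIB,
    "arc_meta_limit": 6144 * MIB,
    "arc_meta_used": 1536 * MIB,
    "arc_dnode_limit": 768 * MIB,
    "dnode_size": 192 * MIB,
}

ratios = sizing_ratios(SAMPLE)
```

Because every percentage has a different denominator, the four values are not expected to sum to 100%. Each one answers the question of how close a given consumer is to its own limit.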
+875
@@ -0,0 +1,875 @@
#!/usr/bin/python3
#
# Copyright (c) 2008 Ben Rockwood <benr@cuddletech.com>,
# Copyright (c) 2010 Martin Matuska <mm@FreeBSD.org>,
# Copyright (c) 2010-2011 Jason J. Hellenthal <jhell@DataIX.net>,
# Copyright (c) 2017 Scot W. Stevenson <scot.stevenson@gmail.com>
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
#
# 1. Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
#
# THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS ``AS IS'' AND
# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
# ARE DISCLAIMED. IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
# OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
# HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
# LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
# OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
# SUCH DAMAGE.
"""Print statistics on the ZFS ARC Cache and other information
Provides basic information on the ARC, its efficiency, the L2ARC (if present),
the Data Management Unit (DMU), Virtual Devices (VDEVs), and tunables. See
the in-source documentation and code at
https://github.com/zfsonlinux/zfs/blob/master/module/zfs/arc.c for details.
The original introduction to arc_summary can be found at
http://cuddletech.com/?p=454
"""
import argparse
import os
import subprocess
import sys
import time
DECRIPTION = 'Print ARC and other statistics for ZFS on Linux'
INDENT = ' '*8
LINE_LENGTH = 72
PROC_PATH = '/proc/spl/kstat/zfs/'
SPL_PATH = '/sys/module/spl/parameters/'
TUNABLES_PATH = '/sys/module/zfs/parameters/'
DATE_FORMAT = '%a %b %d %H:%M:%S %Y'
TITLE = 'ZFS Subsystem Report'
SECTIONS = 'arc archits dmu l2arc spl tunables vdev zil'.split()
SECTION_HELP = 'print info from one section ('+' '.join(SECTIONS)+')'
# Tunables and SPL are handled separately because they come from
# different sources
SECTION_PATHS = {'arc': 'arcstats',
'dmu': 'dmu_tx',
'l2arc': 'arcstats', # L2ARC stuff lives in arcstats
'vdev': 'vdev_cache_stats',
'xuio': 'xuio_stats',
'zfetch': 'zfetchstats',
'zil': 'zil'}
parser = argparse.ArgumentParser(description=DECRIPTION)
parser.add_argument('-a', '--alternate', action='store_true', default=False,
help='use alternate formatting for tunables and SPL',
dest='alt')
parser.add_argument('-d', '--description', action='store_true', default=False,
help='print descriptions with tunables and SPL',
dest='desc')
parser.add_argument('-g', '--graph', action='store_true', default=False,
help='print graph on ARC use and exit', dest='graph')
parser.add_argument('-p', '--page', type=int, dest='page',
help='print page by number (DEPRECATED, use "-s")')
parser.add_argument('-r', '--raw', action='store_true', default=False,
help='dump all available data with minimal formatting',
dest='raw')
parser.add_argument('-s', '--section', dest='section', help=SECTION_HELP)
ARGS = parser.parse_args()
def cleanup_line(single_line):
"""Format a raw line of data from /proc and isolate the name and value
parts, returning a tuple with both. Currently, this gets rid of the
middle '4'. For example "arc_no_grow 4 0" returns the tuple
("arc_no_grow", "0").
"""
name, _, value = single_line.split()
return name, value
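The split described in the docstring above can be sketched on its own. This is a minimal illustration, assuming every data line has exactly three whitespace-separated fields; the helper name `parse_kstat_line` and the sample lines are hypothetical and not taken from a real system.

```python
def parse_kstat_line(single_line):
    """Split a 'name type value' kstat line and drop the middle type field."""
    name, _, value = single_line.split()
    return name, value


# Sample lines are illustrative, modeled on the docstring's example.
sample = ["arc_no_grow 4 0\n", "size 4 1073741824\n"]
stats = dict(parse_kstat_line(line) for line in sample)
print(stats)
```

Values stay strings at this stage, which is why later code wraps them in int() or float() before doing arithmetic.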
def draw_graph(kstats_dict):
"""Draw a primitive graph representing the basic information on the
ARC -- its size and the proportion used by MFU and MRU -- and quit.
We use max size of the ARC to calculate how full it is. This is a
very rough representation.
"""
arc_stats = isolate_section('arcstats', kstats_dict)
GRAPH_INDENT = ' '*4
GRAPH_WIDTH = 60
arc_size = f_bytes(arc_stats['size'])
arc_perc = f_perc(arc_stats['size'], arc_stats['c_max'])
mfu_size = f_bytes(arc_stats['mfu_size'])
mru_size = f_bytes(arc_stats['mru_size'])
meta_limit = f_bytes(arc_stats['arc_meta_limit'])
meta_size = f_bytes(arc_stats['arc_meta_used'])
dnode_limit = f_bytes(arc_stats['arc_dnode_limit'])
dnode_size = f_bytes(arc_stats['dnode_size'])
info_form = ('ARC: {0} ({1}) MFU: {2} MRU: {3} META: {4} ({5}) '
'DNODE {6} ({7})')
info_line = info_form.format(arc_size, arc_perc, mfu_size, mru_size,
meta_size, meta_limit, dnode_size,
dnode_limit)
info_spc = ' '*int((GRAPH_WIDTH-len(info_line))/2)
info_line = GRAPH_INDENT+info_spc+info_line
graph_line = GRAPH_INDENT+'+'+('-'*(GRAPH_WIDTH-2))+'+'
mfu_perc = float(int(arc_stats['mfu_size'])/int(arc_stats['c_max']))
mru_perc = float(int(arc_stats['mru_size'])/int(arc_stats['c_max']))
arc_perc = float(int(arc_stats['size'])/int(arc_stats['c_max']))
total_ticks = float(arc_perc)*GRAPH_WIDTH
mfu_ticks = mfu_perc*GRAPH_WIDTH
mru_ticks = mru_perc*GRAPH_WIDTH
other_ticks = total_ticks-(mfu_ticks+mru_ticks)
core_form = 'F'*int(mfu_ticks)+'R'*int(mru_ticks)+'O'*int(other_ticks)
core_spc = ' '*(GRAPH_WIDTH-(2+len(core_form)))
core_line = GRAPH_INDENT+'|'+core_form+core_spc+'|'
for line in ('', info_line, graph_line, core_line, graph_line, ''):
print(line)
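The tick arithmetic behind the graph can be tried in isolation. The sketch below is an assumption-laden reduction of draw_graph(): the function name `arc_bar` and the sample numbers are hypothetical, and it takes integers directly instead of kstat strings.

```python
def arc_bar(size, mfu_size, mru_size, c_max, width=60):
    """Build the '|FFFRRROOO   |' bar line from raw byte counts."""
    total_ticks = size / c_max * width
    mfu_ticks = mfu_size / c_max * width
    mru_ticks = mru_size / c_max * width
    # Whatever is in the ARC but in neither list is drawn as "other"
    other_ticks = total_ticks - (mfu_ticks + mru_ticks)
    core = ("F" * int(mfu_ticks) + "R" * int(mru_ticks)
            + "O" * int(other_ticks))
    padding = " " * (width - (2 + len(core)))
    return "|" + core + padding + "|"


# Illustrative numbers only: a half-full ARC with a 128-byte maximum.
bar = arc_bar(size=64, mfu_size=32, mru_size=16, c_max=128)
print(bar)
```

Because each tick count is truncated with int() separately, the drawn segments can add up to slightly less than the total, which matches the docstring's warning that this is a very rough representation.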
def f_bytes(byte_string):
"""Return human-readable representation of a byte value in
powers of 2 (eg "KiB" for "kibibytes", etc) to one decimal
place. Values smaller than one KiB are returned without
decimal places. Note "bytes" is the name of a built-in type, hence "bites".
"""
prefixes = ([2**80, "YiB"], # yobibytes (yotta)
[2**70, "ZiB"], # zebibytes (zetta)
[2**60, "EiB"], # exbibytes (exa)
[2**50, "PiB"], # pebibytes (peta)
[2**40, "TiB"], # tebibytes (tera)
[2**30, "GiB"], # gibibytes (giga)
[2**20, "MiB"], # mebibytes (mega)
[2**10, "KiB"]) # kibibytes (kilo)
bites = int(byte_string)
if bites >= 2**10:
for limit, unit in prefixes:
if bites >= limit:
value = bites / limit
break
result = '{0:.1f} {1}'.format(value, unit)
else:
result = '{0} Bytes'.format(bites)
return result
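A shortened sketch of the same prefix lookup shows how the largest matching power of two wins. The name `format_bytes` is hypothetical, and the table is cut down to four units for brevity; it is not a replacement for f_bytes().

```python
def format_bytes(byte_string):
    """Format a byte count with binary (power-of-two) prefixes."""
    prefixes = ((2**40, "TiB"), (2**30, "GiB"),
                (2**20, "MiB"), (2**10, "KiB"))
    count = int(byte_string)
    # The table is ordered largest first, so the first hit is the best unit
    for limit, unit in prefixes:
        if count >= limit:
            return "{0:.1f} {1}".format(count / limit, unit)
    return "{0} Bytes".format(count)


print(format_bytes("512"))       # below one KiB, no decimals
print(format_bytes("1536"))      # 1536 / 1024
print(format_bytes(3 * 2**30))   # exactly three GiB
```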
def f_hits(hits_string):
"""Create a human-readable representation of the number of hits.
The single-letter symbols used are SI to avoid the confusion caused
by the different "short scale" and "long scale" representations in
English, which use the same words for different values. See
https://en.wikipedia.org/wiki/Names_of_large_numbers and:
https://physics.nist.gov/cuu/Units/prefixes.html
"""
numbers = ([10**24, 'Y'], # yotta (septillion)
[10**21, 'Z'], # zetta (sextillion)
[10**18, 'E'], # exa (quintillion)
[10**15, 'P'], # peta (quadrillion)
[10**12, 'T'], # tera (trillion)
[10**9, 'G'], # giga (billion)
[10**6, 'M'], # mega (million)
[10**3, 'k']) # kilo (thousand)
hits = int(hits_string)
if hits >= 1000:
for limit, symbol in numbers:
if hits >= limit:
value = hits/limit
break
result = "%0.1f%s" % (value, symbol)
else:
result = "%d" % hits
return result
def f_perc(value1, value2):
"""Calculate percentage and return in human-readable form. If
rounding produces the result '0.0' though the first number is
not zero, include a 'less-than' symbol to avoid confusion.
Division by zero is handled by returning 'n/a'; no error
is raised.
"""
v1 = float(value1)
v2 = float(value2)
try:
perc = 100 * v1/v2
except ZeroDivisionError:
result = 'n/a'
else:
result = '{0:0.1f} %'.format(perc)
if result == '0.0 %' and v1 > 0:
result = '< 0.1 %'
return result
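The three cases the docstring names (normal value, rounding to zero, division by zero) can be checked with a standalone copy. The name `format_percent` is hypothetical; the logic follows f_perc() above and relies on Python raising ZeroDivisionError for float division by zero.

```python
def format_percent(value1, value2):
    """Return value1 as a percentage of value2, or 'n/a' if value2 is 0."""
    v1 = float(value1)
    v2 = float(value2)
    try:
        perc = 100 * v1 / v2
    except ZeroDivisionError:
        return "n/a"
    result = "{0:0.1f} %".format(perc)
    # A non-zero count that rounds to 0.0 would look like "nothing"
    if result == "0.0 %" and v1 > 0:
        result = "< 0.1 %"
    return result


print(format_percent("1", "4"))
print(format_percent(1, 100000))
print(format_percent(1, 0))
```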
def format_raw_line(name, value):
"""For the --raw option for the tunable and SPL outputs, decide on the
correct formatting based on the --alternate flag.
"""
if ARGS.alt:
result = '{0}{1}={2}'.format(INDENT, name, value)
else:
spc = LINE_LENGTH-(len(INDENT)+len(value))
result = '{0}{1:<{spc}}{2}'.format(INDENT, name, value, spc=spc)
return result
def get_kstats():
"""Collect information on the ZFS subsystem from the /proc Linux virtual
file system. This step does not perform any further processing, giving us
the option to only work on what is actually needed. The name "kstat" is a
holdover from the Solaris utility of the same name.
"""
result = {}
secs = SECTION_PATHS.values()
for section in secs:
with open(PROC_PATH+section, 'r') as proc_location:
lines = [line for line in proc_location]
del lines[0:2] # Get rid of header
result[section] = lines
return result
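The header handling in get_kstats() can be exercised without a ZFS system by reading from an in-memory file. This is a sketch under assumptions: `read_kstat_lines` is a hypothetical helper, and the sample text only imitates the general shape of a kstat file (two header lines, then data).

```python
import io


def read_kstat_lines(handle):
    """Return the data lines of a kstat file, without its two header lines."""
    lines = list(handle)
    del lines[0:2]  # Get rid of header
    return lines


# Illustrative content, not copied from a real /proc/spl/kstat/zfs file.
sample = io.StringIO(
    "13 1 0x01 96 26112 1234 5678\n"
    "name type data\n"
    "hits 4 100\n"
    "misses 4 25\n"
)
print(read_kstat_lines(sample))
```

The returned lines keep their trailing newlines; cleanup_line() copes with that because str.split() discards surrounding whitespace.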
def get_spl_tunables(PATH):
"""Collect information on the Solaris Porting Layer (SPL) or the
tunables, depending on the PATH given. Does not check if PATH is
legal.
"""
result = {}
parameters = os.listdir(PATH)
for name in parameters:
with open(PATH+name, 'r') as para_file:
value = para_file.read()
result[name] = value.strip()
return result
def get_descriptions(request):
"""Get the descriptions of the Solaris Porting Layer (SPL) or the
tunables, return with minimal formatting.
"""
if request not in ('spl', 'zfs'):
print('ERROR: description of "{0}" requested'.format(request))
sys.exit(1)
descs = {}
target_prefix = 'parm:'
# We would prefer to do this with /sys/module -- see the discussion at
# get_version() -- but there isn't a way to get the descriptions from
# there, so we fall back on modinfo
command = ["/sbin/modinfo", request, "-0"]
# The recommended way to do this is with subprocess.run(). However,
# some installed versions of Python are < 3.5, so we offer them
# the option of doing it the old way (for now)
info = ''
try:
if 'run' in dir(subprocess):
info = subprocess.run(command, stdout=subprocess.PIPE,
check=True, universal_newlines=True)
raw_output = info.stdout.split('\0')
else:
info = subprocess.check_output(command, universal_newlines=True)
raw_output = info.split('\0')
except subprocess.CalledProcessError:
print("Error: Descriptions not available (can't access kernel module)")
sys.exit(1)
for line in raw_output:
if not line.startswith(target_prefix):
continue
line = line[len(target_prefix):].strip()
name, raw_desc = line.split(':', 1)
desc = raw_desc.rsplit('(', 1)[0]
if desc == '':
desc = '(No description found)'
descs[name.strip()] = desc.strip()
return descs
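The parsing loop above can be tested against a canned string instead of a live modinfo call. In this sketch the function name `parse_parm_descriptions` and the sample fields (including the module path and the second parameter) are hypothetical. Unlike the original, it strips the description before checking whether it is empty.

```python
def parse_parm_descriptions(modinfo_output):
    """Map parameter names to descriptions from NUL-separated modinfo text."""
    descs = {}
    prefix = "parm:"
    for field in modinfo_output.split("\0"):
        if not field.startswith(prefix):
            continue
        field = field[len(prefix):].strip()
        name, raw_desc = field.split(":", 1)
        # Drop the trailing "(type)" annotation from the description
        desc = raw_desc.rsplit("(", 1)[0].strip()
        descs[name.strip()] = desc or "(No description found)"
    return descs


# Illustrative input only; real "modinfo -0" output has many more fields.
sample = ("filename:       /lib/modules/zfs.ko\0"
          "parm:           zfs_arc_max:Max arc size (ulong)\0"
          "parm:           zfs_example_flag: (int)\0")
print(parse_parm_descriptions(sample))
```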
def get_version(request):
"""Get the version number of ZFS or SPL on this machine for header.
Returns an error string, but does not raise an error, if we can't
get the ZFS/SPL version from /sys/module.
"""
if request not in ('spl', 'zfs'):
error_msg = '(ERROR: "{0}" requested)'.format(request)
return error_msg
# The original arc_summary called /sbin/modinfo {spl,zfs} to get
# the version information. We switch to /sys/module/{spl,zfs}/version
# to make sure we get what is really loaded in the kernel
command = ["cat", "/sys/module/{0}/version".format(request)]
req = request.upper()
version = "(Can't get {0} version)".format(req)
# The recommended way to do this is with subprocess.run(). However,
# some installed versions of Python are < 3.5, so we offer them
# the option of doing it the old way (for now)
info = ''
if 'run' in dir(subprocess):
info = subprocess.run(command, stdout=subprocess.PIPE,
universal_newlines=True)
version = info.stdout.strip()
else:
info = subprocess.check_output(command, universal_newlines=True)
version = info.strip()
return version
def print_header():
"""Print the initial heading with date and time as well as info on the
Linux and ZFS versions. This is not called for the graph.
"""
# datetime is now recommended over time but we keep the exact formatting
# from the older version of arc_summary in case there are scripts
# that expect it in this way
daydate = time.strftime(DATE_FORMAT)
spc_date = LINE_LENGTH-len(daydate)
sys_version = os.uname()
sys_msg = sys_version.sysname+' '+sys_version.release
zfs = get_version('zfs')
spc_zfs = LINE_LENGTH-len(zfs)
machine_msg = 'Machine: '+sys_version.nodename+' ('+sys_version.machine+')'
spl = get_version('spl')
spc_spl = LINE_LENGTH-len(spl)
print('\n'+('-'*LINE_LENGTH))
print('{0:<{spc}}{1}'.format(TITLE, daydate, spc=spc_date))
print('{0:<{spc}}{1}'.format(sys_msg, zfs, spc=spc_zfs))
print('{0:<{spc}}{1}\n'.format(machine_msg, spl, spc=spc_spl))
def print_raw(kstats_dict):
"""Print all available data from the system in a minimally sorted format.
This can be used as a source to be piped through 'grep'.
"""
sections = sorted(kstats_dict.keys())
for section in sections:
print('\n{0}:'.format(section.upper()))
lines = sorted(kstats_dict[section])
for line in lines:
name, value = cleanup_line(line)
print(format_raw_line(name, value))
# Tunables and SPL must be handled separately because they come from a
# different source and have descriptions the user might request
print()
section_spl()
section_tunables()
def isolate_section(section_name, kstats_dict):
"""From the complete information on all sections, retrieve only those
for one section.
"""
try:
section_data = kstats_dict[section_name]
except KeyError:
print('ERROR: Data on {0} not available'.format(section_name))
sys.exit(1)
section_dict = dict(cleanup_line(l) for l in section_data)
return section_dict
# Formatted output helper functions
def prt_1(text, value):
"""Print text and one value, no indent"""
spc = ' '*(LINE_LENGTH-(len(text)+len(value)))
print('{0}{spc}{1}'.format(text, value, spc=spc))
def prt_i1(text, value):
"""Print text and one value, with indent"""
spc = ' '*(LINE_LENGTH-(len(INDENT)+len(text)+len(value)))
print(INDENT+'{0}{spc}{1}'.format(text, value, spc=spc))
def prt_2(text, value1, value2):
"""Print text and two values, no indent"""
values = '{0:>9} {1:>9}'.format(value1, value2)
spc = ' '*(LINE_LENGTH-(len(text)+len(values)+2))
print('{0}{spc} {1}'.format(text, values, spc=spc))
def prt_i2(text, value1, value2):
"""Print text and two values, with indent"""
values = '{0:>9} {1:>9}'.format(value1, value2)
spc = ' '*(LINE_LENGTH-(len(INDENT)+len(text)+len(values)+2))
print(INDENT+'{0}{spc} {1}'.format(text, values, spc=spc))
# The section output concentrates on important parameters instead of
# being exhaustive (that is what the --raw parameter is for)
def section_arc(kstats_dict):
"""Give basic information on the ARC, MRU and MFU. This is the first
and most used section.
"""
arc_stats = isolate_section('arcstats', kstats_dict)
throttle = arc_stats['memory_throttle_count']
if throttle == '0':
health = 'HEALTHY'
else:
health = 'THROTTLED'
prt_1('ARC status:', health)
prt_i1('Memory throttle count:', throttle)
print()
arc_size = arc_stats['size']
arc_target_size = arc_stats['c']
arc_max = arc_stats['c_max']
arc_min = arc_stats['c_min']
mfu_size = arc_stats['mfu_size']
mru_size = arc_stats['mru_size']
meta_limit = arc_stats['arc_meta_limit']
meta_size = arc_stats['arc_meta_used']
dnode_limit = arc_stats['arc_dnode_limit']
dnode_size = arc_stats['dnode_size']
target_size_ratio = '{0}:1'.format(int(arc_max) // int(arc_min))
prt_2('ARC size (current):',
f_perc(arc_size, arc_max), f_bytes(arc_size))
prt_i2('Target size (adaptive):',
f_perc(arc_target_size, arc_max), f_bytes(arc_target_size))
prt_i2('Min size (hard limit):',
f_perc(arc_min, arc_max), f_bytes(arc_min))
prt_i2('Max size (high water):',
target_size_ratio, f_bytes(arc_max))
caches_size = int(mfu_size)+int(mru_size)
prt_i2('Most Frequently Used (MFU) cache size:',
f_perc(mfu_size, caches_size), f_bytes(mfu_size))
prt_i2('Most Recently Used (MRU) cache size:',
f_perc(mru_size, caches_size), f_bytes(mru_size))
prt_i2('Metadata cache size (hard limit):',
f_perc(meta_limit, arc_max), f_bytes(meta_limit))
prt_i2('Metadata cache size (current):',
f_perc(meta_size, meta_limit), f_bytes(meta_size))
prt_i2('Dnode cache size (hard limit):',
f_perc(dnode_limit, meta_limit), f_bytes(dnode_limit))
prt_i2('Dnode cache size (current):',
f_perc(dnode_size, dnode_limit), f_bytes(dnode_size))
print()
print('ARC hash breakdown:')
prt_i1('Elements max:', f_hits(arc_stats['hash_elements_max']))
prt_i2('Elements current:',
f_perc(arc_stats['hash_elements'], arc_stats['hash_elements_max']),
f_hits(arc_stats['hash_elements']))
prt_i1('Collisions:', f_hits(arc_stats['hash_collisions']))
prt_i1('Chain max:', f_hits(arc_stats['hash_chain_max']))
prt_i1('Chains:', f_hits(arc_stats['hash_chains']))
print()
print('ARC misc:')
prt_i1('Deleted:', f_hits(arc_stats['deleted']))
prt_i1('Mutex misses:', f_hits(arc_stats['mutex_miss']))
prt_i1('Eviction skips:', f_hits(arc_stats['evict_skip']))
print()
def section_archits(kstats_dict):
"""Print information on how the caches are accessed ("arc hits").
"""
arc_stats = isolate_section('arcstats', kstats_dict)
all_accesses = int(arc_stats['hits'])+int(arc_stats['misses'])
actual_hits = int(arc_stats['mfu_hits'])+int(arc_stats['mru_hits'])
prt_1('ARC total accesses (hits + misses):', f_hits(all_accesses))
ta_todo = (('Cache hit ratio:', arc_stats['hits']),
('Cache miss ratio:', arc_stats['misses']),
('Actual hit ratio (MFU + MRU hits):', actual_hits))
for title, value in ta_todo:
prt_i2(title, f_perc(value, all_accesses), f_hits(value))
dd_total = int(arc_stats['demand_data_hits']) +\
int(arc_stats['demand_data_misses'])
prt_i2('Data demand efficiency:',
f_perc(arc_stats['demand_data_hits'], dd_total),
f_hits(dd_total))
dp_total = int(arc_stats['prefetch_data_hits']) +\
int(arc_stats['prefetch_data_misses'])
prt_i2('Data prefetch efficiency:',
f_perc(arc_stats['prefetch_data_hits'], dp_total),
f_hits(dp_total))
known_hits = int(arc_stats['mfu_hits']) +\
int(arc_stats['mru_hits']) +\
int(arc_stats['mfu_ghost_hits']) +\
int(arc_stats['mru_ghost_hits'])
anon_hits = int(arc_stats['hits'])-known_hits
print()
print('Cache hits by cache type:')
cl_todo = (('Most frequently used (MFU):', arc_stats['mfu_hits']),
('Most recently used (MRU):', arc_stats['mru_hits']),
('Most frequently used (MFU) ghost:',
arc_stats['mfu_ghost_hits']),
('Most recently used (MRU) ghost:',
arc_stats['mru_ghost_hits']))
for title, value in cl_todo:
prt_i2(title, f_perc(value, arc_stats['hits']), f_hits(value))
# For some reason, anon_hits can turn negative, which is weird. Until we
# have figured out why this happens, we just hide the problem, following
# the behavior of the original arc_summary.
if anon_hits >= 0:
prt_i2('Anonymously used:',
f_perc(anon_hits, arc_stats['hits']), f_hits(anon_hits))
print()
print('Cache hits by data type:')
dt_todo = (('Demand data:', arc_stats['demand_data_hits']),
('Demand prefetch data:', arc_stats['prefetch_data_hits']),
('Demand metadata:', arc_stats['demand_metadata_hits']),
('Demand prefetch metadata:',
arc_stats['prefetch_metadata_hits']))
for title, value in dt_todo:
prt_i2(title, f_perc(value, arc_stats['hits']), f_hits(value))
print()
print('Cache misses by data type:')
dm_todo = (('Demand data:', arc_stats['demand_data_misses']),
('Demand prefetch data:',
arc_stats['prefetch_data_misses']),
('Demand metadata:', arc_stats['demand_metadata_misses']),
('Demand prefetch metadata:',
arc_stats['prefetch_metadata_misses']))
for title, value in dm_todo:
prt_i2(title, f_perc(value, arc_stats['misses']), f_hits(value))
print()
def section_dmu(kstats_dict):
"""Collect information on the DMU"""
zfetch_stats = isolate_section('zfetchstats', kstats_dict)
zfetch_access_total = int(zfetch_stats['hits'])+int(zfetch_stats['misses'])
prt_1('DMU prefetch efficiency:', f_hits(zfetch_access_total))
prt_i2('Hit ratio:', f_perc(zfetch_stats['hits'], zfetch_access_total),
f_hits(zfetch_stats['hits']))
prt_i2('Miss ratio:', f_perc(zfetch_stats['misses'], zfetch_access_total),
f_hits(zfetch_stats['misses']))
print()
def section_l2arc(kstats_dict):
"""Collect information on L2ARC device if present. If not, tell user
that we're skipping the section.
"""
# The L2ARC statistics live in the same section as the normal ARC stuff
arc_stats = isolate_section('arcstats', kstats_dict)
if arc_stats['l2_size'] == '0':
print('L2ARC not detected, skipping section\n')
return
l2_errors = int(arc_stats['l2_writes_error']) +\
int(arc_stats['l2_cksum_bad']) +\
int(arc_stats['l2_io_error'])
l2_access_total = int(arc_stats['l2_hits'])+int(arc_stats['l2_misses'])
health = 'HEALTHY'
if l2_errors > 0:
health = 'DEGRADED'
prt_1('L2ARC status:', health)
l2_todo = (('Low memory aborts:', 'l2_abort_lowmem'),
('Free on write:', 'l2_free_on_write'),
('R/W clashes:', 'l2_rw_clash'),
('Bad checksums:', 'l2_cksum_bad'),
('I/O errors:', 'l2_io_error'))
for title, value in l2_todo:
prt_i1(title, f_hits(arc_stats[value]))
print()
prt_1('L2ARC size (adaptive):', f_bytes(arc_stats['l2_size']))
prt_i2('Compressed:', f_perc(arc_stats['l2_asize'], arc_stats['l2_size']),
f_bytes(arc_stats['l2_asize']))
prt_i2('Header size:',
f_perc(arc_stats['l2_hdr_size'], arc_stats['l2_size']),
f_bytes(arc_stats['l2_hdr_size']))
print()
prt_1('L2ARC breakdown:', f_hits(l2_access_total))
prt_i2('Hit ratio:',
f_perc(arc_stats['l2_hits'], l2_access_total),
f_hits(arc_stats['l2_hits']))
prt_i2('Miss ratio:',
f_perc(arc_stats['l2_misses'], l2_access_total),
f_hits(arc_stats['l2_misses']))
prt_i1('Feeds:', f_hits(arc_stats['l2_feeds']))
print()
print('L2ARC writes:')
if arc_stats['l2_writes_done'] != arc_stats['l2_writes_sent']:
prt_i2('Writes sent:', 'FAULTED', f_hits(arc_stats['l2_writes_sent']))
prt_i2('Done ratio:',
f_perc(arc_stats['l2_writes_done'],
arc_stats['l2_writes_sent']),
f_hits(arc_stats['l2_writes_done']))
prt_i2('Error ratio:',
f_perc(arc_stats['l2_writes_error'],
arc_stats['l2_writes_sent']),
f_hits(arc_stats['l2_writes_error']))
else:
prt_i2('Writes sent:', '100 %', f_hits(arc_stats['l2_writes_sent']))
print()
print('L2ARC evicts:')
prt_i1('Lock retries:', f_hits(arc_stats['l2_evict_lock_retry']))
prt_i1('Upon reading:', f_hits(arc_stats['l2_evict_reading']))
print()
def section_spl(*_):
"""Print the SPL parameters, if requested with alternative format
and/or descriptions. This does not use kstats.
"""
spls = get_spl_tunables(SPL_PATH)
keylist = sorted(spls.keys())
print('Solaris Porting Layer (SPL):')
if ARGS.desc:
descriptions = get_descriptions('spl')
for key in keylist:
value = spls[key]
if ARGS.desc:
try:
print(INDENT+'#', descriptions[key])
except KeyError:
print(INDENT+'# (No description found)') # paranoid
print(format_raw_line(key, value))
print()
def section_tunables(*_):
"""Print the tunables, if requested with alternative format and/or
descriptions. This does not use kstats.
"""
tunables = get_spl_tunables(TUNABLES_PATH)
keylist = sorted(tunables.keys())
print('Tunables:')
if ARGS.desc:
descriptions = get_descriptions('zfs')
for key in keylist:
value = tunables[key]
if ARGS.desc:
try:
print(INDENT+'#', descriptions[key])
except KeyError:
print(INDENT+'# (No description found)') # paranoid
print(format_raw_line(key, value))
print()
def section_vdev(kstats_dict):
"""Collect information on VDEV caches"""
# Currently [Nov 2017] the VDEV cache is disabled, because it is actually
# harmful. When this is the case, we just skip the whole entry. See
# https://github.com/zfsonlinux/zfs/blob/master/module/zfs/vdev_cache.c
# for details
tunables = get_spl_tunables(TUNABLES_PATH)
if tunables['zfs_vdev_cache_size'] == '0':
print('VDEV cache disabled, skipping section\n')
return
vdev_stats = isolate_section('vdev_cache_stats', kstats_dict)
vdev_cache_total = int(vdev_stats['hits']) +\
int(vdev_stats['misses']) +\
int(vdev_stats['delegations'])
prt_1('VDEV cache summary:', f_hits(vdev_cache_total))
prt_i2('Hit ratio:', f_perc(vdev_stats['hits'], vdev_cache_total),
f_hits(vdev_stats['hits']))
prt_i2('Miss ratio:', f_perc(vdev_stats['misses'], vdev_cache_total),
f_hits(vdev_stats['misses']))
prt_i2('Delegations:', f_perc(vdev_stats['delegations'], vdev_cache_total),
f_hits(vdev_stats['delegations']))
print()
def section_zil(kstats_dict):
"""Collect information on the ZFS Intent Log. Some of the information
is taken from https://github.com/zfsonlinux/zfs/blob/master/include/sys/zil.h
"""
zil_stats = isolate_section('zil', kstats_dict)
prt_1('ZIL committed transactions:',
f_hits(zil_stats['zil_itx_count']))
prt_i1('Commit requests:', f_hits(zil_stats['zil_commit_count']))
prt_i1('Flushes to stable storage:',
f_hits(zil_stats['zil_commit_writer_count']))
prt_i2('Transactions to SLOG storage pool:',
f_bytes(zil_stats['zil_itx_metaslab_slog_bytes']),
f_hits(zil_stats['zil_itx_metaslab_slog_count']))
prt_i2('Transactions to non-SLOG storage pool:',
f_bytes(zil_stats['zil_itx_metaslab_normal_bytes']),
f_hits(zil_stats['zil_itx_metaslab_normal_count']))
print()
section_calls = {'arc': section_arc,
'archits': section_archits,
'dmu': section_dmu,
'l2arc': section_l2arc,
'spl': section_spl,
'tunables': section_tunables,
'vdev': section_vdev,
'zil': section_zil}
def main():
"""Run program. The options to draw a graph and to print all data raw are
treated separately because they come with their own call.
"""
kstats = get_kstats()
if ARGS.graph:
draw_graph(kstats)
sys.exit(0)
print_header()
if ARGS.raw:
print_raw(kstats)
elif ARGS.section:
try:
section_calls[ARGS.section](kstats)
except KeyError:
print('Error: Section "{0}" unknown'.format(ARGS.section))
sys.exit(1)
elif ARGS.page:
print('WARNING: Pages are deprecated, please use "--section"\n')
pages_to_calls = {1: 'arc',
2: 'archits',
3: 'l2arc',
4: 'dmu',
5: 'vdev',
6: 'tunables'}
try:
call = pages_to_calls[ARGS.page]
except KeyError:
print('Error: Page "{0}" not supported'.format(ARGS.page))
sys.exit(1)
else:
section_calls[call](kstats)
else:
# If no parameters were given, we print all sections. We might want to
# change the sequence by hand
calls = sorted(section_calls.keys())
for section in calls:
section_calls[section](kstats)
sys.exit(0)
if __name__ == '__main__':
main()
+13 -1
@@ -1 +1,13 @@
dist_bin_SCRIPTS = arcstat.py
dist_bin_SCRIPTS = arcstat
#
# The arcstat script is compatible with both Python 2.6 and 3.4.
# As such the python 3 shebang can be replaced at install time when
# targeting a python 2 system. This allows us to maintain a single
# version of the source.
#
if USING_PYTHON_2
install-exec-hook:
sed --in-place 's|^#!/usr/bin/python3|#!/usr/bin/python2|' \
$(DESTDIR)$(bindir)/arcstat
endif
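For readers less familiar with sed, the substitution in the install hook can be mirrored in Python. This is only an illustration of the effect, not something the build uses; `retarget_shebang` is a hypothetical name. One difference to keep in mind: sed applies the pattern to every line that starts with the shebang text, while this sketch only looks at the very start of the script.

```python
import re


def retarget_shebang(script_text):
    """Rewrite a leading python3 shebang to python2, like the sed hook."""
    return re.sub(r"^#!/usr/bin/python3", "#!/usr/bin/python2",
                  script_text, count=1)


print(retarget_shebang("#!/usr/bin/python3\nimport sys\n"))
```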
+16 -8
@@ -1,4 +1,4 @@
#!/usr/bin/python
#!/usr/bin/python3
#
# Print out ZFS ARC Statistics exported via kstat(1)
# For a definition of fields, or usage, use arcstat -v
@@ -42,7 +42,8 @@
# @hdr is the array of fields that needs to be printed, so we
# just iterate over this array and print the values using our pretty printer.
#
# This script must remain compatible with Python 2.6+ and Python 3.4+.
#
import sys
import time
@@ -71,7 +72,7 @@ cols = {
"pm%": [3, 100, "Prefetch miss percentage"],
"mhit": [4, 1000, "Metadata hits per second"],
"mmis": [4, 1000, "Metadata misses per second"],
"mread": [4, 1000, "Metadata accesses per second"],
"mread": [5, 1000, "Metadata accesses per second"],
"mh%": [3, 100, "Metadata hit percentage"],
"mm%": [3, 100, "Metadata miss percentage"],
"arcsz": [5, 1024, "ARC Size"],
@@ -92,6 +93,9 @@ cols = {
"l2asize": [7, 1024, "Actual (compressed) size of the L2ARC"],
"l2size": [6, 1024, "Size of the L2ARC"],
"l2bytes": [7, 1024, "bytes read per second from the L2ARC"],
"grow": [4, 1000, "ARC Grow disabled"],
"need": [4, 1024, "ARC Reclaim need"],
"free": [4, 1024, "ARC Free memory"],
}
v = {}
@@ -106,7 +110,7 @@ opfile = None
sep = " " # Default separator is 2 spaces
version = "0.4"
l2exist = False
cmd = ("Usage: arcstat.py [-hvx] [-f fields] [-o file] [-s string] [interval "
cmd = ("Usage: arcstat [-hvx] [-f fields] [-o file] [-s string] [interval "
"[count]]\n")
cur = {}
d = {}
@@ -135,10 +139,10 @@ def usage():
sys.stderr.write("\t -s : Override default field separator with custom "
"character or string\n")
sys.stderr.write("\nExamples:\n")
sys.stderr.write("\tarcstat.py -o /tmp/a.log 2 10\n")
sys.stderr.write("\tarcstat.py -s \",\" -o /tmp/a.log 2 10\n")
sys.stderr.write("\tarcstat.py -v\n")
sys.stderr.write("\tarcstat.py -f time,hit%,dh%,ph%,mh% 1\n")
sys.stderr.write("\tarcstat -o /tmp/a.log 2 10\n")
sys.stderr.write("\tarcstat -s \",\" -o /tmp/a.log 2 10\n")
sys.stderr.write("\tarcstat -v\n")
sys.stderr.write("\tarcstat -f time,hit%,dh%,ph%,mh% 1\n")
sys.stderr.write("\n")
sys.exit(1)
@@ -423,6 +427,10 @@ def calculate():
v["l2size"] = cur["l2_size"]
v["l2bytes"] = d["l2_read_bytes"] / sint
v["grow"] = 0 if cur["arc_no_grow"] else 1
v["need"] = cur["arc_need_free"]
v["free"] = cur["arc_sys_free"]
def main():
global sint
+13 -1
View File
@@ -1 +1,13 @@
dist_bin_SCRIPTS = dbufstat.py
dist_bin_SCRIPTS = dbufstat
#
# The dbufstat script is compatible with both Python 2.6 and 3.4.
# As such the python 3 shebang can be replaced at install time when
# targeting a python 2 system. This allows us to maintain a single
# version of the source.
#
if USING_PYTHON_2
install-exec-hook:
sed --in-place 's|^#!/usr/bin/python3|#!/usr/bin/python2|' \
$(DESTDIR)$(bindir)/dbufstat
endif
@@ -1,4 +1,4 @@
#!/usr/bin/python
#!/usr/bin/python3
#
# Print out statistics for all cached dmu buffers. This information
# is available through the dbufs kstat and may be post-processed as
@@ -27,14 +27,17 @@
# Copyright (C) 2013 Lawrence Livermore National Security, LLC.
# Produced at Lawrence Livermore National Laboratory (cf, DISCLAIMER).
#
# This script must remain compatible with Python 2.6+ and Python 3.4+.
#
import sys
import getopt
import errno
import re
bhdr = ["pool", "objset", "object", "level", "blkid", "offset", "dbsize"]
bxhdr = ["pool", "objset", "object", "level", "blkid", "offset", "dbsize",
"meta", "state", "dbholds", "list", "atype", "flags",
"meta", "state", "dbholds", "dbc", "list", "atype", "flags",
"count", "asize", "access", "mru", "gmru", "mfu", "gmfu", "l2",
"l2_dattr", "l2_asize", "l2_comp", "aholds", "dtype", "btype",
"data_bs", "meta_bs", "bsize", "lvls", "dholds", "blocks", "dsize"]
@@ -45,7 +48,7 @@ dxhdr = ["pool", "objset", "object", "dtype", "btype", "data_bs", "meta_bs",
"bsize", "lvls", "dholds", "blocks", "dsize", "cached", "direct",
"indirect", "bonus", "spill"]
dincompat = ["level", "blkid", "offset", "dbsize", "meta", "state", "dbholds",
"list", "atype", "flags", "count", "asize", "access",
"dbc", "list", "atype", "flags", "count", "asize", "access",
"mru", "gmru", "mfu", "gmfu", "l2", "l2_dattr", "l2_asize",
"l2_comp", "aholds"]
@@ -53,7 +56,7 @@ thdr = ["pool", "objset", "dtype", "cached"]
txhdr = ["pool", "objset", "dtype", "cached", "direct", "indirect",
"bonus", "spill"]
tincompat = ["object", "level", "blkid", "offset", "dbsize", "meta", "state",
"dbholds", "list", "atype", "flags", "count", "asize",
"dbc", "dbholds", "list", "atype", "flags", "count", "asize",
"access", "mru", "gmru", "mfu", "gmfu", "l2", "l2_dattr",
"l2_asize", "l2_comp", "aholds", "btype", "data_bs", "meta_bs",
"bsize", "lvls", "dholds", "blocks", "dsize"]
@@ -70,9 +73,10 @@ cols = {
"meta": [4, -1, "is this buffer metadata?"],
"state": [5, -1, "state of buffer (read, cached, etc)"],
"dbholds": [7, 1000, "number of holds on buffer"],
"dbc": [3, -1, "in dbuf cache"],
"list": [4, -1, "which ARC list contains this buffer"],
"atype": [7, -1, "ARC header type (data or metadata)"],
"flags": [8, -1, "ARC read flags"],
"flags": [9, -1, "ARC read flags"],
"count": [5, -1, "ARC data count"],
"asize": [7, 1024, "size of this ARC buffer"],
"access": [10, -1, "time this ARC buffer was last accessed"],
@@ -104,8 +108,8 @@ cols = {
hdr = None
xhdr = None
sep = " " # Default separator is 2 spaces
cmd = ("Usage: dbufstat.py [-bdhrtvx] [-i file] [-f fields] [-o file] "
"[-s string]\n")
cmd = ("Usage: dbufstat [-bdhnrtvx] [-i file] [-f fields] [-o file] "
"[-s string] [-F filter]\n")
raw = 0
@@ -151,6 +155,7 @@ def usage():
sys.stderr.write("\t -b : Print table of information for each dbuf\n")
sys.stderr.write("\t -d : Print table of information for each dnode\n")
sys.stderr.write("\t -h : Print this help message\n")
sys.stderr.write("\t -n : Exclude header from output\n")
sys.stderr.write("\t -r : Print raw values\n")
sys.stderr.write("\t -t : Print table of information for each dnode type"
"\n")
@@ -162,11 +167,13 @@ def usage():
sys.stderr.write("\t -o : Redirect output to the specified file\n")
sys.stderr.write("\t -s : Override default field separator with custom "
"character or string\n")
sys.stderr.write("\t -F : Filter output by value or regex\n")
sys.stderr.write("\nExamples:\n")
sys.stderr.write("\tdbufstat.py -d -o /tmp/d.log\n")
sys.stderr.write("\tdbufstat.py -t -s \",\" -o /tmp/t.log\n")
sys.stderr.write("\tdbufstat.py -v\n")
sys.stderr.write("\tdbufstat.py -d -f pool,object,objset,dsize,cached\n")
sys.stderr.write("\tdbufstat -d -o /tmp/d.log\n")
sys.stderr.write("\tdbufstat -t -s \",\" -o /tmp/t.log\n")
sys.stderr.write("\tdbufstat -v\n")
sys.stderr.write("\tdbufstat -d -f pool,object,objset,dsize,cached\n")
sys.stderr.write("\tdbufstat -bx -F dbc=1,objset=54,pool=testpool\n")
sys.stderr.write("\n")
sys.exit(1)
@@ -228,7 +235,8 @@ def print_header():
def get_typestring(t):
type_strings = ["DMU_OT_NONE",
ot_strings = [
"DMU_OT_NONE",
# general:
"DMU_OT_OBJECT_DIRECTORY",
"DMU_OT_OBJECT_ARRAY",
@@ -291,15 +299,39 @@ def get_typestring(t):
"DMU_OT_DEADLIST_HDR",
"DMU_OT_DSL_CLONES",
"DMU_OT_BPOBJ_SUBOBJ"]
otn_strings = {
0x80: "DMU_OTN_UINT8_DATA",
0xc0: "DMU_OTN_UINT8_METADATA",
0x81: "DMU_OTN_UINT16_DATA",
0xc1: "DMU_OTN_UINT16_METADATA",
0x82: "DMU_OTN_UINT32_DATA",
0xc2: "DMU_OTN_UINT32_METADATA",
0x83: "DMU_OTN_UINT64_DATA",
0xc3: "DMU_OTN_UINT64_METADATA",
0x84: "DMU_OTN_ZAP_DATA",
0xc4: "DMU_OTN_ZAP_METADATA",
0xa0: "DMU_OTN_UINT8_ENC_DATA",
0xe0: "DMU_OTN_UINT8_ENC_METADATA",
0xa1: "DMU_OTN_UINT16_ENC_DATA",
0xe1: "DMU_OTN_UINT16_ENC_METADATA",
0xa2: "DMU_OTN_UINT32_ENC_DATA",
0xe2: "DMU_OTN_UINT32_ENC_METADATA",
0xa3: "DMU_OTN_UINT64_ENC_DATA",
0xe3: "DMU_OTN_UINT64_ENC_METADATA",
0xa4: "DMU_OTN_ZAP_ENC_DATA",
0xe4: "DMU_OTN_ZAP_ENC_METADATA"}
# If "-rr" option is used, don't convert to string representation
if raw > 1:
return "%i" % t
try:
return type_strings[t]
except IndexError:
return "%i" % t
if t < len(ot_strings):
return ot_strings[t]
else:
return otn_strings[t]
except (IndexError, KeyError):
return "(UNKNOWN)"
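A minimal sketch of the two-tier lookup the hunk above introduces: legacy object types index into a list, while the newer DMU_OTN_* types are sparse bit-encoded values kept in a dictionary. The tables here are abbreviated and the names `OT_STRINGS`, `OTN_STRINGS` and `type_name` are illustrative, not part of dbufstat.

```python
# Abbreviated, illustrative tables; the real script lists every type name.
OT_STRINGS = ["DMU_OT_NONE", "DMU_OT_OBJECT_DIRECTORY", "DMU_OT_OBJECT_ARRAY"]
OTN_STRINGS = {0x80: "DMU_OTN_UINT8_DATA", 0xc4: "DMU_OTN_ZAP_METADATA"}


def type_name(t, raw=0):
    """Map a numeric object type to its name, or "(UNKNOWN)" if unmapped."""
    # With "-rr" (raw > 1) the number is printed as-is.
    if raw > 1:
        return "%i" % t
    try:
        # Dense legacy types live in the list, sparse OTN types in the dict.
        if t < len(OT_STRINGS):
            return OT_STRINGS[t]
        return OTN_STRINGS[t]
    except (IndexError, KeyError):
        return "(UNKNOWN)"
```

Compared with the removed code, an unmapped value now yields a fixed "(UNKNOWN)" marker rather than the bare number.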
def get_compstring(c):
@@ -384,12 +416,32 @@ def update_dict(d, k, line, labels):
return d
def print_dict(d):
print_header()
def skip_line(vals, filters):
'''
Determines if a line should be skipped during printing
based on a set of filters
'''
if len(filters) == 0:
return False
for key in vals:
if key in filters:
val = prettynum(cols[key][0], cols[key][1], vals[key]).strip()
# we want a full match here
if re.match("(?:" + filters[key] + r")\Z", val) is None:
return True
return False
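The full-match behaviour of skip_line() can be sketched on its own. This is a simplified illustration: it compares against str(value) instead of the script's prettynum() formatting, and `matches_fully` and `skip_row` are hypothetical names.

```python
import re


def matches_fully(pattern, value):
    """True only if the pattern matches the entire value."""
    # The non-capturing group keeps alternations such as "a|b" inside the
    # anchor, and \Z pins the match to the end of the string.
    return re.match("(?:" + pattern + r")\Z", value) is not None


def skip_row(row, filters):
    """Skip the row if any filtered field fails its pattern."""
    for key, pattern in filters.items():
        if key in row and not matches_fully(pattern, str(row[key])):
            return True
    return False
```

Because re.match() already anchors at the start, adding \Z makes the filter "dbc=1" accept 1 but reject 10 or 11.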
def print_dict(d, filters, noheader):
if not noheader:
print_header()
for pool in list(d.keys()):
for objset in list(d[pool].keys()):
for v in list(d[pool][objset].values()):
print_values(v)
if not skip_line(v, filters):
print_values(v)
def dnodes_build_dict(filehandle):
@@ -430,7 +482,7 @@ def types_build_dict(filehandle):
return types
def buffers_print_all(filehandle):
def buffers_print_all(filehandle, filters, noheader):
labels = dict()
# First 3 lines are header information, skip the first two
@@ -441,11 +493,14 @@ def buffers_print_all(filehandle):
for i, v in enumerate(next(filehandle).split()):
labels[v] = i
print_header()
if not noheader:
print_header()
# The rest of the file is buffer information
for line in filehandle:
print_values(parse_line(line.split(), labels))
vals = parse_line(line.split(), labels)
if not skip_line(vals, filters):
print_values(vals)
def main():
@@ -462,11 +517,13 @@ def main():
tflag = False
vflag = False
xflag = False
nflag = False
filters = dict()
try:
opts, args = getopt.getopt(
sys.argv[1:],
"bdf:hi:o:rs:tvx",
"bdf:hi:o:rs:tvxF:n",
[
"buffers",
"dnodes",
@@ -477,7 +534,8 @@ def main():
"separator",
"types",
"verbose",
"extended"
"extended",
"filter"
]
)
except getopt.error:
@@ -507,6 +565,35 @@ def main():
vflag = True
if opt in ('-x', '--extended'):
xflag = True
if opt in ('-n', '--noheader'):
nflag = True
if opt in ('-F', '--filter'):
fils = [x.strip() for x in arg.split(",")]
for fil in fils:
f = [x.strip() for x in fil.split("=")]
if len(f) != 2:
sys.stderr.write("Invalid filter '%s'.\n" % fil)
sys.exit(1)
if f[0] not in cols:
sys.stderr.write("Invalid field '%s' in filter.\n" % f[0])
sys.exit(1)
if f[0] in filters:
sys.stderr.write("Field '%s' specified multiple times in "
"filter.\n" % f[0])
sys.exit(1)
try:
re.compile("(?:" + f[1] + r")\Z")
except re.error:
sys.stderr.write("Invalid regex for field '%s' in "
"filter.\n" % f[0])
sys.exit(1)
filters[f[0]] = f[1]
if hflag or (xflag and desired_cols):
usage()
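The -F argument parsing added above can be sketched as a standalone function. This is an assumption-laden illustration: `parse_filter_spec` and `KNOWN_FIELDS` are hypothetical names, the field set is an abbreviated stand-in for the script's `cols` table, and errors are raised as ValueError where the script writes to stderr and exits.

```python
import re

# Illustrative subset of field names; the script checks against its cols table.
KNOWN_FIELDS = {"pool", "objset", "object", "dbc"}


def parse_filter_spec(spec, known_fields=KNOWN_FIELDS):
    """Parse "field=regex,field=regex" into a dict of field -> regex."""
    filters = {}
    for item in (x.strip() for x in spec.split(",")):
        parts = [x.strip() for x in item.split("=")]
        if len(parts) != 2:
            raise ValueError("Invalid filter '%s'." % item)
        field, pattern = parts
        if field not in known_fields:
            raise ValueError("Invalid field '%s' in filter." % field)
        if field in filters:
            raise ValueError("Field '%s' specified multiple times." % field)
        try:
            # Validate early so a bad regex fails before any output is printed.
            re.compile("(?:" + pattern + r")\Z")
        except re.error:
            raise ValueError("Invalid regex for field '%s'." % field)
        filters[field] = pattern
    return filters
```

One consequence of splitting on "," and "=" is that a pattern containing either character cannot be expressed in a filter.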
@@ -569,13 +656,13 @@ def main():
sys.exit(1)
if bflag:
buffers_print_all(sys.stdin)
buffers_print_all(sys.stdin, filters, nflag)
if dflag:
print_dict(dnodes_build_dict(sys.stdin))
print_dict(dnodes_build_dict(sys.stdin), filters, nflag)
if tflag:
print_dict(types_build_dict(sys.stdin))
print_dict(types_build_dict(sys.stdin), filters, nflag)
if __name__ == '__main__':
+1 -4
@@ -18,7 +18,4 @@ mount_zfs_SOURCES = \
mount_zfs_LDADD = \
$(top_builddir)/lib/libnvpair/libnvpair.la \
$(top_builddir)/lib/libuutil/libuutil.la \
$(top_builddir)/lib/libzpool/libzpool.la \
$(top_builddir)/lib/libzfs/libzfs.la \
$(top_builddir)/lib/libzfs_core/libzfs_core.la
$(top_builddir)/lib/libzfs/libzfs.la
+2
@@ -31,9 +31,11 @@
#include <sys/mntent.h>
#include <sys/stat.h>
#include <libzfs.h>
#include <libzutil.h>
#include <locale.h>
#include <getopt.h>
#include <fcntl.h>
#include <errno.h>
#define ZS_COMMENT 0x00000000 /* comment */
#define ZS_ZFSUTIL 0x00000001 /* caller is zfs(8) */
+5 -4
@@ -1,7 +1,10 @@
include $(top_srcdir)/config/Rules.am
AM_CFLAGS += $(DEBUG_STACKFLAGS) $(FRAME_LARGER_THAN)
AM_CPPFLAGS += -DDEBUG
# Includes kernel code, generate warnings for large stack frames
AM_CFLAGS += $(FRAME_LARGER_THAN)
# Unconditionally enable ASSERTs
AM_CPPFLAGS += -DDEBUG -UNDEBUG
DEFAULT_INCLUDES += \
-I$(top_srcdir)/include \
@@ -15,8 +18,6 @@ raidz_test_SOURCES = \
raidz_bench.c
raidz_test_LDADD = \
$(top_builddir)/lib/libnvpair/libnvpair.la \
$(top_builddir)/lib/libuutil/libuutil.la \
$(top_builddir)/lib/libzpool/libzpool.la
raidz_test_LDADD += -lm -ldl
+2 -4
@@ -702,10 +702,8 @@ run_sweep(void)
opts->rto_dsize = size_v[s];
opts->rto_v = 0; /* be quiet */
VERIFY3P(zk_thread_create(NULL, 0,
(thread_func_t)sweep_thread,
(void *) opts, 0, NULL, TS_RUN, 0,
PTHREAD_CREATE_JOINABLE), !=, NULL);
VERIFY3P(thread_create(NULL, 0, sweep_thread, (void *) opts,
0, NULL, TS_RUN, defclsyspri), !=, NULL);
}
exit:
+5 -6
@@ -1,6 +1,7 @@
include $(top_srcdir)/config/Rules.am
AM_CPPFLAGS += -DDEBUG
# Unconditionally enable debugging for zdb
AM_CPPFLAGS += -DDEBUG -UNDEBUG
DEFAULT_INCLUDES += \
-I$(top_srcdir)/include \
@@ -10,11 +11,9 @@ sbin_PROGRAMS = zdb
zdb_SOURCES = \
zdb.c \
zdb_il.c
zdb_il.c \
zdb.h
zdb_LDADD = \
$(top_builddir)/lib/libnvpair/libnvpair.la \
$(top_builddir)/lib/libuutil/libuutil.la \
$(top_builddir)/lib/libzpool/libzpool.la \
$(top_builddir)/lib/libzfs/libzfs.la \
$(top_builddir)/lib/libzfs_core/libzfs_core.la
$(top_builddir)/lib/libzpool/libzpool.la
+2080 -350
File diff suppressed because it is too large
+11 -3
@@ -18,8 +18,16 @@
*
* CDDL HEADER END
*/
/*
* Copyright 2017 Spectra Logic Corp Inc. All rights reserved.
* Use is subject to license terms.
*/
#ifndef _LIBSPL_ATTR_H
#define _LIBSPL_ATTR_H
#endif /* _LIBSPL_ATTR_H */
#ifndef _ZDB_H
#define _ZDB_H
void dump_intent_log(zilog_t *);
extern uint8_t dump_opt[256];
#endif /* _ZDB_H */
+92 -66
@@ -25,7 +25,7 @@
*/
/*
* Copyright (c) 2013, 2016 by Delphix. All rights reserved.
* Copyright (c) 2013, 2017 by Delphix. All rights reserved.
*/
/*
@@ -42,11 +42,14 @@
#include <sys/resource.h>
#include <sys/zil.h>
#include <sys/zil_impl.h>
#include <sys/spa_impl.h>
#include <sys/abd.h>
#include "zdb.h"
extern uint8_t dump_opt[256];
static char prefix[4] = "\t\t\t";
static char tab_prefix[4] = "\t\t\t";
static void
print_log_bp(const blkptr_t *bp, const char *prefix)
@@ -59,8 +62,9 @@ print_log_bp(const blkptr_t *bp, const char *prefix)
/* ARGSUSED */
static void
zil_prt_rec_create(zilog_t *zilog, int txtype, lr_create_t *lr)
zil_prt_rec_create(zilog_t *zilog, int txtype, void *arg)
{
lr_create_t *lr = arg;
time_t crtime = lr->lr_crtime[0];
char *name, *link;
lr_attr_t *lrattr;
@@ -75,49 +79,55 @@ zil_prt_rec_create(zilog_t *zilog, int txtype, lr_create_t *lr)
if (txtype == TX_SYMLINK) {
link = name + strlen(name) + 1;
(void) printf("%s%s -> %s\n", prefix, name, link);
(void) printf("%s%s -> %s\n", tab_prefix, name, link);
} else if (txtype != TX_MKXATTR) {
(void) printf("%s%s\n", prefix, name);
(void) printf("%s%s\n", tab_prefix, name);
}
(void) printf("%s%s", prefix, ctime(&crtime));
(void) printf("%sdoid %llu, foid %llu, slots %llu, mode %llo\n", prefix,
(u_longlong_t)lr->lr_doid,
(void) printf("%s%s", tab_prefix, ctime(&crtime));
(void) printf("%sdoid %llu, foid %llu, slots %llu, mode %llo\n",
tab_prefix, (u_longlong_t)lr->lr_doid,
(u_longlong_t)LR_FOID_GET_OBJ(lr->lr_foid),
(u_longlong_t)LR_FOID_GET_SLOTS(lr->lr_foid),
(longlong_t)lr->lr_mode);
(void) printf("%suid %llu, gid %llu, gen %llu, rdev 0x%llx\n", prefix,
(void) printf("%suid %llu, gid %llu, gen %llu, rdev 0x%llx\n",
tab_prefix,
(u_longlong_t)lr->lr_uid, (u_longlong_t)lr->lr_gid,
(u_longlong_t)lr->lr_gen, (u_longlong_t)lr->lr_rdev);
}
/* ARGSUSED */
static void
zil_prt_rec_remove(zilog_t *zilog, int txtype, lr_remove_t *lr)
zil_prt_rec_remove(zilog_t *zilog, int txtype, void *arg)
{
(void) printf("%sdoid %llu, name %s\n", prefix,
lr_remove_t *lr = arg;
(void) printf("%sdoid %llu, name %s\n", tab_prefix,
(u_longlong_t)lr->lr_doid, (char *)(lr + 1));
}
/* ARGSUSED */
static void
zil_prt_rec_link(zilog_t *zilog, int txtype, lr_link_t *lr)
zil_prt_rec_link(zilog_t *zilog, int txtype, void *arg)
{
(void) printf("%sdoid %llu, link_obj %llu, name %s\n", prefix,
lr_link_t *lr = arg;
(void) printf("%sdoid %llu, link_obj %llu, name %s\n", tab_prefix,
(u_longlong_t)lr->lr_doid, (u_longlong_t)lr->lr_link_obj,
(char *)(lr + 1));
}
/* ARGSUSED */
static void
zil_prt_rec_rename(zilog_t *zilog, int txtype, lr_rename_t *lr)
zil_prt_rec_rename(zilog_t *zilog, int txtype, void *arg)
{
lr_rename_t *lr = arg;
char *snm = (char *)(lr + 1);
char *tnm = snm + strlen(snm) + 1;
(void) printf("%ssdoid %llu, tdoid %llu\n", prefix,
(void) printf("%ssdoid %llu, tdoid %llu\n", tab_prefix,
(u_longlong_t)lr->lr_sdoid, (u_longlong_t)lr->lr_tdoid);
(void) printf("%ssrc %s tgt %s\n", prefix, snm, tnm);
(void) printf("%ssrc %s tgt %s\n", tab_prefix, snm, tnm);
}
/* ARGSUSED */
@@ -125,9 +135,8 @@ static int
zil_prt_rec_write_cb(void *data, size_t len, void *unused)
{
char *cdata = data;
int i;
for (i = 0; i < len; i++) {
for (size_t i = 0; i < len; i++) {
if (isprint(*cdata))
(void) printf("%c ", *cdata);
else
@@ -139,15 +148,16 @@ zil_prt_rec_write_cb(void *data, size_t len, void *unused)
/* ARGSUSED */
static void
zil_prt_rec_write(zilog_t *zilog, int txtype, lr_write_t *lr)
zil_prt_rec_write(zilog_t *zilog, int txtype, void *arg)
{
lr_write_t *lr = arg;
abd_t *data;
blkptr_t *bp = &lr->lr_blkptr;
zbookmark_phys_t zb;
int verbose = MAX(dump_opt['d'], dump_opt['i']);
int error;
(void) printf("%sfoid %llu, offset %llx, length %llx\n", prefix,
(void) printf("%sfoid %llu, offset %llx, length %llx\n", tab_prefix,
(u_longlong_t)lr->lr_foid, (u_longlong_t)lr->lr_offset,
(u_longlong_t)lr->lr_length);
@@ -155,20 +165,21 @@ zil_prt_rec_write(zilog_t *zilog, int txtype, lr_write_t *lr)
return;
if (lr->lr_common.lrc_reclen == sizeof (lr_write_t)) {
(void) printf("%shas blkptr, %s\n", prefix,
(void) printf("%shas blkptr, %s\n", tab_prefix,
!BP_IS_HOLE(bp) &&
bp->blk_birth >= spa_first_txg(zilog->zl_spa) ?
bp->blk_birth >= spa_min_claim_txg(zilog->zl_spa) ?
"will claim" : "won't claim");
print_log_bp(bp, prefix);
print_log_bp(bp, tab_prefix);
if (BP_IS_HOLE(bp)) {
(void) printf("\t\t\tLSIZE 0x%llx\n",
(u_longlong_t)BP_GET_LSIZE(bp));
(void) printf("%s<hole>\n", prefix);
(void) printf("%s<hole>\n", tab_prefix);
return;
}
if (bp->blk_birth < zilog->zl_header->zh_claim_txg) {
(void) printf("%s<block already committed>\n", prefix);
(void) printf("%s<block already committed>\n",
tab_prefix);
return;
}
@@ -188,7 +199,7 @@ zil_prt_rec_write(zilog_t *zilog, int txtype, lr_write_t *lr)
abd_copy_from_buf(data, lr + 1, lr->lr_length);
}
(void) printf("%s", prefix);
(void) printf("%s", tab_prefix);
(void) abd_iterate_func(data,
0, MIN(lr->lr_length, (verbose < 6 ? 20 : SPA_MAXBLOCKSIZE)),
zil_prt_rec_write_cb, NULL);
@@ -200,52 +211,55 @@ out:
/* ARGSUSED */
static void
zil_prt_rec_truncate(zilog_t *zilog, int txtype, lr_truncate_t *lr)
zil_prt_rec_truncate(zilog_t *zilog, int txtype, void *arg)
{
(void) printf("%sfoid %llu, offset 0x%llx, length 0x%llx\n", prefix,
lr_truncate_t *lr = arg;
(void) printf("%sfoid %llu, offset 0x%llx, length 0x%llx\n", tab_prefix,
(u_longlong_t)lr->lr_foid, (longlong_t)lr->lr_offset,
(u_longlong_t)lr->lr_length);
}
/* ARGSUSED */
static void
zil_prt_rec_setattr(zilog_t *zilog, int txtype, lr_setattr_t *lr)
zil_prt_rec_setattr(zilog_t *zilog, int txtype, void *arg)
{
lr_setattr_t *lr = arg;
time_t atime = (time_t)lr->lr_atime[0];
time_t mtime = (time_t)lr->lr_mtime[0];
(void) printf("%sfoid %llu, mask 0x%llx\n", prefix,
(void) printf("%sfoid %llu, mask 0x%llx\n", tab_prefix,
(u_longlong_t)lr->lr_foid, (u_longlong_t)lr->lr_mask);
if (lr->lr_mask & AT_MODE) {
(void) printf("%sAT_MODE %llo\n", prefix,
(void) printf("%sAT_MODE %llo\n", tab_prefix,
(longlong_t)lr->lr_mode);
}
if (lr->lr_mask & AT_UID) {
(void) printf("%sAT_UID %llu\n", prefix,
(void) printf("%sAT_UID %llu\n", tab_prefix,
(u_longlong_t)lr->lr_uid);
}
if (lr->lr_mask & AT_GID) {
(void) printf("%sAT_GID %llu\n", prefix,
(void) printf("%sAT_GID %llu\n", tab_prefix,
(u_longlong_t)lr->lr_gid);
}
if (lr->lr_mask & AT_SIZE) {
(void) printf("%sAT_SIZE %llu\n", prefix,
(void) printf("%sAT_SIZE %llu\n", tab_prefix,
(u_longlong_t)lr->lr_size);
}
if (lr->lr_mask & AT_ATIME) {
(void) printf("%sAT_ATIME %llu.%09llu %s", prefix,
(void) printf("%sAT_ATIME %llu.%09llu %s", tab_prefix,
(u_longlong_t)lr->lr_atime[0],
(u_longlong_t)lr->lr_atime[1],
ctime(&atime));
}
if (lr->lr_mask & AT_MTIME) {
(void) printf("%sAT_MTIME %llu.%09llu %s", prefix,
(void) printf("%sAT_MTIME %llu.%09llu %s", tab_prefix,
(u_longlong_t)lr->lr_mtime[0],
(u_longlong_t)lr->lr_mtime[1],
ctime(&mtime));
@@ -254,41 +268,43 @@ zil_prt_rec_setattr(zilog_t *zilog, int txtype, lr_setattr_t *lr)
/* ARGSUSED */
static void
zil_prt_rec_acl(zilog_t *zilog, int txtype, lr_acl_t *lr)
zil_prt_rec_acl(zilog_t *zilog, int txtype, void *arg)
{
(void) printf("%sfoid %llu, aclcnt %llu\n", prefix,
lr_acl_t *lr = arg;
(void) printf("%sfoid %llu, aclcnt %llu\n", tab_prefix,
(u_longlong_t)lr->lr_foid, (u_longlong_t)lr->lr_aclcnt);
}
typedef void (*zil_prt_rec_func_t)(zilog_t *, int, void *);
typedef struct zil_rec_info {
zil_prt_rec_func_t zri_print;
char *zri_name;
const char *zri_name;
uint64_t zri_count;
} zil_rec_info_t;
static zil_rec_info_t zil_rec_info[TX_MAX_TYPE] = {
{ NULL, "Total " },
{ (zil_prt_rec_func_t)zil_prt_rec_create, "TX_CREATE " },
{ (zil_prt_rec_func_t)zil_prt_rec_create, "TX_MKDIR " },
{ (zil_prt_rec_func_t)zil_prt_rec_create, "TX_MKXATTR " },
{ (zil_prt_rec_func_t)zil_prt_rec_create, "TX_SYMLINK " },
{ (zil_prt_rec_func_t)zil_prt_rec_remove, "TX_REMOVE " },
{ (zil_prt_rec_func_t)zil_prt_rec_remove, "TX_RMDIR " },
{ (zil_prt_rec_func_t)zil_prt_rec_link, "TX_LINK " },
{ (zil_prt_rec_func_t)zil_prt_rec_rename, "TX_RENAME " },
{ (zil_prt_rec_func_t)zil_prt_rec_write, "TX_WRITE " },
{ (zil_prt_rec_func_t)zil_prt_rec_truncate, "TX_TRUNCATE " },
{ (zil_prt_rec_func_t)zil_prt_rec_setattr, "TX_SETATTR " },
{ (zil_prt_rec_func_t)zil_prt_rec_acl, "TX_ACL_V0 " },
{ (zil_prt_rec_func_t)zil_prt_rec_acl, "TX_ACL_ACL " },
{ (zil_prt_rec_func_t)zil_prt_rec_create, "TX_CREATE_ACL " },
{ (zil_prt_rec_func_t)zil_prt_rec_create, "TX_CREATE_ATTR " },
{ (zil_prt_rec_func_t)zil_prt_rec_create, "TX_CREATE_ACL_ATTR " },
{ (zil_prt_rec_func_t)zil_prt_rec_create, "TX_MKDIR_ACL " },
{ (zil_prt_rec_func_t)zil_prt_rec_create, "TX_MKDIR_ATTR " },
{ (zil_prt_rec_func_t)zil_prt_rec_create, "TX_MKDIR_ACL_ATTR " },
{ (zil_prt_rec_func_t)zil_prt_rec_write, "TX_WRITE2 " },
{.zri_print = NULL, .zri_name = "Total "},
{.zri_print = zil_prt_rec_create, .zri_name = "TX_CREATE "},
{.zri_print = zil_prt_rec_create, .zri_name = "TX_MKDIR "},
{.zri_print = zil_prt_rec_create, .zri_name = "TX_MKXATTR "},
{.zri_print = zil_prt_rec_create, .zri_name = "TX_SYMLINK "},
{.zri_print = zil_prt_rec_remove, .zri_name = "TX_REMOVE "},
{.zri_print = zil_prt_rec_remove, .zri_name = "TX_RMDIR "},
{.zri_print = zil_prt_rec_link, .zri_name = "TX_LINK "},
{.zri_print = zil_prt_rec_rename, .zri_name = "TX_RENAME "},
{.zri_print = zil_prt_rec_write, .zri_name = "TX_WRITE "},
{.zri_print = zil_prt_rec_truncate, .zri_name = "TX_TRUNCATE "},
{.zri_print = zil_prt_rec_setattr, .zri_name = "TX_SETATTR "},
{.zri_print = zil_prt_rec_acl, .zri_name = "TX_ACL_V0 "},
{.zri_print = zil_prt_rec_acl, .zri_name = "TX_ACL_ACL "},
{.zri_print = zil_prt_rec_create, .zri_name = "TX_CREATE_ACL "},
{.zri_print = zil_prt_rec_create, .zri_name = "TX_CREATE_ATTR "},
{.zri_print = zil_prt_rec_create, .zri_name = "TX_CREATE_ACL_ATTR "},
{.zri_print = zil_prt_rec_create, .zri_name = "TX_MKDIR_ACL "},
{.zri_print = zil_prt_rec_create, .zri_name = "TX_MKDIR_ATTR "},
{.zri_print = zil_prt_rec_create, .zri_name = "TX_MKDIR_ACL_ATTR "},
{.zri_print = zil_prt_rec_write, .zri_name = "TX_WRITE2 "},
};
/* ARGSUSED */
@@ -311,8 +327,13 @@ print_log_record(zilog_t *zilog, lr_t *lr, void *arg, uint64_t claim_txg)
(u_longlong_t)lr->lrc_txg,
(u_longlong_t)lr->lrc_seq);
if (txtype && verbose >= 3)
zil_rec_info[txtype].zri_print(zilog, txtype, lr);
if (txtype && verbose >= 3) {
if (!zilog->zl_os->os_encrypted) {
zil_rec_info[txtype].zri_print(zilog, txtype, lr);
} else {
(void) printf("%s(encrypted)\n", tab_prefix);
}
}
zil_rec_info[txtype].zri_count++;
zil_rec_info[0].zri_count++;
@@ -326,7 +347,7 @@ print_log_block(zilog_t *zilog, blkptr_t *bp, void *arg, uint64_t claim_txg)
{
char blkbuf[BP_SPRINTF_LEN + 10];
int verbose = MAX(dump_opt['d'], dump_opt['i']);
char *claim;
const char *claim;
if (verbose <= 3)
return (0);
@@ -341,7 +362,7 @@ print_log_block(zilog_t *zilog, blkptr_t *bp, void *arg, uint64_t claim_txg)
if (claim_txg != 0)
claim = "already claimed";
else if (bp->blk_birth >= spa_first_txg(zilog->zl_spa))
else if (bp->blk_birth >= spa_min_claim_txg(zilog->zl_spa))
claim = "will claim";
else
claim = "won't claim";
@@ -355,7 +376,7 @@ print_log_block(zilog_t *zilog, blkptr_t *bp, void *arg, uint64_t claim_txg)
static void
print_log_stats(int verbose)
{
int i, w, p10;
unsigned i, w, p10;
if (verbose > 3)
(void) printf("\n");
@@ -396,10 +417,15 @@ dump_intent_log(zilog_t *zilog)
for (i = 0; i < TX_MAX_TYPE; i++)
zil_rec_info[i].zri_count = 0;
/* see comment in zil_claim() or zil_check_log_chain() */
if (zilog->zl_spa->spa_uberblock.ub_checkpoint_txg != 0 &&
zh->zh_claim_txg == 0)
return;
if (verbose >= 2) {
(void) printf("\n");
(void) zil_parse(zilog, print_log_block, print_log_record, NULL,
zh->zh_claim_txg);
zh->zh_claim_txg, B_FALSE);
print_log_stats(verbose);
}
}
+5 -51
@@ -1,11 +1,11 @@
SUBDIRS = zed.d
include $(top_srcdir)/config/Rules.am
DEFAULT_INCLUDES += \
-I$(top_srcdir)/include \
-I$(top_srcdir)/lib/libspl/include
EXTRA_DIST = zed.d/README
sbin_PROGRAMS = zed
ZED_SRC = \
@@ -40,55 +40,9 @@ FMA_SRC = \
zed_SOURCES = $(ZED_SRC) $(FMA_SRC)
zed_LDADD = \
$(top_builddir)/lib/libavl/libavl.la \
$(top_builddir)/lib/libnvpair/libnvpair.la \
$(top_builddir)/lib/libspl/libspl.la \
$(top_builddir)/lib/libuutil/libuutil.la \
$(top_builddir)/lib/libzpool/libzpool.la \
$(top_builddir)/lib/libzfs/libzfs.la \
$(top_builddir)/lib/libzfs_core/libzfs_core.la
$(top_builddir)/lib/libzfs/libzfs.la
zed_LDFLAGS = -lrt -pthread
zedconfdir = $(sysconfdir)/zfs/zed.d
dist_zedconf_DATA = \
zed.d/zed-functions.sh \
zed.d/zed.rc
zedexecdir = $(libexecdir)/zfs/zed.d
dist_zedexec_SCRIPTS = \
zed.d/all-debug.sh \
zed.d/all-syslog.sh \
zed.d/data-notify.sh \
zed.d/generic-notify.sh \
zed.d/resilver_finish-notify.sh \
zed.d/scrub_finish-notify.sh \
zed.d/statechange-led.sh \
zed.d/statechange-notify.sh \
zed.d/vdev_clear-led.sh \
zed.d/vdev_attach-led.sh \
zed.d/pool_import-led.sh \
zed.d/resilver_finish-start-scrub.sh
zedconfdefaults = \
all-syslog.sh \
data-notify.sh \
resilver_finish-notify.sh \
scrub_finish-notify.sh \
statechange-led.sh \
statechange-notify.sh \
vdev_clear-led.sh \
vdev_attach-led.sh \
pool_import-led.sh \
resilver_finish-start-scrub.sh
install-data-hook:
$(MKDIR_P) "$(DESTDIR)$(zedconfdir)"
for f in $(zedconfdefaults); do \
test -f "$(DESTDIR)$(zedconfdir)/$${f}" -o \
-L "$(DESTDIR)$(zedconfdir)/$${f}" || \
ln -s "$(zedexecdir)/$${f}" "$(DESTDIR)$(zedconfdir)"; \
done
chmod 0600 "$(DESTDIR)$(zedconfdir)/zed.rc"
zed_LDADD += -lrt
zed_LDFLAGS = -pthread
+92 -38
@@ -12,6 +12,7 @@
/*
* Copyright (c) 2016, Intel Corporation.
* Copyright (c) 2018, loli10K <ezomori.nozomu@gmail.com>
*/
#include <libnvpair.h>
@@ -53,13 +54,25 @@ pthread_t g_agents_tid;
libzfs_handle_t *g_zfs_hdl;
/* guid search data */
typedef enum device_type {
DEVICE_TYPE_L2ARC, /* l2arc device */
DEVICE_TYPE_SPARE, /* spare device */
DEVICE_TYPE_PRIMARY /* any primary pool storage device */
} device_type_t;
typedef struct guid_search {
uint64_t gs_pool_guid;
uint64_t gs_vdev_guid;
char *gs_devid;
device_type_t gs_vdev_type;
uint64_t gs_vdev_expandtime; /* vdev expansion time */
} guid_search_t;
static void
/*
* Walks the vdev tree recursively looking for a matching devid.
* Returns B_TRUE as soon as a matching device is found, B_FALSE otherwise.
*/
static boolean_t
zfs_agent_iter_vdev(zpool_handle_t *zhp, nvlist_t *nvl, void *arg)
{
guid_search_t *gsp = arg;
@@ -72,19 +85,48 @@ zfs_agent_iter_vdev(zpool_handle_t *zhp, nvlist_t *nvl, void *arg)
*/
if (nvlist_lookup_nvlist_array(nvl, ZPOOL_CONFIG_CHILDREN,
&child, &children) == 0) {
for (c = 0; c < children; c++)
zfs_agent_iter_vdev(zhp, child[c], gsp);
return;
for (c = 0; c < children; c++) {
if (zfs_agent_iter_vdev(zhp, child[c], gsp)) {
gsp->gs_vdev_type = DEVICE_TYPE_PRIMARY;
return (B_TRUE);
}
}
}
/*
* On a devid match, grab the vdev guid
* Iterate over any spares and cache devices
*/
if ((gsp->gs_vdev_guid == 0) &&
if (nvlist_lookup_nvlist_array(nvl, ZPOOL_CONFIG_SPARES,
&child, &children) == 0) {
for (c = 0; c < children; c++) {
if (zfs_agent_iter_vdev(zhp, child[c], gsp)) {
gsp->gs_vdev_type = DEVICE_TYPE_SPARE;
return (B_TRUE);
}
}
}
if (nvlist_lookup_nvlist_array(nvl, ZPOOL_CONFIG_L2CACHE,
&child, &children) == 0) {
for (c = 0; c < children; c++) {
if (zfs_agent_iter_vdev(zhp, child[c], gsp)) {
gsp->gs_vdev_type = DEVICE_TYPE_L2ARC;
return (B_TRUE);
}
}
}
/*
* On a devid match, grab the vdev guid and expansion time, if any.
*/
if (gsp->gs_devid != NULL &&
(nvlist_lookup_string(nvl, ZPOOL_CONFIG_DEVID, &path) == 0) &&
(strcmp(gsp->gs_devid, path) == 0)) {
(void) nvlist_lookup_uint64(nvl, ZPOOL_CONFIG_GUID,
&gsp->gs_vdev_guid);
(void) nvlist_lookup_uint64(nvl, ZPOOL_CONFIG_EXPANSION_TIME,
&gsp->gs_vdev_expandtime);
return (B_TRUE);
}
return (B_FALSE);
}
static int
@@ -99,7 +141,7 @@ zfs_agent_iter_pool(zpool_handle_t *zhp, void *arg)
if ((config = zpool_get_config(zhp, NULL)) != NULL) {
if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE,
&nvl) == 0) {
zfs_agent_iter_vdev(zhp, nvl, gsp);
(void) zfs_agent_iter_vdev(zhp, nvl, gsp);
}
}
/*
@@ -148,6 +190,8 @@ zfs_agent_post_event(const char *class, const char *subclass, nvlist_t *nvl)
struct timeval tv;
int64_t tod[2];
uint64_t pool_guid = 0, vdev_guid = 0;
guid_search_t search = { 0 };
device_type_t devtype = DEVICE_TYPE_PRIMARY;
class = "resource.fs.zfs.removed";
subclass = "";
@@ -156,30 +200,55 @@ zfs_agent_post_event(const char *class, const char *subclass, nvlist_t *nvl)
(void) nvlist_lookup_uint64(nvl, ZFS_EV_POOL_GUID, &pool_guid);
(void) nvlist_lookup_uint64(nvl, ZFS_EV_VDEV_GUID, &vdev_guid);
(void) gettimeofday(&tv, NULL);
tod[0] = tv.tv_sec;
tod[1] = tv.tv_usec;
(void) nvlist_add_int64_array(payload, FM_EREPORT_TIME, tod, 2);
/*
* For multipath, ZFS_EV_VDEV_GUID is missing so find it.
* For multipath, spare and l2arc devices ZFS_EV_VDEV_GUID or
* ZFS_EV_POOL_GUID may be missing so find them.
*/
if (vdev_guid == 0) {
guid_search_t search = { 0 };
(void) nvlist_lookup_string(nvl, DEV_IDENTIFIER,
&search.gs_devid);
(void) zpool_iter(g_zfs_hdl, zfs_agent_iter_pool, &search);
pool_guid = search.gs_pool_guid;
vdev_guid = search.gs_vdev_guid;
devtype = search.gs_vdev_type;
(void) nvlist_lookup_string(nvl, DEV_IDENTIFIER,
&search.gs_devid);
(void) zpool_iter(g_zfs_hdl, zfs_agent_iter_pool,
&search);
pool_guid = search.gs_pool_guid;
vdev_guid = search.gs_vdev_guid;
/*
* We want to avoid reporting "remove" events coming from
* libudev for VDEVs which were expanded recently (10s) and
* avoid activating spares in response to partitions being
* deleted and created in rapid succession.
*/
if (search.gs_vdev_expandtime != 0 &&
search.gs_vdev_expandtime + 10 > tv.tv_sec) {
zed_log_msg(LOG_INFO, "agent post event: ignoring '%s' "
"for recently expanded device '%s'", EC_DEV_REMOVE,
search.gs_devid);
goto out;
}
(void) nvlist_add_uint64(payload,
FM_EREPORT_PAYLOAD_ZFS_POOL_GUID, pool_guid);
(void) nvlist_add_uint64(payload,
FM_EREPORT_PAYLOAD_ZFS_VDEV_GUID, vdev_guid);
(void) gettimeofday(&tv, NULL);
tod[0] = tv.tv_sec;
tod[1] = tv.tv_usec;
(void) nvlist_add_int64_array(payload, FM_EREPORT_TIME, tod, 2);
switch (devtype) {
case DEVICE_TYPE_L2ARC:
(void) nvlist_add_string(payload,
FM_EREPORT_PAYLOAD_ZFS_VDEV_TYPE,
VDEV_TYPE_L2CACHE);
break;
case DEVICE_TYPE_SPARE:
(void) nvlist_add_string(payload,
FM_EREPORT_PAYLOAD_ZFS_VDEV_TYPE, VDEV_TYPE_SPARE);
break;
case DEVICE_TYPE_PRIMARY:
(void) nvlist_add_string(payload,
FM_EREPORT_PAYLOAD_ZFS_VDEV_TYPE, VDEV_TYPE_DISK);
break;
}
zed_log_msg(LOG_INFO, "agent post event: mapping '%s' to '%s'",
EC_DEV_REMOVE, class);
@@ -193,6 +262,7 @@ zfs_agent_post_event(const char *class, const char *subclass, nvlist_t *nvl)
list_insert_tail(&agent_events, event);
(void) pthread_mutex_unlock(&agent_lock);
out:
(void) pthread_cond_signal(&agent_cond);
}
@@ -350,19 +420,3 @@ zfs_agent_fini(void)
g_zfs_hdl = NULL;
}
/*
* In ZED context, all the FMA agents run in the same thread
* and do not require a unique libzfs instance. Modules should
* use these stubs.
*/
libzfs_handle_t *
__libzfs_init(void)
{
return (g_zfs_hdl);
}
void
__libzfs_fini(libzfs_handle_t *hdl)
{
}
-7
@@ -39,13 +39,6 @@ extern int zfs_slm_init(void);
extern void zfs_slm_fini(void);
extern void zfs_slm_event(const char *, const char *, nvlist_t *);
/*
* In ZED context, all the FMA agents run in the same thread
* and do not require a unique libzfs instance.
*/
extern libzfs_handle_t *__libzfs_init(void);
extern void __libzfs_fini(libzfs_handle_t *);
#ifdef __cplusplus
}
#endif
+12 -73
@@ -26,6 +26,7 @@
*/
#include <stddef.h>
#include <string.h>
#include <strings.h>
#include <libuutil.h>
#include <libzfs.h>
@@ -167,14 +168,12 @@ zfs_case_unserialize(fmd_hdl_t *hdl, fmd_case_t *cp)
static void
zfs_mark_vdev(uint64_t pool_guid, nvlist_t *vd, er_timeval_t *loaded)
{
uint64_t vdev_guid;
uint64_t vdev_guid = 0;
uint_t c, children;
nvlist_t **child;
zfs_case_t *zcp;
int ret;
ret = nvlist_lookup_uint64(vd, ZPOOL_CONFIG_GUID, &vdev_guid);
assert(ret == 0);
(void) nvlist_lookup_uint64(vd, ZPOOL_CONFIG_GUID, &vdev_guid);
/*
* Mark any cases associated with this (pool, vdev) pair.
@@ -253,7 +252,10 @@ zfs_mark_pool(zpool_handle_t *zhp, void *unused)
}
ret = nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE, &vd);
assert(ret == 0);
if (ret) {
zpool_close(zhp);
return (-1);
}
zfs_mark_vdev(pool_guid, vd, &loaded);
@@ -377,11 +379,6 @@ zfs_case_solve(fmd_hdl_t *hdl, zfs_case_t *zcp, const char *faultname,
nvlist_t *detector, *fault;
boolean_t serialize;
nvlist_t *fru = NULL;
#ifdef HAVE_LIBTOPO
nvlist_t *fmri;
topo_hdl_t *thp;
int err;
#endif
fmd_hdl_debug(hdl, "solving fault '%s'", faultname);
/*
@@ -400,64 +397,6 @@ zfs_case_solve(fmd_hdl_t *hdl, zfs_case_t *zcp, const char *faultname,
zcp->zc_data.zc_vdev_guid);
}
#ifdef HAVE_LIBTOPO
/*
* We also want to make sure that the detector (pool or vdev) properly
* reflects the diagnosed state, when the fault corresponds to internal
* ZFS state (i.e. not checksum or I/O error-induced). Otherwise, a
* device which was unavailable early in boot (because the driver/file
* wasn't available) and is now healthy will be mis-diagnosed.
*/
if (!fmd_nvl_fmri_present(hdl, detector) ||
(checkunusable && !fmd_nvl_fmri_unusable(hdl, detector))) {
fmd_case_close(hdl, zcp->zc_case);
nvlist_free(detector);
return;
}
fru = NULL;
if (zcp->zc_fru != NULL &&
(thp = fmd_hdl_topo_hold(hdl, TOPO_VERSION)) != NULL) {
/*
* If the vdev had an associated FRU, then get the FRU nvlist
* from the topo handle and use that in the suspect list. We
* explicitly lookup the FRU because the fmri reported from the
* kernel may not have up to date details about the disk itself
* (serial, part, etc).
*/
if (topo_fmri_str2nvl(thp, zcp->zc_fru, &fmri, &err) == 0) {
libzfs_handle_t *zhdl = fmd_hdl_getspecific(hdl);
/*
* If the disk is part of the system chassis, but the
* FRU indicates a different chassis ID than our
* current system, then ignore the error. This
* indicates that the device was part of another
* cluster head, and for obvious reasons cannot be
* imported on this system.
*/
if (libzfs_fru_notself(zhdl, zcp->zc_fru)) {
fmd_case_close(hdl, zcp->zc_case);
nvlist_free(fmri);
fmd_hdl_topo_rele(hdl, thp);
nvlist_free(detector);
return;
}
/*
* If the device is no longer present on the system, or
* topo_fmri_fru() fails for other reasons, then fall
* back to the fmri specified in the vdev.
*/
if (topo_fmri_fru(thp, fmri, &fru, &err) != 0)
fru = fmd_nvl_dup(hdl, fmri, FMD_SLEEP);
nvlist_free(fmri);
}
fmd_hdl_topo_rele(hdl, thp);
}
#endif
fault = fmd_nvl_create_fault(hdl, faultname, 100, detector,
fru, detector);
fmd_case_add_suspect(hdl, zcp->zc_case, fault);
@@ -982,27 +921,27 @@ _zfs_diagnosis_init(fmd_hdl_t *hdl)
{
libzfs_handle_t *zhdl;
if ((zhdl = __libzfs_init()) == NULL)
if ((zhdl = libzfs_init()) == NULL)
return;
if ((zfs_case_pool = uu_list_pool_create("zfs_case_pool",
sizeof (zfs_case_t), offsetof(zfs_case_t, zc_node),
NULL, UU_LIST_POOL_DEBUG)) == NULL) {
__libzfs_fini(zhdl);
libzfs_fini(zhdl);
return;
}
if ((zfs_cases = uu_list_create(zfs_case_pool, NULL,
UU_LIST_DEBUG)) == NULL) {
uu_list_pool_destroy(zfs_case_pool);
__libzfs_fini(zhdl);
libzfs_fini(zhdl);
return;
}
if (fmd_hdl_register(hdl, FMD_API_VERSION, &fmd_info) != 0) {
uu_list_destroy(zfs_cases);
uu_list_pool_destroy(zfs_case_pool);
__libzfs_fini(zhdl);
libzfs_fini(zhdl);
return;
}
@@ -1038,5 +977,5 @@ _zfs_diagnosis_fini(fmd_hdl_t *hdl)
uu_list_pool_destroy(zfs_case_pool);
zhdl = fmd_hdl_getspecific(hdl);
__libzfs_fini(zhdl);
libzfs_fini(zhdl);
}
+98 -59
@@ -23,6 +23,7 @@
* Copyright (c) 2012 by Delphix. All rights reserved.
* Copyright 2014 Nexenta Systems, Inc. All rights reserved.
* Copyright (c) 2016, 2017, Intel Corporation.
* Copyright (c) 2017 Open-E, Inc. All Rights Reserved.
*/
/*
@@ -63,7 +64,6 @@
* trigger the FMA fault that we skipped earlier.
*
* ZFS on Linux porting notes:
* In lieu of a thread pool, just spawn a thread on demand.
* Linux udev provides a disk insert for both the disk and the partition
*
*/
@@ -73,6 +73,7 @@
#include <fcntl.h>
#include <libnvpair.h>
#include <libzfs.h>
#include <libzutil.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>
@@ -82,8 +83,10 @@
#include <sys/sunddi.h>
#include <sys/sysevent/eventdefs.h>
#include <sys/sysevent/dev.h>
#include <thread_pool.h>
#include <pthread.h>
#include <unistd.h>
#include <errno.h>
#include "zfs_agents.h"
#include "../zed_log.h"
@@ -96,12 +99,12 @@ typedef void (*zfs_process_func_t)(zpool_handle_t *, nvlist_t *, boolean_t);
libzfs_handle_t *g_zfshdl;
list_t g_pool_list; /* list of unavailable pools at initialization */
list_t g_device_list; /* list of disks with asynchronous label request */
tpool_t *g_tpool;
boolean_t g_enumeration_done;
pthread_t g_zfs_tid;
pthread_t g_zfs_tid; /* zfs_enum_pools() thread */
typedef struct unavailpool {
zpool_handle_t *uap_zhp;
pthread_t uap_enable_tid; /* dataset enable thread if activated */
list_node_t uap_node;
} unavailpool_t;
@@ -134,7 +137,6 @@ zfs_unavail_pool(zpool_handle_t *zhp, void *data)
unavailpool_t *uap;
uap = malloc(sizeof (unavailpool_t));
uap->uap_zhp = zhp;
uap->uap_enable_tid = 0;
list_insert_tail((list_t *)data, uap);
} else {
zpool_close(zhp);
@@ -426,8 +428,16 @@ zfs_process_add(zpool_handle_t *zhp, nvlist_t *vdev, boolean_t labeled)
nvlist_free(newvd);
/*
* auto replace a leaf disk at same physical location
* Wait for udev to verify the links exist, then auto-replace
* the leaf disk at the same physical location.
*/
if (zpool_label_disk_wait(path, 3000) != 0) {
zed_log_msg(LOG_WARNING, "zfs_mod: expected replacement "
"disk %s is missing", path);
nvlist_free(nvroot);
return;
}
ret = zpool_vdev_attach(zhp, fullpath, path, nvroot, B_TRUE);
zed_log_msg(LOG_INFO, " zpool_vdev_replace: %s with %s (%s)",
@@ -466,7 +476,20 @@ zfs_iter_vdev(zpool_handle_t *zhp, nvlist_t *nvl, void *data)
&child, &children) == 0) {
for (c = 0; c < children; c++)
zfs_iter_vdev(zhp, child[c], data);
return;
}
/*
* Iterate over any spares and cache devices
*/
if (nvlist_lookup_nvlist_array(nvl, ZPOOL_CONFIG_SPARES,
&child, &children) == 0) {
for (c = 0; c < children; c++)
zfs_iter_vdev(zhp, child[c], data);
}
if (nvlist_lookup_nvlist_array(nvl, ZPOOL_CONFIG_L2CACHE,
&child, &children) == 0) {
for (c = 0; c < children; c++)
zfs_iter_vdev(zhp, child[c], data);
}
/* once a vdev was matched and processed there is nothing left to do */
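Review note: the hunk above extends the vdev walk so that spares and cache devices are visited as well as regular children. A minimal sketch of that traversal order follows, assuming a hypothetical `vdev_node_t` struct in place of the nvlist config (none of these names exist in libzfs), and assuming for simplicity that only leaf nodes are handed to the callback.

```c
#include <stddef.h>

/* Hypothetical stand-in for a vdev config node; not a libzfs type. */
typedef struct vdev_node {
	const char *vn_name;
	struct vdev_node **vn_children;
	size_t vn_nchildren;
	struct vdev_node **vn_spares;
	size_t vn_nspares;
	struct vdev_node **vn_l2cache;
	size_t vn_nl2cache;
} vdev_node_t;

typedef void (*vdev_visit_t)(const vdev_node_t *, void *);

/* Walk regular children first, then spares, then cache devices. */
static void
walk_vdevs(const vdev_node_t *vn, vdev_visit_t visit, void *arg)
{
	size_t c;

	for (c = 0; c < vn->vn_nchildren; c++)
		walk_vdevs(vn->vn_children[c], visit, arg);
	for (c = 0; c < vn->vn_nspares; c++)
		walk_vdevs(vn->vn_spares[c], visit, arg);
	for (c = 0; c < vn->vn_nl2cache; c++)
		walk_vdevs(vn->vn_l2cache[c], visit, arg);

	/* Sketch assumption: only nodes without sub-devices are processed. */
	if (vn->vn_nchildren == 0 && vn->vn_nspares == 0 &&
	    vn->vn_nl2cache == 0)
		visit(vn, arg);
}

/* Example callback: count the leaves visited. */
static void
count_leaf(const vdev_node_t *vn, void *arg)
{
	(void) vn;
	(*(size_t *)arg)++;
}
```

With a root holding two data disks, one spare and one cache device, `walk_vdevs(&root, count_leaf, &n)` reaches all four leaves; without the spare and cache loops only the two data disks would be seen.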
@@ -511,19 +534,14 @@ zfs_iter_vdev(zpool_handle_t *zhp, nvlist_t *nvl, void *data)
(dp->dd_func)(zhp, nvl, dp->dd_islabeled);
}
static void *
void
zfs_enable_ds(void *arg)
{
unavailpool_t *pool = (unavailpool_t *)arg;
assert(pool->uap_enable_tid = pthread_self());
(void) zpool_enable_datasets(pool->uap_zhp, NULL, 0);
zpool_close(pool->uap_zhp);
pool->uap_zhp = NULL;
/* Note: zfs_slm_fini() will cleanup this pool entry on exit */
return (NULL);
free(pool);
}
static int
@@ -558,15 +576,13 @@ zfs_iter_pool(zpool_handle_t *zhp, void *data)
for (pool = list_head(&g_pool_list); pool != NULL;
pool = list_next(&g_pool_list, pool)) {
if (pool->uap_enable_tid != 0)
continue; /* entry already processed */
if (strcmp(zpool_get_name(zhp),
zpool_get_name(pool->uap_zhp)))
continue;
if (zfs_toplevel_state(zhp) >= VDEV_STATE_DEGRADED) {
/* send to a background thread; keep on list */
(void) pthread_create(&pool->uap_enable_tid,
NULL, zfs_enable_ds, pool);
list_remove(&g_pool_list, pool);
(void) tpool_dispatch(g_tpool, zfs_enable_ds,
pool);
break;
}
}
@@ -703,8 +719,8 @@ zfsdle_vdev_online(zpool_handle_t *zhp, void *data)
{
char *devname = data;
boolean_t avail_spare, l2cache;
vdev_state_t newstate;
nvlist_t *tgt;
int error;
zed_log_msg(LOG_INFO, "zfsdle_vdev_online: searching for '%s' in '%s'",
devname, zpool_get_name(zhp));
@@ -712,40 +728,58 @@ zfsdle_vdev_online(zpool_handle_t *zhp, void *data)
if ((tgt = zpool_find_vdev_by_physpath(zhp, devname,
&avail_spare, &l2cache, NULL)) != NULL) {
char *path, fullpath[MAXPATHLEN];
uint64_t wholedisk = 0ULL;
uint64_t wholedisk;
verify(nvlist_lookup_string(tgt, ZPOOL_CONFIG_PATH,
&path) == 0);
verify(nvlist_lookup_uint64(tgt, ZPOOL_CONFIG_WHOLE_DISK,
&wholedisk) == 0);
error = nvlist_lookup_string(tgt, ZPOOL_CONFIG_PATH, &path);
if (error) {
zpool_close(zhp);
return (0);
}
error = nvlist_lookup_uint64(tgt, ZPOOL_CONFIG_WHOLE_DISK,
&wholedisk);
if (error)
wholedisk = 0;
(void) strlcpy(fullpath, path, sizeof (fullpath));
if (wholedisk) {
char *spath = zfs_strip_partition(fullpath);
if (!spath) {
zed_log_msg(LOG_INFO, "%s: Can't alloc",
__func__);
path = strrchr(path, '/');
if (path != NULL) {
path = zfs_strip_partition(path + 1);
if (path == NULL) {
zpool_close(zhp);
return (0);
}
} else {
zpool_close(zhp);
return (0);
}
(void) strlcpy(fullpath, spath, sizeof (fullpath));
free(spath);
(void) strlcpy(fullpath, path, sizeof (fullpath));
free(path);
/*
* We need to reopen the pool associated with this
* device so that the kernel can update the size
* of the expanded device.
* device so that the kernel can update the size of
* the expanded device. When expanding there is no
* need to restart the scrub from the beginning.
*/
(void) zpool_reopen(zhp);
boolean_t scrub_restart = B_FALSE;
(void) zpool_reopen_one(zhp, &scrub_restart);
} else {
(void) strlcpy(fullpath, path, sizeof (fullpath));
}
if (zpool_get_prop_int(zhp, ZPOOL_PROP_AUTOEXPAND, NULL)) {
zed_log_msg(LOG_INFO, "zfsdle_vdev_online: setting "
"device '%s' to ONLINE state in pool '%s'",
fullpath, zpool_get_name(zhp));
if (zpool_get_state(zhp) != POOL_STATE_UNAVAIL)
(void) zpool_vdev_online(zhp, fullpath, 0,
vdev_state_t newstate;
if (zpool_get_state(zhp) != POOL_STATE_UNAVAIL) {
error = zpool_vdev_online(zhp, fullpath, 0,
&newstate);
zed_log_msg(LOG_INFO, "zfsdle_vdev_online: "
"setting device '%s' to ONLINE state "
"in pool '%s': %d", fullpath,
zpool_get_name(zhp), error);
}
}
zpool_close(zhp);
return (1);
@@ -755,23 +789,32 @@ zfsdle_vdev_online(zpool_handle_t *zhp, void *data)
}
/*
* This function handles the ESC_DEV_DLE event.
* This function handles the ESC_DEV_DLE device change event. Use the
* provided vdev guid when looking up a disk or partition. When the guid
* is not present, assume the entire disk is owned by ZFS, append the
* expected -part1 partition information, then look up by physical path.
*/
static int
zfs_deliver_dle(nvlist_t *nvl)
{
char *devname;
char *devname, name[MAXPATHLEN];
uint64_t guid;
if (nvlist_lookup_string(nvl, DEV_PHYS_PATH, &devname) != 0) {
zed_log_msg(LOG_INFO, "zfs_deliver_dle: no physpath");
return (-1);
if (nvlist_lookup_uint64(nvl, ZFS_EV_VDEV_GUID, &guid) == 0) {
sprintf(name, "%llu", (u_longlong_t)guid);
} else if (nvlist_lookup_string(nvl, DEV_PHYS_PATH, &devname) == 0) {
strlcpy(name, devname, MAXPATHLEN);
zfs_append_partition(name, MAXPATHLEN);
} else {
zed_log_msg(LOG_INFO, "zfs_deliver_dle: no guid or physpath");
}
if (zpool_iter(g_zfshdl, zfsdle_vdev_online, devname) != 1) {
if (zpool_iter(g_zfshdl, zfsdle_vdev_online, name) != 1) {
zed_log_msg(LOG_INFO, "zfs_deliver_dle: device '%s' not "
"found", devname);
"found", name);
return (1);
}
return (0);
}
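Review note: the new `zfs_deliver_dle()` prefers the vdev guid and falls back to the physical path with a partition suffix. The selection can be sketched as below, assuming a hypothetical helper `dle_lookup_key()` (not part of the patch) and a simplified stand-in for `zfs_append_partition()` that always appends `-part1`. Unlike the hunk above, this sketch returns an error when neither value is present; that is a choice of the sketch, not a description of the patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: build the lookup key for a DLE event.
 * have_guid/guid model ZFS_EV_VDEV_GUID; physpath models DEV_PHYS_PATH
 * and is NULL when absent. Returns 0 on success, -1 when neither value
 * is present or the key does not fit in the buffer.
 */
static int
dle_lookup_key(int have_guid, uint64_t guid, const char *physpath,
    char *key, size_t len)
{
	int n;

	if (have_guid) {
		n = snprintf(key, len, "%llu", (unsigned long long)guid);
	} else if (physpath != NULL) {
		/* Simplified stand-in for zfs_append_partition(). */
		n = snprintf(key, len, "%s-part1", physpath);
	} else {
		return (-1);
	}

	/* snprintf() reports the untruncated length; reject truncation. */
	return ((n < 0 || (size_t)n >= len) ? -1 : 0);
}
```

A caller would pass the resulting key to the per-pool iterator only when the helper returns 0.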
@@ -854,7 +897,7 @@ zfs_enum_pools(void *arg)
int
zfs_slm_init()
{
if ((g_zfshdl = __libzfs_init()) == NULL)
if ((g_zfshdl = libzfs_init()) == NULL)
return (-1);
/*
@@ -866,7 +909,7 @@ zfs_slm_init()
if (pthread_create(&g_zfs_tid, NULL, zfs_enum_pools, NULL) != 0) {
list_destroy(&g_pool_list);
__libzfs_fini(g_zfshdl);
libzfs_fini(g_zfshdl);
return (-1);
}
@@ -884,19 +927,15 @@ zfs_slm_fini()
/* wait for zfs_enum_pools thread to complete */
(void) pthread_join(g_zfs_tid, NULL);
/* destroy the thread pool */
if (g_tpool != NULL) {
tpool_wait(g_tpool);
tpool_destroy(g_tpool);
}
while ((pool = (list_head(&g_pool_list))) != NULL) {
/*
* each pool entry has two possibilities
* 1. was made available (so wait for zfs_enable_ds thread)
* 2. still unavailable (just close the pool)
*/
if (pool->uap_enable_tid)
(void) pthread_join(pool->uap_enable_tid, NULL);
else if (pool->uap_zhp != NULL)
zpool_close(pool->uap_zhp);
list_remove(&g_pool_list, pool);
zpool_close(pool->uap_zhp);
free(pool);
}
list_destroy(&g_pool_list);
@@ -907,7 +946,7 @@ zfs_slm_fini()
}
list_destroy(&g_device_list);
__libzfs_fini(g_zfshdl);
libzfs_fini(g_zfshdl);
}
void
+67 -154
@@ -22,6 +22,7 @@
* Copyright (c) 2006, 2010, Oracle and/or its affiliates. All rights reserved.
*
* Copyright (c) 2016, Intel Corporation.
* Copyright (c) 2018, loli10K <ezomori.nozomu@gmail.com>
*/
/*
@@ -71,7 +72,6 @@ zfs_retire_clear_data(fmd_hdl_t *hdl, zfs_retire_data_t *zdp)
*/
typedef struct find_cbdata {
uint64_t cb_guid;
const char *cb_fru;
zpool_handle_t *cb_zhp;
nvlist_t *cb_vdev;
} find_cbdata_t;
@@ -95,26 +95,18 @@ find_pool(zpool_handle_t *zhp, void *data)
* Find a vdev within a tree with a matching GUID.
*/
static nvlist_t *
find_vdev(libzfs_handle_t *zhdl, nvlist_t *nv, const char *search_fru,
uint64_t search_guid)
find_vdev(libzfs_handle_t *zhdl, nvlist_t *nv, uint64_t search_guid)
{
uint64_t guid;
nvlist_t **child;
uint_t c, children;
nvlist_t *ret;
char *fru;
if (search_fru != NULL) {
if (nvlist_lookup_string(nv, ZPOOL_CONFIG_FRU, &fru) == 0 &&
libzfs_fru_compare(zhdl, fru, search_fru))
return (nv);
} else {
if (nvlist_lookup_uint64(nv, ZPOOL_CONFIG_GUID, &guid) == 0 &&
guid == search_guid) {
fmd_hdl_debug(fmd_module_hdl("zfs-retire"),
"matched vdev %llu", guid);
return (nv);
}
if (nvlist_lookup_uint64(nv, ZPOOL_CONFIG_GUID, &guid) == 0 &&
guid == search_guid) {
fmd_hdl_debug(fmd_module_hdl("zfs-retire"),
"matched vdev %llu", guid);
return (nv);
}
if (nvlist_lookup_nvlist_array(nv, ZPOOL_CONFIG_CHILDREN,
@@ -122,8 +114,7 @@ find_vdev(libzfs_handle_t *zhdl, nvlist_t *nv, const char *search_fru,
return (NULL);
for (c = 0; c < children; c++) {
if ((ret = find_vdev(zhdl, child[c], search_fru,
search_guid)) != NULL)
if ((ret = find_vdev(zhdl, child[c], search_guid)) != NULL)
return (ret);
}
@@ -132,8 +123,16 @@ find_vdev(libzfs_handle_t *zhdl, nvlist_t *nv, const char *search_fru,
return (NULL);
for (c = 0; c < children; c++) {
if ((ret = find_vdev(zhdl, child[c], search_fru,
search_guid)) != NULL)
if ((ret = find_vdev(zhdl, child[c], search_guid)) != NULL)
return (ret);
}
if (nvlist_lookup_nvlist_array(nv, ZPOOL_CONFIG_SPARES,
&child, &children) != 0)
return (NULL);
for (c = 0; c < children; c++) {
if ((ret = find_vdev(zhdl, child[c], search_guid)) != NULL)
return (ret);
}
@@ -167,8 +166,7 @@ find_by_guid(libzfs_handle_t *zhdl, uint64_t pool_guid, uint64_t vdev_guid,
}
if (vdev_guid != 0) {
if ((*vdevp = find_vdev(zhdl, nvroot, NULL,
vdev_guid)) == NULL) {
if ((*vdevp = find_vdev(zhdl, nvroot, vdev_guid)) == NULL) {
zpool_close(zhp);
return (NULL);
}
@@ -177,72 +175,37 @@ find_by_guid(libzfs_handle_t *zhdl, uint64_t pool_guid, uint64_t vdev_guid,
return (zhp);
}
#ifdef HAVE_LIBTOPO
static int
search_pool(zpool_handle_t *zhp, void *data)
{
find_cbdata_t *cbp = data;
nvlist_t *config;
nvlist_t *nvroot;
config = zpool_get_config(zhp, NULL);
if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE,
&nvroot) != 0) {
zpool_close(zhp);
return (0);
}
if ((cbp->cb_vdev = find_vdev(zpool_get_handle(zhp), nvroot,
cbp->cb_fru, 0)) != NULL) {
cbp->cb_zhp = zhp;
return (1);
}
zpool_close(zhp);
return (0);
}
/*
* Given a FRU FMRI, find the matching pool and vdev.
*/
static zpool_handle_t *
find_by_fru(libzfs_handle_t *zhdl, const char *fru, nvlist_t **vdevp)
{
find_cbdata_t cb;
cb.cb_fru = fru;
cb.cb_zhp = NULL;
if (zpool_iter(zhdl, search_pool, &cb) != 1)
return (NULL);
*vdevp = cb.cb_vdev;
return (cb.cb_zhp);
}
#endif /* HAVE_LIBTOPO */
/*
* Given a vdev, attempt to replace it with every known spare until one
* succeeds.
* succeeds or we run out of devices to try.
* Return whether we were successful or not in replacing the device.
*/
static void
static boolean_t
replace_with_spare(fmd_hdl_t *hdl, zpool_handle_t *zhp, nvlist_t *vdev)
{
nvlist_t *config, *nvroot, *replacement;
nvlist_t **spares;
uint_t s, nspares;
char *dev_name;
zprop_source_t source;
int ashift;
config = zpool_get_config(zhp, NULL);
if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_VDEV_TREE,
&nvroot) != 0)
return;
return (B_FALSE);
/*
* Find out if there are any hot spares available in the pool.
*/
if (nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_SPARES,
&spares, &nspares) != 0)
return;
return (B_FALSE);
/*
* lookup "ashift" pool property, we may need it for the replacement
*/
ashift = zpool_get_prop_int(zhp, ZPOOL_PROP_ASHIFT, &source);
replacement = fmd_nvl_alloc(hdl, FMD_SLEEP);
@@ -262,6 +225,11 @@ replace_with_spare(fmd_hdl_t *hdl, zpool_handle_t *zhp, nvlist_t *vdev)
&spare_name) != 0)
continue;
/* if set, add the "ashift" pool property to the spare nvlist */
if (source != ZPROP_SRC_DEFAULT)
(void) nvlist_add_uint64(spares[s],
ZPOOL_CONFIG_ASHIFT, ashift);
(void) nvlist_add_nvlist_array(replacement,
ZPOOL_CONFIG_CHILDREN, &spares[s], 1);
@@ -269,12 +237,17 @@ replace_with_spare(fmd_hdl_t *hdl, zpool_handle_t *zhp, nvlist_t *vdev)
dev_name, basename(spare_name));
if (zpool_vdev_attach(zhp, dev_name, spare_name,
replacement, B_TRUE) == 0)
break;
replacement, B_TRUE) == 0) {
free(dev_name);
nvlist_free(replacement);
return (B_TRUE);
}
}
free(dev_name);
nvlist_free(replacement);
return (B_FALSE);
}
/*
@@ -289,10 +262,6 @@ zfs_vdev_repair(fmd_hdl_t *hdl, nvlist_t *nvl)
zfs_retire_data_t *zdp = fmd_hdl_getspecific(hdl);
zfs_retire_repaired_t *zrp;
uint64_t pool_guid, vdev_guid;
#ifdef HAVE_LIBTOPO
nvlist_t *asru;
#endif
if (nvlist_lookup_uint64(nvl, FM_EREPORT_PAYLOAD_ZFS_POOL_GUID,
&pool_guid) != 0 || nvlist_lookup_uint64(nvl,
FM_EREPORT_PAYLOAD_ZFS_VDEV_GUID, &vdev_guid) != 0)
@@ -315,47 +284,6 @@ zfs_vdev_repair(fmd_hdl_t *hdl, nvlist_t *nvl)
return;
}
#ifdef HAVE_LIBTOPO
asru = fmd_nvl_alloc(hdl, FMD_SLEEP);
(void) nvlist_add_uint8(asru, FM_VERSION, ZFS_SCHEME_VERSION0);
(void) nvlist_add_string(asru, FM_FMRI_SCHEME, FM_FMRI_SCHEME_ZFS);
(void) nvlist_add_uint64(asru, FM_FMRI_ZFS_POOL, pool_guid);
(void) nvlist_add_uint64(asru, FM_FMRI_ZFS_VDEV, vdev_guid);
/*
* We explicitly check for the unusable state here to make sure we
* aren't responding to a transient state change. As part of opening a
* vdev, it's possible to see the 'statechange' event, only to be
* followed by a vdev failure later. If we don't check the current
* state of the vdev (or pool) before marking it repaired, then we risk
* generating spurious repair events followed immediately by the same
* diagnosis.
*
* This assumes that the ZFS scheme code associated unusable (i.e.
* isolated) with its own definition of faulty state. In the case of a
* DEGRADED leaf vdev (due to checksum errors), this is not the case.
* This works, however, because the transient state change is not
* posted in this case. This could be made more explicit by not
* relying on the scheme's unusable callback and instead directly
* checking the vdev state, where we could correctly account for
* DEGRADED state.
*/
if (!fmd_nvl_fmri_unusable(hdl, asru) && fmd_nvl_fmri_has_fault(hdl,
asru, FMD_HAS_FAULT_ASRU, NULL)) {
topo_hdl_t *thp;
char *fmri = NULL;
int err;
thp = fmd_hdl_topo_hold(hdl, TOPO_VERSION);
if (topo_fmri_nvl2str(thp, asru, &fmri, &err) == 0)
(void) fmd_repair_asru(hdl, fmri);
fmd_hdl_topo_rele(hdl, thp);
topo_hdl_strfree(thp, fmri);
}
nvlist_free(asru);
#endif
zrp = fmd_hdl_alloc(hdl, sizeof (zfs_retire_repaired_t), FMD_SLEEP);
zrp->zrr_next = zdp->zrd_repaired;
zrp->zrr_pool = pool_guid;
@@ -392,10 +320,14 @@ zfs_retire_recv(fmd_hdl_t *hdl, fmd_event_t *ep, nvlist_t *nvl,
fmd_hdl_debug(hdl, "zfs_retire_recv: '%s'", class);
/*
* If this is a resource notifying us of device removal, then simply
* check for an available spare and continue.
* If this is a resource notifying us of device removal then simply
* check for an available spare and continue unless the device is a
* l2arc vdev, in which case we just offline it.
*/
if (strcmp(class, "resource.fs.zfs.removed") == 0) {
char *devtype;
char *devname;
if (nvlist_lookup_uint64(nvl, FM_EREPORT_PAYLOAD_ZFS_POOL_GUID,
&pool_guid) != 0 ||
nvlist_lookup_uint64(nvl, FM_EREPORT_PAYLOAD_ZFS_VDEV_GUID,
@@ -406,8 +338,21 @@ zfs_retire_recv(fmd_hdl_t *hdl, fmd_event_t *ep, nvlist_t *nvl,
&vdev)) == NULL)
return;
if (fmd_prop_get_int32(hdl, "spare_on_remove"))
replace_with_spare(hdl, zhp, vdev);
devname = zpool_vdev_name(NULL, zhp, vdev, B_FALSE);
/* Can't replace l2arc with a spare: offline the device */
if (nvlist_lookup_string(nvl, FM_EREPORT_PAYLOAD_ZFS_VDEV_TYPE,
&devtype) == 0 && strcmp(devtype, VDEV_TYPE_L2CACHE) == 0) {
fmd_hdl_debug(hdl, "zpool_vdev_offline '%s'", devname);
zpool_vdev_offline(zhp, devname, B_TRUE);
} else if (!fmd_prop_get_int32(hdl, "spare_on_remove") ||
replace_with_spare(hdl, zhp, vdev) == B_FALSE) {
/* Could not handle with spare: offline the device */
fmd_hdl_debug(hdl, "zpool_vdev_offline '%s'", devname);
zpool_vdev_offline(zhp, devname, B_TRUE);
}
free(devname);
zpool_close(zhp);
return;
}
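Review note: with `replace_with_spare()` now returning a boolean, the removal path can fall back to offlining the device. The decision order can be sketched as follows, assuming hypothetical names (`handle_removed`, `try_spare_fn`, `removal_action_t`) that model the hunk above and are not part of the patch.

```c
#include <stdbool.h>

typedef enum {
	REMOVAL_SPARED,		/* a hot spare took over */
	REMOVAL_OFFLINED	/* the device was taken offline */
} removal_action_t;

/* Hypothetical callback type standing in for replace_with_spare(). */
typedef bool (*try_spare_fn)(void *ctx);

static removal_action_t
handle_removed(bool is_l2cache, bool spare_on_remove,
    try_spare_fn try_spare, void *ctx)
{
	/* Cache devices cannot be replaced by a spare. */
	if (is_l2cache)
		return (REMOVAL_OFFLINED);

	/* The spare attempt is skipped entirely when the property is off. */
	if (!spare_on_remove || !try_spare(ctx))
		return (REMOVAL_OFFLINED);

	return (REMOVAL_SPARED);
}

/* Example callbacks: count the attempt in *ctx, then succeed or fail. */
static bool
spare_ok(void *ctx)
{
	(*(int *)ctx)++;
	return (true);
}

static bool
spare_fail(void *ctx)
{
	(*(int *)ctx)++;
	return (false);
}
```

The short-circuit matters: when `spare_on_remove` is unset, no spare is consumed and the device simply goes offline.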
@@ -477,39 +422,7 @@ zfs_retire_recv(fmd_hdl_t *hdl, fmd_event_t *ep, nvlist_t *nvl,
}
if (is_disk) {
#ifdef HAVE_LIBTOPO
/*
* This is a disk fault. Lookup the FRU, convert it to
* an FMRI string, and attempt to find a matching vdev.
*/
if (nvlist_lookup_nvlist(fault, FM_FAULT_FRU,
&fru) != 0 ||
nvlist_lookup_string(fru, FM_FMRI_SCHEME,
&scheme) != 0)
continue;
if (strcmp(scheme, FM_FMRI_SCHEME_HC) != 0)
continue;
thp = fmd_hdl_topo_hold(hdl, TOPO_VERSION);
if (topo_fmri_nvl2str(thp, fru, &fmri, &err) != 0) {
fmd_hdl_topo_rele(hdl, thp);
continue;
}
zhp = find_by_fru(zhdl, fmri, &vdev);
topo_hdl_strfree(thp, fmri);
fmd_hdl_topo_rele(hdl, thp);
if (zhp == NULL)
continue;
(void) nvlist_lookup_uint64(vdev,
ZPOOL_CONFIG_GUID, &vdev_guid);
aux = VDEV_AUX_EXTERNAL;
#else
continue;
#endif
} else {
/*
* This is a ZFS fault. Lookup the resource, and
@@ -583,7 +496,7 @@ zfs_retire_recv(fmd_hdl_t *hdl, fmd_event_t *ep, nvlist_t *nvl,
/*
* Attempt to substitute a hot spare.
*/
replace_with_spare(hdl, zhp, vdev);
(void) replace_with_spare(hdl, zhp, vdev);
zpool_close(zhp);
}
@@ -615,7 +528,7 @@ _zfs_retire_init(fmd_hdl_t *hdl)
zfs_retire_data_t *zdp;
libzfs_handle_t *zhdl;
if ((zhdl = __libzfs_init()) == NULL)
if ((zhdl = libzfs_init()) == NULL)
return;
if (fmd_hdl_register(hdl, FMD_API_VERSION, &fmd_info) != 0) {
@@ -636,7 +549,7 @@ _zfs_retire_fini(fmd_hdl_t *hdl)
if (zdp != NULL) {
zfs_retire_clear_data(hdl, zdp);
__libzfs_fini(zdp->zrd_hdl);
libzfs_fini(zdp->zrd_hdl);
fmd_hdl_free(hdl, zdp, sizeof (zfs_retire_data_t));
}
}
+1
@@ -0,0 +1 @@
history_event-zfs-list-cacher.sh
+57
@@ -0,0 +1,57 @@
include $(top_srcdir)/config/Rules.am
EXTRA_DIST = \
README \
history_event-zfs-list-cacher.sh.in
zedconfdir = $(sysconfdir)/zfs/zed.d
dist_zedconf_DATA = \
zed-functions.sh \
zed.rc
zedexecdir = $(zfsexecdir)/zed.d
dist_zedexec_SCRIPTS = \
all-debug.sh \
all-syslog.sh \
data-notify.sh \
generic-notify.sh \
resilver_finish-notify.sh \
scrub_finish-notify.sh \
statechange-led.sh \
statechange-notify.sh \
vdev_clear-led.sh \
vdev_attach-led.sh \
pool_import-led.sh \
resilver_finish-start-scrub.sh
nodist_zedexec_SCRIPTS = history_event-zfs-list-cacher.sh
$(nodist_zedexec_SCRIPTS): %: %.in
-$(SED) -e 's,@bindir\@,$(bindir),g' \
-e 's,@runstatedir\@,$(runstatedir),g' \
-e 's,@sbindir\@,$(sbindir),g' \
-e 's,@sysconfdir\@,$(sysconfdir),g' \
$< >'$@'
zedconfdefaults = \
all-syslog.sh \
data-notify.sh \
resilver_finish-notify.sh \
scrub_finish-notify.sh \
statechange-led.sh \
statechange-notify.sh \
vdev_clear-led.sh \
vdev_attach-led.sh \
pool_import-led.sh \
resilver_finish-start-scrub.sh
install-data-hook:
$(MKDIR_P) "$(DESTDIR)$(zedconfdir)"
for f in $(zedconfdefaults); do \
test -f "$(DESTDIR)$(zedconfdir)/$${f}" -o \
-L "$(DESTDIR)$(zedconfdir)/$${f}" || \
ln -s "$(zedexecdir)/$${f}" "$(DESTDIR)$(zedconfdir)"; \
done
chmod 0600 "$(DESTDIR)$(zedconfdir)/zed.rc"
+76
@@ -0,0 +1,76 @@
#!/bin/sh
#
# Track changes to enumerated pools for use in early-boot
set -ef
FSLIST_DIR="@sysconfdir@/zfs/zfs-list.cache"
FSLIST_TMP="@runstatedir@/zfs-list.cache.new"
FSLIST="${FSLIST_DIR}/${ZEVENT_POOL}"
# If the pool specific cache file is not writeable, abort
[ -w "${FSLIST}" ] || exit 0
[ -f "${ZED_ZEDLET_DIR}/zed.rc" ] && . "${ZED_ZEDLET_DIR}/zed.rc"
. "${ZED_ZEDLET_DIR}/zed-functions.sh"
zed_exit_if_ignoring_this_event
zed_check_cmd "${ZFS}" sort diff grep
# If we are acting on a snapshot, we have nothing to do
printf '%s' "${ZEVENT_HISTORY_DSNAME}" | grep '@' && exit 0
# We obtain a lock on zfs-list to avoid any simultaneous writes.
# If we run into trouble, log and drop the lock
abort_alter() {
zed_log_msg "Error updating zfs-list.cache!"
zed_unlock zfs-list
}
finished() {
zed_unlock zfs-list
trap - EXIT
exit 0
}
case "${ZEVENT_HISTORY_INTERNAL_NAME}" in
create|"finish receiving"|import|destroy|rename)
;;
export)
zed_lock zfs-list
trap abort_alter EXIT
echo > "${FSLIST}"
finished
;;
set|inherit)
# Only act if one of the tracked properties is altered.
case "${ZEVENT_HISTORY_INTERNAL_STR%%=*}" in
canmount|mountpoint|atime|relatime|devices|exec| \
readonly|setuid|nbmand|encroot|keylocation) ;;
*) exit 0 ;;
esac
;;
*)
# Ignore all other events.
exit 0
;;
esac
zed_lock zfs-list
trap abort_alter EXIT
PROPS="name,mountpoint,canmount,atime,relatime,devices,exec,readonly"
PROPS="${PROPS},setuid,nbmand,encroot,keylocation"
"${ZFS}" list -H -t filesystem -o $PROPS -r "${ZEVENT_POOL}" > "${FSLIST_TMP}"
# Sort the output so that it is stable
sort "${FSLIST_TMP}" -o "${FSLIST_TMP}"
# Don't modify the file if it hasn't changed
diff -q "${FSLIST_TMP}" "${FSLIST}" || mv "${FSLIST_TMP}" "${FSLIST}"
rm -f "${FSLIST_TMP}"
finished
+1 -1
@@ -165,7 +165,7 @@ process_pool()
fi
}
if [ ! -z "$ZEVENT_VDEV_ENC_SYSFS_PATH" ] && [ ! -z "$ZEVENT_VDEV_STATE_STR" ] ; then
if [ -n "$ZEVENT_VDEV_ENC_SYSFS_PATH" ] && [ -n "$ZEVENT_VDEV_STATE_STR" ] ; then
# Got a statechange for an individual VDEV
val=$(state_to_val "$ZEVENT_VDEV_STATE_STR")
vdev=$(basename "$ZEVENT_VDEV_PATH")
+1 -1
@@ -434,7 +434,7 @@ zed_guid_to_pool()
fi
guid=$(printf "%llu" "$1")
if [ ! -z "$guid" ] ; then
if [ -n "$guid" ] ; then
$ZPOOL get -H -ovalue,name guid | awk '$1=='"$guid"' {print $2}'
fi
}
+4 -3
@@ -52,9 +52,9 @@
##
# Send notifications for 'ereport.fs.zfs.data' events.
# Disabled by default
# Disabled by default, any non-empty value will enable the feature.
#
#ZED_NOTIFY_DATA=1
#ZED_NOTIFY_DATA=
##
# Pushbullet access token.
@@ -88,7 +88,8 @@ ZED_USE_ENCLOSURE_LEDS=1
##
# Run a scrub after every resilver
#ZED_SCRUB_AFTER_RESILVER=1
# Disabled by default, 1 to enable and 0 to disable.
#ZED_SCRUB_AFTER_RESILVER=0
##
# The syslog priority (e.g., specified as a "facility.level" pair).
+2 -1
@@ -21,6 +21,7 @@
#include <libnvpair.h>
#include <libudev.h>
#include <libzfs.h>
#include <libzutil.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
@@ -37,7 +38,7 @@
* A libudev monitor is established to monitor block device actions and pass
* them on to internal ZED logic modules. Initially, zfs_mod.c is the only
* consumer and is the Linux equivalent for the illumos syseventd ZFS SLM
* module responsible for handeling disk events for ZFS.
* module responsible for handling disk events for ZFS.
*/
pthread_t g_mon_tid;
+3 -4
@@ -10,13 +10,12 @@ zfs_SOURCES = \
zfs_iter.c \
zfs_iter.h \
zfs_main.c \
zfs_util.h
zfs_util.h \
zfs_project.c \
zfs_projectutil.h
zfs_LDADD = \
$(top_builddir)/lib/libnvpair/libnvpair.la \
$(top_builddir)/lib/libuutil/libuutil.la \
$(top_builddir)/lib/libzpool/libzpool.la \
$(top_builddir)/lib/libzfs/libzfs.la \
$(top_builddir)/lib/libzfs_core/libzfs_core.la
zfs_LDFLAGS = -pthread
+22 -6
@@ -31,6 +31,7 @@
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>
#include <libzfs.h>
@@ -133,16 +134,31 @@ zfs_callback(zfs_handle_t *zhp, void *data)
((cb->cb_flags & ZFS_ITER_DEPTH_LIMIT) == 0 ||
cb->cb_depth < cb->cb_depth_limit)) {
cb->cb_depth++;
if (zfs_get_type(zhp) == ZFS_TYPE_FILESYSTEM)
/*
* If we are not looking for filesystems, we don't need to
* recurse into filesystems when we are at our depth limit.
*/
if ((cb->cb_depth < cb->cb_depth_limit ||
(cb->cb_flags & ZFS_ITER_DEPTH_LIMIT) == 0 ||
(cb->cb_types &
(ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME))) &&
zfs_get_type(zhp) == ZFS_TYPE_FILESYSTEM) {
(void) zfs_iter_filesystems(zhp, zfs_callback, data);
}
if (((zfs_get_type(zhp) & (ZFS_TYPE_SNAPSHOT |
ZFS_TYPE_BOOKMARK)) == 0) && include_snaps)
ZFS_TYPE_BOOKMARK)) == 0) && include_snaps) {
(void) zfs_iter_snapshots(zhp,
(cb->cb_flags & ZFS_ITER_SIMPLE) != 0, zfs_callback,
data);
(cb->cb_flags & ZFS_ITER_SIMPLE) != 0,
zfs_callback, data, 0, 0);
}
if (((zfs_get_type(zhp) & (ZFS_TYPE_SNAPSHOT |
ZFS_TYPE_BOOKMARK)) == 0) && include_bmarks)
ZFS_TYPE_BOOKMARK)) == 0) && include_bmarks) {
(void) zfs_iter_bookmarks(zhp, zfs_callback, data);
}
cb->cb_depth--;
}
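Review note: the new condition avoids descending into child filesystems at the depth limit unless filesystems or volumes were requested. The predicate can be sketched in isolation as below, assuming illustrative flag and type values (the `ITER_*` and `TYPE_*` constants here are placeholders, not the libzfs definitions) and a hypothetical function name `should_descend`.

```c
#include <stdbool.h>

/* Illustrative bits only; the real values live in the libzfs headers. */
#define	ITER_DEPTH_LIMIT	(1 << 0)
#define	TYPE_FILESYSTEM		(1 << 0)
#define	TYPE_SNAPSHOT		(1 << 1)
#define	TYPE_VOLUME		(1 << 2)

/*
 * Decide whether to iterate the child filesystems of a dataset.
 * depth is the current depth after it has been incremented for the
 * children, mirroring the order used in the hunk above.
 */
static bool
should_descend(int depth, int depth_limit, int flags, int wanted_types,
    int this_type)
{
	/* Only filesystems have child filesystems to iterate. */
	if (this_type != TYPE_FILESYSTEM)
		return (false);

	return (depth < depth_limit ||
	    (flags & ITER_DEPTH_LIMIT) == 0 ||
	    (wanted_types & (TYPE_FILESYSTEM | TYPE_VOLUME)) != 0);
}
```

The case that changes is a depth-limited listing of snapshots only: at the limit, the child filesystems are no longer walked just to be discarded.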
@@ -224,7 +240,7 @@ zfs_compare(const void *larg, const void *rarg, void *unused)
*rat = '\0';
ret = strcmp(lname, rname);
if (ret == 0) {
if (ret == 0 && (lat != NULL || rat != NULL)) {
/*
* If we're comparing a dataset to one of its snapshots, we
* always make the full dataset first.
+1195 -111
File diff suppressed because it is too large
+295
@@ -0,0 +1,295 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2017, Intel Corporation. All rights reserved.
*/
#include <errno.h>
#include <getopt.h>
#include <stdio.h>
#include <stdlib.h>
#include <strings.h>
#include <unistd.h>
#include <fcntl.h>
#include <dirent.h>
#include <stddef.h>
#include <libintl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/list.h>
#include <sys/zfs_project.h>
#include "zfs_util.h"
#include "zfs_projectutil.h"
typedef struct zfs_project_item {
list_node_t zpi_list;
char zpi_name[0];
} zfs_project_item_t;
static void
zfs_project_item_alloc(list_t *head, const char *name)
{
zfs_project_item_t *zpi;
zpi = safe_malloc(sizeof (zfs_project_item_t) + strlen(name) + 1);
strcpy(zpi->zpi_name, name);
list_insert_tail(head, zpi);
}
static int
zfs_project_sanity_check(const char *name, zfs_project_control_t *zpc,
struct stat *st)
{
int ret;
ret = stat(name, st);
if (ret) {
(void) fprintf(stderr, gettext("failed to stat %s: %s\n"),
name, strerror(errno));
return (ret);
}
if (!S_ISREG(st->st_mode) && !S_ISDIR(st->st_mode)) {
(void) fprintf(stderr, gettext("only support project quota on "
"regular file or directory\n"));
return (-1);
}
if (!S_ISDIR(st->st_mode)) {
if (zpc->zpc_dironly) {
(void) fprintf(stderr, gettext(
"'-d' option on non-dir target %s\n"), name);
return (-1);
}
if (zpc->zpc_recursive) {
(void) fprintf(stderr, gettext(
"'-r' option on non-dir target %s\n"), name);
return (-1);
}
}
return (0);
}
static int
zfs_project_load_projid(const char *name, zfs_project_control_t *zpc)
{
zfsxattr_t fsx;
int ret, fd;
fd = open(name, O_RDONLY | O_NOCTTY);
if (fd < 0) {
(void) fprintf(stderr, gettext("failed to open %s: %s\n"),
name, strerror(errno));
return (fd);
}
ret = ioctl(fd, ZFS_IOC_FSGETXATTR, &fsx);
if (ret)
(void) fprintf(stderr,
gettext("failed to get xattr for %s: %s\n"),
name, strerror(errno));
else
zpc->zpc_expected_projid = fsx.fsx_projid;
close(fd);
return (ret);
}
static int
zfs_project_handle_one(const char *name, zfs_project_control_t *zpc)
{
zfsxattr_t fsx;
int ret, fd;
fd = open(name, O_RDONLY | O_NOCTTY);
if (fd < 0) {
if (errno == ENOENT && zpc->zpc_ignore_noent)
return (0);
(void) fprintf(stderr, gettext("failed to open %s: %s\n"),
name, strerror(errno));
return (fd);
}
ret = ioctl(fd, ZFS_IOC_FSGETXATTR, &fsx);
if (ret) {
(void) fprintf(stderr,
gettext("failed to get xattr for %s: %s\n"),
name, strerror(errno));
goto out;
}
switch (zpc->zpc_op) {
case ZFS_PROJECT_OP_LIST:
(void) printf("%5u %c %s\n", fsx.fsx_projid,
(fsx.fsx_xflags & ZFS_PROJINHERIT_FL) ? 'P' : '-', name);
goto out;
case ZFS_PROJECT_OP_CHECK:
if (fsx.fsx_projid == zpc->zpc_expected_projid &&
fsx.fsx_xflags & ZFS_PROJINHERIT_FL)
goto out;
if (!zpc->zpc_newline) {
char c = '\0';
(void) printf("%s%c", name, c);
goto out;
}
if (fsx.fsx_projid != zpc->zpc_expected_projid)
(void) printf("%s - project ID is not set properly "
"(%u/%u)\n", name, fsx.fsx_projid,
(uint32_t)zpc->zpc_expected_projid);
if (!(fsx.fsx_xflags & ZFS_PROJINHERIT_FL))
(void) printf("%s - project inherit flag is not set\n",
name);
goto out;
case ZFS_PROJECT_OP_CLEAR:
if (!(fsx.fsx_xflags & ZFS_PROJINHERIT_FL) &&
(zpc->zpc_keep_projid ||
fsx.fsx_projid == ZFS_DEFAULT_PROJID))
goto out;
fsx.fsx_xflags &= ~ZFS_PROJINHERIT_FL;
if (!zpc->zpc_keep_projid)
fsx.fsx_projid = ZFS_DEFAULT_PROJID;
break;
case ZFS_PROJECT_OP_SET:
if (fsx.fsx_projid == zpc->zpc_expected_projid &&
(!zpc->zpc_set_flag || fsx.fsx_xflags & ZFS_PROJINHERIT_FL))
goto out;
fsx.fsx_projid = zpc->zpc_expected_projid;
if (zpc->zpc_set_flag)
fsx.fsx_xflags |= ZFS_PROJINHERIT_FL;
break;
default:
ASSERT(0);
break;
}
ret = ioctl(fd, ZFS_IOC_FSSETXATTR, &fsx);
if (ret)
(void) fprintf(stderr,
gettext("failed to set xattr for %s: %s\n"),
name, strerror(errno));
out:
close(fd);
return (ret);
}
static int
zfs_project_handle_dir(const char *name, zfs_project_control_t *zpc,
list_t *head)
{
char fullname[PATH_MAX];
struct dirent *ent;
DIR *dir;
int ret = 0;
dir = opendir(name);
if (dir == NULL) {
if (errno == ENOENT && zpc->zpc_ignore_noent)
return (0);
ret = -errno;
(void) fprintf(stderr, gettext("failed to opendir %s: %s\n"),
name, strerror(errno));
return (ret);
}
/* Non-top item, ignore the case of being removed or renamed by race. */
zpc->zpc_ignore_noent = B_TRUE;
errno = 0;
while (!ret && (ent = readdir(dir)) != NULL) {
/* skip "." and ".." */
if (strcmp(ent->d_name, ".") == 0 ||
strcmp(ent->d_name, "..") == 0)
continue;
if (strlen(ent->d_name) + strlen(name) + 1 >=
sizeof (fullname)) {
errno = ENAMETOOLONG;
break;
}
sprintf(fullname, "%s/%s", name, ent->d_name);
ret = zfs_project_handle_one(fullname, zpc);
if (!ret && zpc->zpc_recursive && ent->d_type == DT_DIR)
zfs_project_item_alloc(head, fullname);
}
if (errno && !ret) {
ret = -errno;
(void) fprintf(stderr, gettext("failed to readdir %s: %s\n"),
name, strerror(errno));
}
closedir(dir);
return (ret);
}
int
zfs_project_handle(const char *name, zfs_project_control_t *zpc)
{
zfs_project_item_t *zpi;
struct stat st;
list_t head;
int ret;
ret = zfs_project_sanity_check(name, zpc, &st);
if (ret)
return (ret);
if ((zpc->zpc_op == ZFS_PROJECT_OP_SET ||
zpc->zpc_op == ZFS_PROJECT_OP_CHECK) &&
zpc->zpc_expected_projid == ZFS_INVALID_PROJID) {
ret = zfs_project_load_projid(name, zpc);
if (ret)
return (ret);
}
zpc->zpc_ignore_noent = B_FALSE;
ret = zfs_project_handle_one(name, zpc);
if (ret || !S_ISDIR(st.st_mode) || zpc->zpc_dironly ||
(!zpc->zpc_recursive &&
zpc->zpc_op != ZFS_PROJECT_OP_LIST &&
zpc->zpc_op != ZFS_PROJECT_OP_CHECK))
return (ret);
list_create(&head, sizeof (zfs_project_item_t),
offsetof(zfs_project_item_t, zpi_list));
zfs_project_item_alloc(&head, name);
while ((zpi = list_remove_head(&head)) != NULL) {
if (!ret)
ret = zfs_project_handle_dir(zpi->zpi_name, zpc, &head);
free(zpi);
}
return (ret);
}
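Review note: the directory walk in this file builds each child path in a fixed-size buffer, so the length check has to account for both the `/` separator and the terminating NUL. A minimal sketch of a bounds-checked join follows, assuming a hypothetical helper `join_path()` that is not part of the patch; it relies on `snprintf()` returning the length the full string would have needed.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: write "dir/entry" into dst.
 * Returns 0 on success, or -1 with errno set to ENAMETOOLONG when the
 * result (including the terminating NUL) does not fit in len bytes.
 */
static int
join_path(char *dst, size_t len, const char *dir, const char *entry)
{
	int n = snprintf(dst, len, "%s/%s", dir, entry);

	if (n < 0 || (size_t)n >= len) {
		errno = ENAMETOOLONG;
		return (-1);
	}
	return (0);
}
```

Letting `snprintf()` do the arithmetic avoids a hand-written length comparison, at the cost of formatting the string before knowing whether it fits.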
+49
@@ -0,0 +1,49 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright (c) 2017, Intel Corporation. All rights reserved.
*/
#ifndef _ZFS_PROJECTUTIL_H
#define _ZFS_PROJECTUTIL_H
typedef enum {
ZFS_PROJECT_OP_DEFAULT = 0,
ZFS_PROJECT_OP_LIST = 1,
ZFS_PROJECT_OP_CHECK = 2,
ZFS_PROJECT_OP_CLEAR = 3,
ZFS_PROJECT_OP_SET = 4,
} zfs_project_ops_t;
typedef struct zfs_project_control {
uint64_t zpc_expected_projid;
zfs_project_ops_t zpc_op;
boolean_t zpc_dironly;
boolean_t zpc_ignore_noent;
boolean_t zpc_keep_projid;
boolean_t zpc_newline;
boolean_t zpc_recursive;
boolean_t zpc_set_flag;
} zfs_project_control_t;
int zfs_project_handle(const char *name, zfs_project_control_t *zpc);
#endif /* _ZFS_PROJECTUTIL_H */
+1 -4
@@ -11,7 +11,4 @@ zhack_SOURCES = \
zhack_LDADD = \
$(top_builddir)/lib/libnvpair/libnvpair.la \
$(top_builddir)/lib/libuutil/libuutil.la \
$(top_builddir)/lib/libzpool/libzpool.la \
$(top_builddir)/lib/libzfs/libzfs.la \
$(top_builddir)/lib/libzfs_core/libzfs_core.la
$(top_builddir)/lib/libzpool/libzpool.la
+5 -10
View File
@@ -48,12 +48,11 @@
#include <sys/zio_compress.h>
#include <sys/zfeature.h>
#include <sys/dmu_tx.h>
#include <libzfs.h>
#include <libzutil.h>
extern boolean_t zfeature_checks_disable;
const char cmdname[] = "zhack";
libzfs_handle_t *g_zfs;
static importargs_t g_importargs;
static char *g_pool;
static boolean_t g_readonly;
@@ -105,7 +104,7 @@ fatal(spa_t *spa, void *tag, const char *fmt, ...)
/* ARGSUSED */
static int
space_delta_cb(dmu_object_type_t bonustype, void *data,
uint64_t *userp, uint64_t *groupp)
uint64_t *userp, uint64_t *groupp, uint64_t *projectp)
{
/*
* Is it a valid type of object to track?
@@ -128,20 +127,17 @@ zhack_import(char *target, boolean_t readonly)
int error;
kernel_init(readonly ? FREAD : (FREAD | FWRITE));
g_zfs = libzfs_init();
ASSERT(g_zfs != NULL);
dmu_objset_register_type(DMU_OST_ZFS, space_delta_cb);
g_readonly = readonly;
g_importargs.unique = B_TRUE;
g_importargs.can_be_active = readonly;
g_pool = strdup(target);
error = zpool_tryimport(g_zfs, target, &config, &g_importargs);
error = zpool_find_config(NULL, target, &config, &g_importargs,
&libzpool_config_ops);
if (error)
fatal(NULL, FTAG, "cannot import '%s': %s", target,
libzfs_error_description(g_zfs));
fatal(NULL, FTAG, "cannot import '%s'", target);
props = NULL;
if (readonly) {
@@ -529,7 +525,6 @@ main(int argc, char **argv)
"changes may not be committed to disk\n");
}
libzfs_fini(g_zfs);
kernel_fini();
return (rv);
+1 -4
@@ -13,7 +13,4 @@ zinject_SOURCES = \
zinject_LDADD = \
$(top_builddir)/lib/libnvpair/libnvpair.la \
$(top_builddir)/lib/libuutil/libuutil.la \
$(top_builddir)/lib/libzpool/libzpool.la \
$(top_builddir)/lib/libzfs/libzfs.la \
$(top_builddir)/lib/libzfs_core/libzfs_core.la
$(top_builddir)/lib/libzfs/libzfs.la
+16 -110
@@ -25,8 +25,6 @@
#include <libzfs.h>
#include <sys/zfs_context.h>
#include <errno.h>
#include <fcntl.h>
#include <stdarg.h>
@@ -49,9 +47,6 @@
#include "zinject.h"
extern void kernel_init(int);
extern void kernel_fini(void);
static int debug;
static void
@@ -161,51 +156,32 @@ parse_pathname(const char *inpath, char *dataset, char *relpath,
}
/*
* Convert from a (dataset, path) pair into a (objset, object) pair. Note that
* we grab the object number from the inode number, since looking this up via
* libzpool is a real pain.
* Convert from a dataset to an objset id. Note that
* we grab the object number from the inode number.
*/
/* ARGSUSED */
static int
object_from_path(const char *dataset, const char *path, struct stat64 *statbuf,
zinject_record_t *record)
object_from_path(const char *dataset, uint64_t object, zinject_record_t *record)
{
objset_t *os;
int err;
zfs_handle_t *zhp;
/*
* Before doing any libzpool operations, call sync() to ensure that the
* on-disk state is consistent with the in-core state.
*/
sync();
err = dmu_objset_own(dataset, DMU_OST_ZFS, B_TRUE, FTAG, &os);
if (err != 0) {
(void) fprintf(stderr, "cannot open dataset '%s': %s\n",
dataset, strerror(err));
if ((zhp = zfs_open(g_zfs, dataset, ZFS_TYPE_DATASET)) == NULL)
return (-1);
}
record->zi_objset = dmu_objset_id(os);
record->zi_object = statbuf->st_ino;
record->zi_objset = zfs_prop_get_int(zhp, ZFS_PROP_OBJSETID);
record->zi_object = object;
dmu_objset_disown(os, FTAG);
zfs_close(zhp);
return (0);
}
/*
* Calculate the real range based on the type, level, and range given.
* Initialize the range based on the type, level, and range given.
*/
static int
calculate_range(const char *dataset, err_type_t type, int level, char *range,
initialize_range(err_type_t type, int level, char *range,
zinject_record_t *record)
{
objset_t *os = NULL;
dnode_t *dn = NULL;
int err;
int ret = -1;
/*
* Determine the numeric range from the string.
*/
@@ -233,7 +209,7 @@ calculate_range(const char *dataset, err_type_t type, int level, char *range,
(void) fprintf(stderr, "invalid range '%s': must be "
"a numeric range of the form 'start[,end]'\n",
range);
goto out;
return (-1);
}
}
@@ -253,7 +229,7 @@ calculate_range(const char *dataset, err_type_t type, int level, char *range,
if (range != NULL) {
(void) fprintf(stderr, "range cannot be specified when "
"type is 'dnode'\n");
goto out;
return (-1);
}
record->zi_start = record->zi_object * sizeof (dnode_phys_t);
@@ -262,76 +238,9 @@ calculate_range(const char *dataset, err_type_t type, int level, char *range,
break;
}
/*
* Get the dnode associated with object, so we can calculate the block
* size.
*/
if ((err = dmu_objset_own(dataset, DMU_OST_ANY,
B_TRUE, FTAG, &os)) != 0) {
(void) fprintf(stderr, "cannot open dataset '%s': %s\n",
dataset, strerror(err));
goto out;
}
if (record->zi_object == 0) {
dn = DMU_META_DNODE(os);
} else {
err = dnode_hold(os, record->zi_object, FTAG, &dn);
if (err != 0) {
(void) fprintf(stderr, "failed to hold dnode "
"for object %llu\n",
(u_longlong_t)record->zi_object);
goto out;
}
}
ziprintf("data shift: %d\n", (int)dn->dn_datablkshift);
ziprintf(" ind shift: %d\n", (int)dn->dn_indblkshift);
/*
* Translate range into block IDs.
*/
if (record->zi_start != 0 || record->zi_end != -1ULL) {
record->zi_start >>= dn->dn_datablkshift;
record->zi_end >>= dn->dn_datablkshift;
}
/*
* Check level, and then translate level 0 blkids into ranges
* appropriate for level of indirection.
*/
record->zi_level = level;
if (level > 0) {
ziprintf("level 0 blkid range: [%llu, %llu]\n",
record->zi_start, record->zi_end);
if (level >= dn->dn_nlevels) {
(void) fprintf(stderr, "level %d exceeds max level "
"of object (%d)\n", level, dn->dn_nlevels - 1);
goto out;
}
if (record->zi_start != 0 || record->zi_end != 0) {
int shift = dn->dn_indblkshift - SPA_BLKPTRSHIFT;
for (; level > 0; level--) {
record->zi_start >>= shift;
record->zi_end >>= shift;
}
}
}
ret = 0;
out:
if (dn) {
if (dn != DMU_META_DNODE(os))
dnode_rele(dn, FTAG);
}
if (os)
dmu_objset_disown(os, FTAG);
return (ret);
return (0);
}
int
@@ -343,8 +252,6 @@ translate_record(err_type_t type, const char *object, const char *range,
struct stat64 statbuf;
int ret = -1;
kernel_init(FREAD);
debug = (getenv("ZINJECT_DEBUG") != NULL);
ziprintf("translating: %s\n", object);
@@ -396,16 +303,16 @@ translate_record(err_type_t type, const char *object, const char *range,
/*
* Convert (dataset, file) into (objset, object)
*/
if (object_from_path(dataset, path, &statbuf, record) != 0)
if (object_from_path(dataset, statbuf.st_ino, record) != 0)
goto err;
ziprintf("raw objset: %llu\n", record->zi_objset);
ziprintf("raw object: %llu\n", record->zi_object);
/*
* For the given object, calculate the real (type, level, range)
* For the given object, initialize the range in bytes
*/
if (calculate_range(dataset, type, level, (char *)range, record) != 0)
if (initialize_range(type, level, (char *)range, record) != 0)
goto err;
ziprintf(" objset: %llu\n", record->zi_objset);
@@ -427,7 +334,6 @@ translate_record(err_type_t type, const char *object, const char *range,
ret = 0;
err:
kernel_fini();
return (ret);
}
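The block-ID arithmetic that this hunk removes from `calculate_range()` is easier to follow in isolation. The following is a minimal standalone sketch of that arithmetic, not the code from this change: `sketch_translate_range`, `sketch_range_t`, and `SKETCH_BLKPTRSHIFT` are illustrative names, the two shift parameters stand in for `dn_datablkshift` and `dn_indblkshift`, and the block pointer size is assumed to be 128 bytes (a shift of 7). The level bound check against `dn_nlevels` is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: a block pointer is 128 bytes, i.e. a shift of 7. */
#define	SKETCH_BLKPTRSHIFT	7

/* Hypothetical stand-in for the zi_start/zi_end pair of a record. */
typedef struct sketch_range {
	uint64_t start;
	uint64_t end;
} sketch_range_t;

/*
 * Translate a byte range into block IDs at the requested level.
 * data_shift and ind_shift play the role of dn_datablkshift and
 * dn_indblkshift in the removed code.
 */
static sketch_range_t
sketch_translate_range(uint64_t start, uint64_t end, int data_shift,
    int ind_shift, int level)
{
	sketch_range_t r = { start, end };

	/* The whole-object range (0, UINT64_MAX) is left untouched. */
	if (r.start != 0 || r.end != UINT64_MAX) {
		r.start >>= data_shift;
		r.end >>= data_shift;
	}

	/* Each indirect block holds 2^(ind_shift - 7) block pointers. */
	if (level > 0 && (r.start != 0 || r.end != 0)) {
		int shift = ind_shift - SKETCH_BLKPTRSHIFT;

		for (; level > 0; level--) {
			r.start >>= shift;
			r.end >>= shift;
		}
	}

	return (r);
}
```

With 128 KiB data and indirect blocks (both shifts equal to 17), the byte range [128 MiB, 256 MiB] maps to level-0 block IDs [1024, 2048] and to level-1 block IDs [1, 2].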
+162 -29
@@ -36,12 +36,15 @@
*
* Errors can be injected into a particular vdev using the '-d' option. This
* option takes a path or vdev GUID to uniquely identify the device within a
* pool. There are two types of errors that can be injected, EIO and ENXIO,
* that can be controlled through the '-e' option. The default is ENXIO. For
* EIO failures, any attempt to read data from the device will return EIO, but
* subsequent attempt to reopen the device will succeed. For ENXIO failures,
* any attempt to read from the device will return EIO, but any attempt to
* reopen the device will also return ENXIO.
* pool. There are four types of errors that can be injected: EIO, ENXIO,
* ECHILD, and EILSEQ. These can be controlled through the '-e' option and the
* default is ENXIO. For EIO failures, any attempt to read data from the device
* will return EIO, but a subsequent attempt to reopen the device will succeed.
* For ENXIO failures, any attempt to read from the device will return EIO, and
* any attempt to reopen the device will return ENXIO. The EILSEQ failures
* only apply to read operations (-T read) and will flip a bit after the device
* has read the original data.
*
* For label faults, the -L option must be specified. This allows faults
* to be injected into either the nvlist, uberblock, pad1, or pad2 region
* of all the labels for the specified device.
@@ -113,9 +116,9 @@
* specified.
*
* The '-e' option takes a string describing the errno to simulate. This must
* be either 'io' or 'checksum'. In most cases this will result in the same
* behavior, but RAID-Z will produce a different set of ereports for this
* situation.
* be one of 'io', 'checksum', 'decompress', or 'decrypt'. In most cases this
* will result in the same behavior, but RAID-Z will produce a different set of
* ereports for this situation.
*
* The '-a', '-u', and '-m' flags toggle internal flush behavior. If '-a' is
* specified, then the ARC cache is flushed appropriately. If '-u' is
@@ -231,11 +234,12 @@ usage(void)
"\t\tspa_vdev_exit() will trigger a panic.\n"
"\n"
"\tzinject -d device [-e errno] [-L <nvlist|uber|pad1|pad2>] [-F]\n"
"\t [-T <read|write|free|claim|all>] [-f frequency] pool\n"
"\t\t[-T <read|write|free|claim|all>] [-f frequency] pool\n\n"
"\t\tInject a fault into a particular device or the device's\n"
"\t\tlabel. Label injection can either be 'nvlist', 'uber',\n "
"\t\t'pad1', or 'pad2'.\n"
"\t\t'errno' can be 'nxio' (the default), 'io', or 'dtl'.\n"
"\t\t'errno' can be 'nxio' (the default), 'io', 'dtl', or\n"
"\t\t'corrupt' (bit flip).\n"
"\t\t'frequency' is a value between 0.0001 and 100.0 that limits\n"
"\t\tdevice error injection to a percentage of the IOs.\n"
"\n"
@@ -287,16 +291,19 @@ usage(void)
"\t\tspecified by the remaining tuple. Each number is in\n"
"\t\thexadecimal, and only one block can be specified.\n"
"\n"
"\tzinject [-q] <-t type> [-e errno] [-l level] [-r range]\n"
"\t [-a] [-m] [-u] [-f freq] <object>\n"
"\tzinject [-q] <-t type> [-C dvas] [-e errno] [-l level]\n"
"\t\t[-r range] [-a] [-m] [-u] [-f freq] <object>\n"
"\n"
"\t\tInject an error into the object specified by the '-t' option\n"
"\t\tand the object descriptor. The 'object' parameter is\n"
"\t\tinterpreted depending on the '-t' option.\n"
"\n"
"\t\t-q\tQuiet mode. Only print out the handler number added.\n"
"\t\t-e\tInject a specific error. Must be either 'io' or\n"
"\t\t\t'checksum'. Default is 'io'.\n"
"\t\t-e\tInject a specific error. Must be one of 'io',\n"
"\t\t\t'checksum', 'decompress', or 'decrypt'. Default is 'io'.\n"
"\t\t-C\tInject the given error only into specific DVAs. The\n"
"\t\t\tDVAs should be specified as a list of 0-indexed DVAs\n"
"\t\t\tseparated by commas (ex. '0,2').\n"
"\t\t-l\tInject error at a particular block level. Default is "
"0.\n"
"\t\t-m\tAutomatically remount underlying filesystem.\n"
@@ -357,17 +364,20 @@ print_data_handler(int id, const char *pool, zinject_record_t *record,
return (0);
if (*count == 0) {
(void) printf("%3s %-15s %-6s %-6s %-8s %3s %-15s\n",
"ID", "POOL", "OBJSET", "OBJECT", "TYPE", "LVL", "RANGE");
(void) printf("%3s %-15s %-6s %-6s %-8s %3s %-4s "
"%-15s\n", "ID", "POOL", "OBJSET", "OBJECT", "TYPE",
"LVL", "DVAs", "RANGE");
(void) printf("--- --------------- ------ "
"------ -------- --- ---------------\n");
"------ -------- --- ---- ---------------\n");
}
*count += 1;
(void) printf("%3d %-15s %-6llu %-6llu %-8s %3d ", id, pool,
(u_longlong_t)record->zi_objset, (u_longlong_t)record->zi_object,
type_to_name(record->zi_type), record->zi_level);
(void) printf("%3d %-15s %-6llu %-6llu %-8s %-3d 0x%02x ",
id, pool, (u_longlong_t)record->zi_objset,
(u_longlong_t)record->zi_object, type_to_name(record->zi_type),
record->zi_level, record->zi_dvas);
if (record->zi_start == 0 &&
record->zi_end == -1ULL)
@@ -557,6 +567,7 @@ register_handler(const char *pool, int flags, zinject_record_t *record,
if (ioctl(zfs_fd, ZFS_IOC_INJECT_FAULT, &zc) != 0) {
(void) fprintf(stderr, "failed to add handler: %s\n",
errno == EDOM ? "block level exceeds max level of object" :
strerror(errno));
return (1);
}
@@ -597,6 +608,7 @@ register_handler(const char *pool, int flags, zinject_record_t *record,
(void) printf(" range: [%llu, %llu)\n",
(u_longlong_t)record->zi_start,
(u_longlong_t)record->zi_end);
(void) printf(" dvas: 0x%x\n", record->zi_dvas);
}
}
@@ -669,6 +681,59 @@ parse_frequency(const char *str, uint32_t *percent)
return (0);
}
/*
* This function converts a string specifier for DVAs into a bit mask.
* The DVAs provided by the user should be 0-indexed and separated by
* a comma. For example:
* "1" -> 0b0010 (0x2)
* "0,1" -> 0b0011 (0x3)
* "0,1,2" -> 0b0111 (0x7)
*/
static int
parse_dvas(const char *str, uint32_t *dvas_out)
{
const char *c = str;
uint32_t mask = 0;
boolean_t need_delim = B_FALSE;
/* max string length is 5 ("0,1,2") */
if (strlen(str) > 5 || strlen(str) == 0)
return (EINVAL);
while (*c != '\0') {
switch (*c) {
case '0':
case '1':
case '2':
/* check for a missing comma between DVAs */
if (need_delim)
return (EINVAL);
/* check if this DVA has been set already */
if (mask & (1 << ((*c) - '0')))
return (EINVAL);
mask |= (1 << ((*c) - '0'));
need_delim = B_TRUE;
break;
case ',':
need_delim = B_FALSE;
break;
default:
/* check for invalid character */
return (EINVAL);
}
c++;
}
/* check for dangling delimiter */
if (!need_delim)
return (EINVAL);
*dvas_out = mask;
return (0);
}
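The string-to-bitmask conversion in `parse_dvas()` can be exercised outside of zinject. The sketch below restates the same state machine using only standard C types; `sketch_parse_dvas` is an illustrative name, and the use of `bool` in place of `boolean_t` is an assumption made so the fragment compiles on its own.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/*
 * Convert a comma-separated list of DVA indices ("0", "0,1", "0,1,2")
 * into a bit mask. Returns 0 on success and EINVAL on malformed input.
 */
static int
sketch_parse_dvas(const char *str, uint32_t *dvas_out)
{
	uint32_t mask = 0;
	bool need_delim = false;
	size_t len = strlen(str);

	/* The longest valid specifier is "0,1,2" (five characters). */
	if (len == 0 || len > 5)
		return (EINVAL);

	for (const char *c = str; *c != '\0'; c++) {
		if (*c >= '0' && *c <= '2') {
			uint32_t bit = UINT32_C(1) << (*c - '0');

			/* Reject "01" (no comma) and "0,0" (duplicate). */
			if (need_delim || (mask & bit) != 0)
				return (EINVAL);
			mask |= bit;
			need_delim = true;
		} else if (*c == ',') {
			need_delim = false;
		} else {
			return (EINVAL);
		}
	}

	/* A trailing comma leaves need_delim cleared. */
	if (!need_delim)
		return (EINVAL);

	*dvas_out = mask;
	return (0);
}
```

Tracing the logic by hand, `"0,1"` yields `0x3` and `"0,"` is rejected. A leading comma such as `",1"` appears to be accepted by this state machine, since a comma only clears the delimiter flag; whether that is intended is not stated in the hunk.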
int
main(int argc, char **argv)
{
@@ -695,6 +760,7 @@ main(int argc, char **argv)
int dur_secs = 0;
int ret;
int flags = 0;
uint32_t dvas = 0;
if ((g_zfs = libzfs_init()) == NULL) {
(void) fprintf(stderr, "%s", libzfs_error_init(errno));
@@ -725,7 +791,7 @@ main(int argc, char **argv)
}
while ((c = getopt(argc, argv,
":aA:b:d:D:f:Fg:qhIc:t:T:l:mr:s:e:uL:p:")) != -1) {
":aA:b:C:d:D:f:Fg:qhIc:t:T:l:mr:s:e:uL:p:")) != -1) {
switch (c) {
case 'a':
flags |= ZINJECT_FLUSH_ARC;
@@ -749,6 +815,17 @@ main(int argc, char **argv)
case 'c':
cancel = optarg;
break;
case 'C':
ret = parse_dvas(optarg, &dvas);
if (ret != 0) {
(void) fprintf(stderr, "invalid DVA list '%s': "
"DVAs should be 0 indexed and separated by "
"commas.\n", optarg);
usage();
libzfs_fini(g_zfs);
return (1);
}
break;
case 'd':
device = optarg;
break;
@@ -770,10 +847,16 @@ main(int argc, char **argv)
error = EIO;
} else if (strcasecmp(optarg, "checksum") == 0) {
error = ECKSUM;
} else if (strcasecmp(optarg, "decompress") == 0) {
error = EINVAL;
} else if (strcasecmp(optarg, "decrypt") == 0) {
error = EACCES;
} else if (strcasecmp(optarg, "nxio") == 0) {
error = ENXIO;
} else if (strcasecmp(optarg, "dtl") == 0) {
error = ECHILD;
} else if (strcasecmp(optarg, "corrupt") == 0) {
error = EILSEQ;
} else {
(void) fprintf(stderr, "invalid error type "
"'%s': must be 'io', 'checksum' or "
@@ -843,6 +926,7 @@ main(int argc, char **argv)
break;
case 'r':
range = optarg;
flags |= ZINJECT_CALC_RANGE;
break;
case 's':
dur_secs = 1;
@@ -925,7 +1009,7 @@ main(int argc, char **argv)
*/
if (raw != NULL || range != NULL || type != TYPE_INVAL ||
level != 0 || record.zi_cmd != ZINJECT_UNINITIALIZED ||
record.zi_freq > 0) {
record.zi_freq > 0 || dvas != 0) {
(void) fprintf(stderr, "cancel (-c) incompatible with "
"any other options\n");
usage();
@@ -960,7 +1044,8 @@ main(int argc, char **argv)
* for doing injection, so handle it separately here.
*/
if (raw != NULL || range != NULL || type != TYPE_INVAL ||
level != 0 || record.zi_cmd != ZINJECT_UNINITIALIZED) {
level != 0 || record.zi_cmd != ZINJECT_UNINITIALIZED ||
dvas != 0) {
(void) fprintf(stderr, "device (-d) incompatible with "
"data error injection\n");
usage();
@@ -981,7 +1066,15 @@ main(int argc, char **argv)
if (error == ECKSUM) {
(void) fprintf(stderr, "device error type must be "
"'io' or 'nxio'\n");
"'io', 'nxio' or 'corrupt'\n");
libzfs_fini(g_zfs);
return (1);
}
if (error == EILSEQ &&
(record.zi_freq == 0 || io_type != ZIO_TYPE_READ)) {
(void) fprintf(stderr, "device corrupt errors require "
"io type read and a frequency value\n");
libzfs_fini(g_zfs);
return (1);
}
@@ -1000,7 +1093,7 @@ main(int argc, char **argv)
} else if (raw != NULL) {
if (range != NULL || type != TYPE_INVAL || level != 0 ||
record.zi_cmd != ZINJECT_UNINITIALIZED ||
record.zi_freq > 0) {
record.zi_freq > 0 || dvas != 0) {
(void) fprintf(stderr, "raw (-b) format incompatible with "
"any other options\n");
usage();
@@ -1035,7 +1128,8 @@ main(int argc, char **argv)
error = EIO;
} else if (record.zi_cmd == ZINJECT_PANIC) {
if (raw != NULL || range != NULL || type != TYPE_INVAL ||
level != 0 || device != NULL || record.zi_freq > 0) {
level != 0 || device != NULL || record.zi_freq > 0 ||
dvas != 0) {
(void) fprintf(stderr, "panic (-p) incompatible with "
"other options\n");
usage();
@@ -1056,6 +1150,15 @@ main(int argc, char **argv)
record.zi_type = atoi(argv[1]);
dataset[0] = '\0';
} else if (record.zi_cmd == ZINJECT_IGNORED_WRITES) {
if (raw != NULL || range != NULL || type != TYPE_INVAL ||
level != 0 || record.zi_freq > 0 || dvas != 0) {
(void) fprintf(stderr, "hardware failure (-I) "
"incompatible with other options\n");
usage();
libzfs_fini(g_zfs);
return (2);
}
if (nowrites == 0) {
(void) fprintf(stderr, "-s or -g meaningless "
"without -I (ignore writes)\n");
@@ -1109,14 +1212,44 @@ main(int argc, char **argv)
return (2);
}
if (error == ENXIO) {
if (error == ENXIO || error == EILSEQ) {
(void) fprintf(stderr, "data error type must be "
"'checksum' or 'io'\n");
libzfs_fini(g_zfs);
return (1);
}
record.zi_cmd = ZINJECT_DATA_FAULT;
if (dvas != 0) {
if (error == EACCES || error == EINVAL) {
(void) fprintf(stderr, "the '-C' option may "
"not be used with logical data errors "
"'decrypt' and 'decompress'\n");
libzfs_fini(g_zfs);
return (1);
}
record.zi_dvas = dvas;
}
if (error == EACCES) {
if (type != TYPE_DATA) {
(void) fprintf(stderr, "decryption errors "
"may only be injected for 'data' types\n");
libzfs_fini(g_zfs);
return (1);
}
record.zi_cmd = ZINJECT_DECRYPT_FAULT;
/*
* Internally, ZFS actually uses ECKSUM for decryption
* errors since EACCES is used to indicate the key was
* not found.
*/
error = ECKSUM;
} else {
record.zi_cmd = ZINJECT_DATA_FAULT;
}
if (translate_record(type, argv[0], range, level, &record, pool,
dataset) != 0) {
libzfs_fini(g_zfs);
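The `-e` handling and the compatibility checks in the hunks above amount to a name-to-errno table plus a few rules. The following sketch collects them in one place under stated assumptions: all `sketch_*` names are hypothetical, `SKETCH_ECKSUM` is a placeholder value because `ECKSUM` is defined by the ZFS headers rather than `<errno.h>`, and the checks cover only the rules visible in this diff (for example, the restriction of decryption errors to the 'data' type is omitted).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <strings.h>

/* Placeholder only: the real ECKSUM comes from the ZFS headers. */
#define	SKETCH_ECKSUM	10001

typedef struct sketch_errmap {
	const char *name;
	int err;
} sketch_errmap_t;

/* Mapping of '-e' arguments to errno values, as listed in the diff. */
static const sketch_errmap_t sketch_errors[] = {
	{ "io",		EIO },
	{ "checksum",	SKETCH_ECKSUM },
	{ "decompress",	EINVAL },
	{ "decrypt",	EACCES },
	{ "nxio",	ENXIO },
	{ "dtl",	ECHILD },
	{ "corrupt",	EILSEQ },
};

/* Case-insensitive lookup; returns -1 for an unknown name. */
static int
sketch_errno_from_name(const char *name)
{
	size_t n = sizeof (sketch_errors) / sizeof (sketch_errors[0]);

	for (size_t i = 0; i < n; i++) {
		if (strcasecmp(name, sketch_errors[i].name) == 0)
			return (sketch_errors[i].err);
	}
	return (-1);
}

/* Device (-d) rules: no checksum; 'corrupt' needs read I/O and -f. */
static int
sketch_device_error_ok(int err, int is_read, unsigned int freq)
{
	if (err == SKETCH_ECKSUM)
		return (0);
	if (err == EILSEQ && (freq == 0 || !is_read))
		return (0);
	return (1);
}

/* Data rules: no nxio/corrupt; -C excludes decrypt and decompress. */
static int
sketch_data_error_ok(int err, unsigned int dvas)
{
	if (err == ENXIO || err == EILSEQ)
		return (0);
	if (dvas != 0 && (err == EACCES || err == EINVAL))
		return (0);
	return (1);
}
```

Keeping the names in a table makes it harder for the accepted values and the usage text to drift apart, which the truncated "must be 'io', 'checksum' or" message above hints can happen with a chain of `strcasecmp()` branches.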
-1
@@ -1 +0,0 @@
/zpios
-11
@@ -1,11 +0,0 @@
include $(top_srcdir)/config/Rules.am
DEFAULT_INCLUDES += \
-I$(top_srcdir)/include
sbin_PROGRAMS = zpios
zpios_SOURCES = \
zpios_main.c \
zpios_util.c \
zpios.h
-127
@@ -1,127 +0,0 @@
/*
* ZPIOS is a heavily modified version of the original PIOS test code.
* It is designed to have the test code running in the Linux kernel
* against ZFS while still being flexibly controlled from user space.
*
* Copyright (C) 2008-2010 Lawrence Livermore National Security, LLC.
* Produced at Lawrence Livermore National Laboratory (cf, DISCLAIMER).
* Written by Brian Behlendorf <behlendorf1@llnl.gov>.
* LLNL-CODE-403049
*
* Original PIOS Test Code
* Copyright (C) 2004 Cluster File Systems, Inc.
* Written by Peter Braam <braam@clusterfs.com>
* Atul Vidwansa <atul@clusterfs.com>
* Milind Dumbare <milind@clusterfs.com>
*
* This file is part of ZFS on Linux.
* For details, see <http://zfsonlinux.org/>.
*
* ZPIOS is free software; you can redistribute it and/or modify it
* under the terms of the GNU General Public License as published by the
* Free Software Foundation; either version 2 of the License, or (at your
* option) any later version.
*
* ZPIOS is distributed in the hope that it will be useful, but WITHOUT
* ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
* FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
* for more details.
*
* You should have received a copy of the GNU General Public License along
* with ZPIOS. If not, see <http://www.gnu.org/licenses/>.
*
* Copyright (c) 2015, Intel Corporation.
*/
#ifndef _ZPIOS_H
#define _ZPIOS_H
#include <zpios-ctl.h>
#define VERSION_SIZE 64
/* Regular expressions */
#define REGEX_NUMBERS "^[0-9]+$"
#define REGEX_NUMBERS_COMMA "^([0-9]+,)*[0-9]+$"
#define REGEX_SIZE "^[0-9]+[kKmMgGtT]?$"
#define REGEX_SIZE_COMMA "^([0-9]+[kKmMgGtT]?,)*[0-9]+[kKmMgGtT]?$"
/* Flags for low, high, incr */
#define FLAG_SET 0x01
#define FLAG_LOW 0x02
#define FLAG_HIGH 0x04
#define FLAG_INCR 0x08
#define TRUE 1
#define FALSE 0
#define KB (1024)
#define MB (KB * 1024)
#define GB (MB * 1024)
#define TB (GB * 1024)
#define KMGT_SIZE 16
/*
* All offsets, sizes and counts can be passed to the application in
* multiple ways.
* 1. a value (stored in val[0], val_count will be 1)
* 2. a comma separated list of values (stored in val[], using val_count)
* 3. a range and block sizes, low, high, factor (val_count must be 0)
*/
typedef struct pios_range_repeat {
uint64_t val[32]; /* Comma sep array, or low, high, inc */
uint64_t val_count; /* Num of values */
uint64_t val_low;
uint64_t val_high;
uint64_t val_inc_perc;
uint64_t next_val; /* For multiple runs in get_next() */
} range_repeat_t;
typedef struct cmd_args {
range_repeat_t T; /* Thread count */
range_repeat_t N; /* Region count */
range_repeat_t O; /* Offset count */
range_repeat_t C; /* Chunksize */
range_repeat_t S; /* Regionsize */
range_repeat_t B; /* Blocksize */
const char *pool; /* Pool */
const char *name; /* Name */
uint32_t flags; /* Flags */
uint32_t block_size; /* ZFS block size */
uint32_t io_type; /* DMUIO only */
uint32_t verbose; /* Verbose */
uint32_t human_readable; /* Human readable output */
uint64_t regionnoise; /* Region noise */
uint64_t chunknoise; /* Chunk noise */
uint64_t thread_delay; /* Thread delay */
char pre[ZPIOS_PATH_SIZE]; /* Pre-exec hook */
char post[ZPIOS_PATH_SIZE]; /* Post-exec hook */
char log[ZPIOS_PATH_SIZE]; /* Requested log dir */
/* Control */
int current_id;
uint64_t current_T;
uint64_t current_N;
uint64_t current_C;
uint64_t current_S;
uint64_t current_O;
uint64_t current_B;
uint32_t rc;
} cmd_args_t;
int set_count(char *pattern1, char *pattern2, range_repeat_t *range,
char *optarg, uint32_t *flags, char *arg);
int set_lhi(char *pattern, range_repeat_t *range, char *optarg,
int flag, uint32_t *flag_thread, char *arg);
int set_noise(uint64_t *noise, char *optarg, char *arg);
int set_load_params(cmd_args_t *args, char *optarg);
int check_mutual_exclusive_command_lines(uint32_t flag, char *arg);
void print_stats_header(cmd_args_t *args);
void print_stats(cmd_args_t *args, zpios_cmd_t *cmd);
#endif /* _ZPIOS_H */
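The `REGEX_SIZE` pattern and the `KB`/`MB`/`GB`/`TB` macros in the removed header describe size arguments such as `128k` or `16M`. The sketch below shows one way such a string could be validated and converted; it is not the removed `zpios_util.c` code. `sketch_parse_size` and `SKETCH_REGEX_SIZE` are hypothetical names, the POSIX `<regex.h>` interface is assumed to be available, and overflow of very large inputs is not checked.

```c
#include <assert.h>
#include <regex.h>
#include <stdint.h>
#include <stdlib.h>

/* Same pattern as REGEX_SIZE in the removed header. */
#define	SKETCH_REGEX_SIZE	"^[0-9]+[kKmMgGtT]?$"

/*
 * Validate a size string against the pattern and convert it to bytes.
 * Returns 0 on success and -1 if the string does not match.
 */
static int
sketch_parse_size(const char *str, uint64_t *out)
{
	regex_t re;
	char *end = NULL;
	uint64_t val;
	int shift = 0;
	int match;

	if (regcomp(&re, SKETCH_REGEX_SIZE, REG_EXTENDED | REG_NOSUB) != 0)
		return (-1);
	match = regexec(&re, str, 0, NULL, 0);
	regfree(&re);
	if (match != 0)
		return (-1);

	/* The regex guarantees digits followed by at most one suffix. */
	val = strtoull(str, &end, 10);
	switch (*end) {
	case 'k': case 'K': shift = 10; break;
	case 'm': case 'M': shift = 20; break;
	case 'g': case 'G': shift = 30; break;
	case 't': case 'T': shift = 40; break;
	default: break;
	}

	*out = val << shift;
	return (0);
}
```

The sketch uses 64-bit shifts instead of the header's macros because `TB` is written as `(GB * 1024)` in plain `int` arithmetic, which would overflow on platforms where `int` is 32 bits wide.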
-681
@@ -1,681 +0,0 @@
/*
* ZPIOS is a heavily modified version of the original PIOS test code.
* It is designed to have the test code running in the Linux kernel
* against ZFS while still being flexibly controlled from user space.
*
* Copyright (C) 2008-2010 Lawrence Livermore National Security, LLC.
* Produced at Lawrence Livermore National Laboratory (cf, DISCLAIMER).
* Written by Brian Behlendorf <behlendorf1@llnl.gov>.
* LLNL-CODE-403049
*
* Original PIOS Test Code
* Copyright (C) 2004 Cluster File Systems, Inc.
* Written by Peter Braam <braam@clusterfs.com>
* Atul Vidwansa <atul@clusterfs.com>
* Milind Dumbare <milind@clusterfs.com>
*
* This file is part of ZFS on Linux.
* For details, see <http://zfsonlinux.org/>.
*
* ZPIOS is free software; you can redistribute it and/or modify it
* under the terms of the GNU General Public License as published by the
* Free Software Foundation; either version 2 of the License, or (at your
* option) any later version.
*
* ZPIOS is distributed in the hope that it will be useful, but WITHOUT
* ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
* FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
* for more details.
*
* You should have received a copy of the GNU General Public License along
* with ZPIOS. If not, see <http://www.gnu.org/licenses/>.
*
* Copyright (c) 2015, Intel Corporation.
*/
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <getopt.h>
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include "zpios.h"
static const char short_opt[] =
"t:l:h:e:n:i:j:k:o:m:q:r:c:a:b:g:s:A:B:C:"
"S:L:p:M:xP:R:G:I:N:T:VzOfHv?";
static const struct option long_opt[] = {
{"threadcount", required_argument, 0, 't' },
{"threadcount_low", required_argument, 0, 'l' },
{"threadcount_high", required_argument, 0, 'h' },
{"threadcount_incr", required_argument, 0, 'e' },
{"regioncount", required_argument, 0, 'n' },
{"regioncount_low", required_argument, 0, 'i' },
{"regioncount_high", required_argument, 0, 'j' },
{"regioncount_incr", required_argument, 0, 'k' },
{"offset", required_argument, 0, 'o' },
{"offset_low", required_argument, 0, 'm' },
{"offset_high", required_argument, 0, 'q' },
{"offset_incr", required_argument, 0, 'r' },
{"chunksize", required_argument, 0, 'c' },
{"chunksize_low", required_argument, 0, 'a' },
{"chunksize_high", required_argument, 0, 'b' },
{"chunksize_incr", required_argument, 0, 'g' },
{"regionsize", required_argument, 0, 's' },
{"regionsize_low", required_argument, 0, 'A' },
{"regionsize_high", required_argument, 0, 'B' },
{"regionsize_incr", required_argument, 0, 'C' },
{"blocksize", required_argument, 0, 'S' },
{"load", required_argument, 0, 'L' },
{"pool", required_argument, 0, 'p' },
{"name", required_argument, 0, 'M' },
{"cleanup", no_argument, 0, 'x' },
{"prerun", required_argument, 0, 'P' },
{"postrun", required_argument, 0, 'R' },
{"log", required_argument, 0, 'G' },
{"regionnoise", required_argument, 0, 'I' },
{"chunknoise", required_argument, 0, 'N' },
{"threaddelay", required_argument, 0, 'T' },
{"verify", no_argument, 0, 'V' },
{"zerocopy", no_argument, 0, 'z' },
{"nowait", no_argument, 0, 'O' },
{"noprefetch", no_argument, 0, 'f' },
{"human-readable", no_argument, 0, 'H' },
{"verbose", no_argument, 0, 'v' },
{"help", no_argument, 0, '?' },
{ 0, 0, 0, 0 },
};
static int zpiosctl_fd; /* Control file descriptor */
static char zpios_version[VERSION_SIZE]; /* Kernel version string */
static char *zpios_buffer = NULL; /* Scratch space area */
static int zpios_buffer_size = 0; /* Scratch space size */
static int
usage(void)
{
fprintf(stderr, "Usage: zpios\n");
fprintf(stderr,
" --threadcount -t =values\n"
" --threadcount_low -l =value\n"
" --threadcount_high -h =value\n"
" --threadcount_incr -e =value\n"
" --regioncount -n =values\n"
" --regioncount_low -i =value\n"
" --regioncount_high -j =value\n"
" --regioncount_incr -k =value\n"
" --offset -o =values\n"
" --offset_low -m =value\n"
" --offset_high -q =value\n"
" --offset_incr -r =value\n"
" --chunksize -c =values\n"
" --chunksize_low -a =value\n"
" --chunksize_high -b =value\n"
" --chunksize_incr -g =value\n"
" --regionsize -s =values\n"
" --regionsize_low -A =value\n"
" --regionsize_high -B =value\n"
" --regionsize_incr -C =value\n"
" --blocksize -S =values\n"
" --load -L =dmuio|ssf|fpp\n"
" --pool -p =pool name\n"
" --name -M =test name\n"
" --cleanup -x\n"
" --prerun -P =pre-command\n"
" --postrun -R =post-command\n"
" --log -G =log directory\n"
" --regionnoise -I =shift\n"
" --chunknoise -N =bytes\n"
" --threaddelay -T =jiffies\n"
" --verify -V\n"
" --zerocopy -z\n"
" --nowait -O\n"
" --noprefetch -f\n"
" --human-readable -H\n"
" --verbose -v =increase verbosity\n"
" --help -? =this help\n\n");
return (0);
}
static void args_fini(cmd_args_t *args)
{
assert(args != NULL);
free(args);
}
/* block size is 128K to 16M, power of 2 */
#define MIN_BLKSIZE (128ULL << 10)
#define MAX_BLKSIZE (16ULL << 20)
#define POW_OF_TWO(x) (((x) & ((x) - 1)) == 0)
static cmd_args_t *
args_init(int argc, char **argv)
{
cmd_args_t *args;
uint32_t fl_th = 0;
uint32_t fl_rc = 0;
uint32_t fl_of = 0;
uint32_t fl_rs = 0;
uint32_t fl_cs = 0;
uint32_t fl_bs = 0;
int c, rc, i;
if (argc == 1) {
usage();
return ((cmd_args_t *)NULL);
}
/* Configure and populate the args structures */
args = malloc(sizeof (*args));
if (args == NULL)
return (NULL);
memset(args, 0, sizeof (*args));
/* provide a default block size of 128K */
args->B.next_val = 0;
args->B.val[0] = MIN_BLKSIZE;
args->B.val_count = 1;
while ((c = getopt_long(argc, argv, short_opt, long_opt, NULL)) != -1) {
rc = 0;
switch (c) {
case 't': /* --thread count */
rc = set_count(REGEX_NUMBERS, REGEX_NUMBERS_COMMA,
&args->T, optarg, &fl_th, "threadcount");
break;
case 'l': /* --threadcount_low */
rc = set_lhi(REGEX_NUMBERS, &args->T, optarg,
FLAG_LOW, &fl_th, "threadcount_low");
break;
case 'h': /* --threadcount_high */
rc = set_lhi(REGEX_NUMBERS, &args->T, optarg,
FLAG_HIGH, &fl_th, "threadcount_high");
break;
case 'e': /* --threadcount_inc */
rc = set_lhi(REGEX_NUMBERS, &args->T, optarg,
FLAG_INCR, &fl_th, "threadcount_incr");
break;
case 'n': /* --regioncount */
rc = set_count(REGEX_NUMBERS, REGEX_NUMBERS_COMMA,
&args->N, optarg, &fl_rc, "regioncount");
break;
case 'i': /* --regioncount_low */
rc = set_lhi(REGEX_NUMBERS, &args->N, optarg,
FLAG_LOW, &fl_rc, "regioncount_low");
break;
case 'j': /* --regioncount_high */
rc = set_lhi(REGEX_NUMBERS, &args->N, optarg,
FLAG_HIGH, &fl_rc, "regioncount_high");
break;
case 'k': /* --regioncount_inc */
rc = set_lhi(REGEX_NUMBERS, &args->N, optarg,
FLAG_INCR, &fl_rc, "regioncount_incr");
break;
case 'o': /* --offset */
rc = set_count(REGEX_SIZE, REGEX_SIZE_COMMA,
&args->O, optarg, &fl_of, "offset");
break;
case 'm': /* --offset_low */
rc = set_lhi(REGEX_SIZE, &args->O, optarg,
FLAG_LOW, &fl_of, "offset_low");
break;
case 'q': /* --offset_high */
rc = set_lhi(REGEX_SIZE, &args->O, optarg,
FLAG_HIGH, &fl_of, "offset_high");
break;
case 'r': /* --offset_inc */
rc = set_lhi(REGEX_NUMBERS, &args->O, optarg,
FLAG_INCR, &fl_of, "offset_incr");
break;
case 'c': /* --chunksize */
rc = set_count(REGEX_SIZE, REGEX_SIZE_COMMA,
&args->C, optarg, &fl_cs, "chunksize");
break;
case 'a': /* --chunksize_low */
rc = set_lhi(REGEX_SIZE, &args->C, optarg,
FLAG_LOW, &fl_cs, "chunksize_low");
break;
case 'b': /* --chunksize_high */
rc = set_lhi(REGEX_SIZE, &args->C, optarg,
FLAG_HIGH, &fl_cs, "chunksize_high");
break;
case 'g': /* --chunksize_inc */
rc = set_lhi(REGEX_NUMBERS, &args->C, optarg,
FLAG_INCR, &fl_cs, "chunksize_incr");
break;
case 's': /* --regionsize */
rc = set_count(REGEX_SIZE, REGEX_SIZE_COMMA,
&args->S, optarg, &fl_rs, "regionsize");
break;
case 'A': /* --regionsize_low */
rc = set_lhi(REGEX_SIZE, &args->S, optarg,
FLAG_LOW, &fl_rs, "regionsize_low");
break;
case 'B': /* --regionsize_high */
rc = set_lhi(REGEX_SIZE, &args->S, optarg,
FLAG_HIGH, &fl_rs, "regionsize_high");
break;
case 'C': /* --regionsize_inc */
rc = set_lhi(REGEX_NUMBERS, &args->S, optarg,
FLAG_INCR, &fl_rs, "regionsize_incr");
break;
case 'S': /* --blocksize */
rc = set_count(REGEX_SIZE, REGEX_SIZE_COMMA,
&args->B, optarg, &fl_bs, "blocksize");
break;
case 'L': /* --load */
rc = set_load_params(args, optarg);
break;
case 'p': /* --pool */
args->pool = optarg;
break;
case 'M':
args->name = optarg;
break;
case 'x': /* --cleanup */
args->flags |= DMU_REMOVE;
break;
case 'P': /* --prerun */
strncpy(args->pre, optarg, ZPIOS_PATH_SIZE - 1);
break;
case 'R': /* --postrun */
strncpy(args->post, optarg, ZPIOS_PATH_SIZE - 1);
break;
case 'G': /* --log */
strncpy(args->log, optarg, ZPIOS_PATH_SIZE - 1);
break;
case 'I': /* --regionnoise */
rc = set_noise(&args->regionnoise, optarg,
"regionnoise");
break;
case 'N': /* --chunknoise */
rc = set_noise(&args->chunknoise, optarg, "chunknoise");
break;
case 'T': /* --threaddelay */
rc = set_noise(&args->thread_delay, optarg,
"threaddelay");
break;
case 'V': /* --verify */
args->flags |= DMU_VERIFY;
break;
case 'z': /* --zerocopy */
args->flags |= (DMU_WRITE_ZC | DMU_READ_ZC);
break;
case 'O': /* --nowait */
args->flags |= DMU_WRITE_NOWAIT;
break;
case 'f': /* --noprefetch */
args->flags |= DMU_READ_NOPF;
break;
case 'H': /* --human-readable */
args->human_readable = 1;
break;
case 'v': /* --verbose */
args->verbose++;
break;
case '?':
rc = 1;
break;
default:
fprintf(stderr, "Unknown option '%s'\n",
argv[optind - 1]);
rc = EINVAL;
break;
}
if (rc) {
usage();
args_fini(args);
return (NULL);
}
}
check_mutual_exclusive_command_lines(fl_th, "threadcount");
check_mutual_exclusive_command_lines(fl_rc, "regioncount");
check_mutual_exclusive_command_lines(fl_of, "offset");
check_mutual_exclusive_command_lines(fl_rs, "regionsize");
check_mutual_exclusive_command_lines(fl_cs, "chunksize");
if (args->pool == NULL) {
fprintf(stderr, "Error: Pool not specified\n");
usage();
args_fini(args);
return (NULL);
}
if ((args->flags & (DMU_WRITE_ZC | DMU_READ_ZC)) &&
(args->flags & DMU_VERIFY)) {
fprintf(stderr, "Error: --zerocopy is incompatible with "
"--verify; it is for performance analysis only\n");
usage();
args_fini(args);
return (NULL);
}
/* validate block size(s) */
for (i = 0; i < args->B.val_count; i++) {
int bs = args->B.val[i];
if (bs < MIN_BLKSIZE || bs > MAX_BLKSIZE || !POW_OF_TWO(bs)) {
fprintf(stderr, "Error: invalid block size %d\n", bs);
args_fini(args);
return (NULL);
}
}
return (args);
}
static int
dev_clear(void)
{
zpios_cfg_t cfg;
int rc;
memset(&cfg, 0, sizeof (cfg));
cfg.cfg_magic = ZPIOS_CFG_MAGIC;
cfg.cfg_cmd = ZPIOS_CFG_BUFFER_CLEAR;
cfg.cfg_arg1 = 0;
rc = ioctl(zpiosctl_fd, ZPIOS_CFG, &cfg);
if (rc)
fprintf(stderr, "Ioctl() error %lu / %d: %d\n",
(unsigned long) ZPIOS_CFG, cfg.cfg_cmd, errno);
(void) lseek(zpiosctl_fd, 0, SEEK_SET);
return (rc);
}
/* Passing a size of zero simply results in querying the current size */
static int
dev_size(int size)
{
zpios_cfg_t cfg;
int rc;
memset(&cfg, 0, sizeof (cfg));
cfg.cfg_magic = ZPIOS_CFG_MAGIC;
cfg.cfg_cmd = ZPIOS_CFG_BUFFER_SIZE;
cfg.cfg_arg1 = size;
rc = ioctl(zpiosctl_fd, ZPIOS_CFG, &cfg);
if (rc) {
fprintf(stderr, "Ioctl() error %lu / %d: %d\n",
(unsigned long) ZPIOS_CFG, cfg.cfg_cmd, errno);
return (rc);
}
return (cfg.cfg_rc1);
}
static void
dev_fini(void)
{
if (zpios_buffer)
free(zpios_buffer);
if (zpiosctl_fd != -1) {
if (close(zpiosctl_fd) == -1) {
fprintf(stderr, "Unable to close %s: %d\n",
ZPIOS_DEV, errno);
}
}
}
static int
dev_init(void)
{
int rc;
zpiosctl_fd = open(ZPIOS_DEV, O_RDONLY);
if (zpiosctl_fd == -1) {
fprintf(stderr, "Unable to open %s: %d\n"
"Is the zpios module loaded?\n", ZPIOS_DEV, errno);
rc = errno;
goto error;
}
if ((rc = dev_clear()))
goto error;
if ((rc = dev_size(0)) < 0)
goto error;
zpios_buffer_size = rc;
zpios_buffer = (char *)malloc(zpios_buffer_size);
if (zpios_buffer == NULL) {
rc = ENOMEM;
goto error;
}
memset(zpios_buffer, 0, zpios_buffer_size);
return (0);
error:
if (zpiosctl_fd != -1) {
if (close(zpiosctl_fd) == -1) {
fprintf(stderr, "Unable to close %s: %d\n",
ZPIOS_DEV, errno);
}
}
return (rc);
}
static int
get_next(uint64_t *val, range_repeat_t *range)
{
/* if low, incr, high is given */
if (range->val_count == 0) {
*val = (range->val_low) +
(range->val_low * range->next_val / 100);
if (*val > range->val_high)
return (0); /* No more values, limit exceeded */
if (!range->next_val)
range->next_val = range->val_inc_perc;
else
range->next_val = range->next_val + range->val_inc_perc;
return (1); /* more values to come */
/* if only one val is given */
} else if (range->val_count == 1) {
if (range->next_val)
return (0); /* No more values, we only have one */
*val = range->val[0];
range->next_val = 1;
return (1); /* more values to come */
/* if comma separated values are given */
} else if (range->val_count > 1) {
if (range->next_val > range->val_count - 1)
return (0); /* No more values, limit exceeded */
*val = range->val[range->next_val];
range->next_val++;
return (1); /* more values to come */
}
return (0);
}
static int
run_one(cmd_args_t *args, uint32_t id, uint32_t T, uint32_t N,
uint64_t C, uint64_t S, uint64_t O, uint64_t B)
{
zpios_cmd_t *cmd;
int rc, rc2, cmd_size;
dev_clear();
cmd_size = sizeof (zpios_cmd_t) +
((T + N + 1) * sizeof (zpios_stats_t));
cmd = (zpios_cmd_t *)malloc(cmd_size);
if (cmd == NULL)
return (ENOMEM);
memset(cmd, 0, cmd_size);
cmd->cmd_magic = ZPIOS_CMD_MAGIC;
snprintf(cmd->cmd_pool, sizeof (cmd->cmd_pool), "%s", args->pool);
snprintf(cmd->cmd_pre, sizeof (cmd->cmd_pre), "%s", args->pre);
snprintf(cmd->cmd_post, sizeof (cmd->cmd_post), "%s", args->post);
snprintf(cmd->cmd_log, sizeof (cmd->cmd_log), "%s", args->log);
cmd->cmd_id = id;
cmd->cmd_chunk_size = C;
cmd->cmd_thread_count = T;
cmd->cmd_region_count = N;
cmd->cmd_region_size = S;
cmd->cmd_offset = O;
cmd->cmd_block_size = B;
cmd->cmd_region_noise = args->regionnoise;
cmd->cmd_chunk_noise = args->chunknoise;
cmd->cmd_thread_delay = args->thread_delay;
cmd->cmd_flags = args->flags;
cmd->cmd_data_size = (T + N + 1) * sizeof (zpios_stats_t);
rc = ioctl(zpiosctl_fd, ZPIOS_CMD, cmd);
if (rc)
args->rc = errno;
print_stats(args, cmd);
if (args->verbose) {
rc2 = read(zpiosctl_fd, zpios_buffer, zpios_buffer_size);
zpios_buffer[zpios_buffer_size - 1] = '\0';
if (rc2 < 0) {
fprintf(stdout, "Error reading results: %d\n", rc2);
} else if ((rc2 > 0) && (strlen(zpios_buffer) > 0)) {
fprintf(stdout, "\n%s\n", zpios_buffer);
fflush(stdout);
}
}
free(cmd);
return (rc);
}
static int
run_offsets(cmd_args_t *args)
{
int rc = 0;
while (rc == 0 && get_next(&args->current_O, &args->O)) {
rc = run_one(args, args->current_id,
args->current_T, args->current_N, args->current_C,
args->current_S, args->current_O, args->current_B);
args->current_id++;
}
args->O.next_val = 0;
return (rc);
}
static int
run_region_counts(cmd_args_t *args)
{
int rc = 0;
while (rc == 0 && get_next((uint64_t *)&args->current_N, &args->N))
rc = run_offsets(args);
args->N.next_val = 0;
return (rc);
}
static int
run_region_sizes(cmd_args_t *args)
{
int rc = 0;
while (rc == 0 && get_next(&args->current_S, &args->S)) {
if (args->current_S < args->current_C) {
fprintf(stderr, "Error: in any run chunksize must "
"not be larger than regionsize.\n");
return (EINVAL);
}
rc = run_region_counts(args);
}
args->S.next_val = 0;
return (rc);
}
static int
run_chunk_sizes(cmd_args_t *args)
{
int rc = 0;
while (rc == 0 && get_next(&args->current_C, &args->C)) {
rc = run_region_sizes(args);
}
args->C.next_val = 0;
return (rc);
}
static int
run_block_sizes(cmd_args_t *args)
{
int rc = 0;
while (rc == 0 && get_next(&args->current_B, &args->B)) {
rc = run_chunk_sizes(args);
}
args->B.next_val = 0;
return (rc);
}
static int
run_thread_counts(cmd_args_t *args)
{
int rc = 0;
while (rc == 0 && get_next((uint64_t *)&args->current_T, &args->T))
rc = run_block_sizes(args);
return (rc);
}
int
main(int argc, char **argv)
{
cmd_args_t *args;
int rc = 0;
/* Argument init and parsing */
if ((args = args_init(argc, argv)) == NULL) {
rc = -1;
goto out;
}
/* Device specific init */
if ((rc = dev_init()))
goto out;
/* Generic kernel version string */
if (args->verbose)
fprintf(stdout, "%s", zpios_version);
print_stats_header(args);
rc = run_thread_counts(args);
out:
if (args != NULL)
args_fini(args);
dev_fini();
return (rc);
}
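The low/high/increment branch of get_next() above steps a value upward from the low bound in fixed percentages of that bound until the high bound is exceeded. The following is a minimal sketch of that stepping rule only, not the tool's implementation: the pct_range_t type and pct_range_next() name are hypothetical, and unlike get_next() it leaves *val untouched once the range is exhausted.

```c
#include <stdint.h>

/* Hypothetical stand-in for the low/high/incr fields of range_repeat_t. */
typedef struct pct_range {
	uint64_t low;		/* first value produced */
	uint64_t high;		/* inclusive upper bound */
	uint64_t inc_perc;	/* step, as a percentage of low */
	uint64_t next;		/* accumulated percentage, starts at 0 */
} pct_range_t;

/*
 * Store the next value in *val and return 1, or return 0 once the
 * next value would exceed the upper bound. An inc_perc of 0 never
 * advances, so callers are assumed to reject it beforehand.
 */
static int
pct_range_next(pct_range_t *r, uint64_t *val)
{
	uint64_t v = r->low + (r->low * r->next / 100);

	if (v > r->high)
		return (0);
	*val = v;
	r->next += r->inc_perc;
	return (1);
}
```

With low = 100, high = 200 and inc_perc = 50 the sketch yields 100, 150 and 200, then stops because the next candidate (250) is above the bound.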
-476
@@ -1,476 +0,0 @@
/*
* ZPIOS is a heavily modified version of the original PIOS test code.
* It is designed to have the test code running in the Linux kernel
* against ZFS while still being flexibly controlled from user space.
*
* Copyright (C) 2008-2010 Lawrence Livermore National Security, LLC.
* Produced at Lawrence Livermore National Laboratory (cf, DISCLAIMER).
* Written by Brian Behlendorf <behlendorf1@llnl.gov>.
* LLNL-CODE-403049
*
* Original PIOS Test Code
* Copyright (C) 2004 Cluster File Systems, Inc.
* Written by Peter Braam <braam@clusterfs.com>
* Atul Vidwansa <atul@clusterfs.com>
* Milind Dumbare <milind@clusterfs.com>
*
* This file is part of ZFS on Linux.
* For details, see <http://zfsonlinux.org/>.
*
* ZPIOS is free software; you can redistribute it and/or modify it
* under the terms of the GNU General Public License as published by the
* Free Software Foundation; either version 2 of the License, or (at your
* option) any later version.
*
* ZPIOS is distributed in the hope that it will be useful, but WITHOUT
* ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
* FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
* for more details.
*
* You should have received a copy of the GNU General Public License along
* with ZPIOS. If not, see <http://www.gnu.org/licenses/>.
*
* Copyright (c) 2015, Intel Corporation.
*/
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <assert.h>
#include <regex.h>
#include "zpios.h"
/* extracts an unsigned int (64) and K,M,G,T from the string */
/* and returns a 64 bit value converted to the proper units */
static int
kmgt_to_uint64(const char *str, uint64_t *val)
{
char *endptr;
int rc = 0;
*val = strtoll(str, &endptr, 0);
if ((str == endptr) && (*val == 0))
return (EINVAL);
switch (endptr[0]) {
case 'k': case 'K':
*val = (*val) << 10;
break;
case 'm': case 'M':
*val = (*val) << 20;
break;
case 'g': case 'G':
*val = (*val) << 30;
break;
case 't': case 'T':
*val = (*val) << 40;
break;
case '\0':
break;
default:
rc = EINVAL;
}
return (rc);
}
static char *
uint64_to_kmgt(char *str, uint64_t val)
{
char postfix[] = "kmgt";
int i = -1;
while ((val >= KB) && (i < 4)) {
val = (val >> 10);
i++;
}
if (i >= 4)
(void) snprintf(str, KMGT_SIZE-1, "inf");
else
(void) snprintf(str, KMGT_SIZE-1, "%lu%c", (unsigned long)val,
(i == -1) ? '\0' : postfix[i]);
return (str);
}
static char *
kmgt_per_sec(char *str, uint64_t v, double t)
{
char postfix[] = "kmgt";
double val = ((double)v) / t;
int i = -1;
while ((val >= (double)KB) && (i < 4)) {
val /= (double)KB;
i++;
}
if (i >= 4)
(void) snprintf(str, KMGT_SIZE-1, "inf");
else
(void) snprintf(str, KMGT_SIZE-1, "%.2f%c", val,
(i == -1) ? '\0' : postfix[i]);
return (str);
}
static char *
print_flags(char *str, uint32_t flags)
{
str[0] = (flags & DMU_WRITE) ? 'w' : '-';
str[1] = (flags & DMU_READ) ? 'r' : '-';
str[2] = (flags & DMU_VERIFY) ? 'v' : '-';
str[3] = (flags & DMU_REMOVE) ? 'c' : '-';
str[4] = (flags & DMU_FPP) ? 'p' : 's';
str[5] = (flags & (DMU_WRITE_ZC | DMU_READ_ZC)) ? 'z' : '-';
str[6] = (flags & DMU_WRITE_NOWAIT) ? 'O' : '-';
str[7] = '\0';
return (str);
}
static int
regex_match(const char *string, char *pattern)
{
regex_t re = { 0 };
int rc;
rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB | REG_ICASE);
if (rc) {
fprintf(stderr, "Error: Couldn't do regcomp, %d\n", rc);
return (rc);
}
rc = regexec(&re, string, (size_t)0, NULL, 0);
regfree(&re);
return (rc);
}
/* fills the pios_range_repeat structure of comma separated values */
static int
split_string(const char *optarg, char *pattern, range_repeat_t *range)
{
const char comma[] = ",";
char *cp, *token[32];
int rc, i = 0;
if ((rc = regex_match(optarg, pattern)))
return (rc);
cp = strdup(optarg);
if (cp == NULL)
return (ENOMEM);
do {
/*
* STRTOK(3): each subsequent call, with a null pointer as the
* value of the first argument, starts searching from the
* saved pointer and behaves as described above.
*/
if (i == 0) {
token[i] = strtok(cp, comma);
} else {
token[i] = strtok(NULL, comma);
}
} while ((token[i++] != NULL) && (i < 32));
range->val_count = i - 1;
for (i = 0; i < range->val_count; i++)
kmgt_to_uint64(token[i], &range->val[i]);
free(cp);
return (0);
}
int
set_count(char *pattern1, char *pattern2, range_repeat_t *range,
char *optarg, uint32_t *flags, char *arg)
{
uint64_t count = range->val_count;
if (flags)
*flags |= FLAG_SET;
range->next_val = 0;
if (regex_match(optarg, pattern1) == 0) {
kmgt_to_uint64(optarg, &range->val[0]);
range->val_count = 1;
} else if (split_string(optarg, pattern2, range) != 0) {
fprintf(stderr, "Error: Incorrect pattern for %s, '%s'\n",
arg, optarg);
return (EINVAL);
} else if (count == range->val_count) {
fprintf(stderr, "Error: input ignored for %s, '%s'\n",
arg, optarg);
}
return (0);
}
/*
* Validates the value with regular expression and sets low, high, incr
* according to value at which flag will be set. Sets the flag after.
*/
int
set_lhi(char *pattern, range_repeat_t *range, char *optarg,
int flag, uint32_t *flag_thread, char *arg)
{
int rc;
if ((rc = regex_match(optarg, pattern))) {
fprintf(stderr, "Error: Wrong pattern in %s, '%s'\n",
arg, optarg);
return (rc);
}
switch (flag) {
case FLAG_LOW:
kmgt_to_uint64(optarg, &range->val_low);
break;
case FLAG_HIGH:
kmgt_to_uint64(optarg, &range->val_high);
break;
case FLAG_INCR:
kmgt_to_uint64(optarg, &range->val_inc_perc);
break;
default:
assert(0);
}
*flag_thread |= flag;
return (0);
}
int
set_noise(uint64_t *noise, char *optarg, char *arg)
{
if (regex_match(optarg, REGEX_NUMBERS) == 0) {
kmgt_to_uint64(optarg, noise);
} else {
fprintf(stderr, "Error: Incorrect pattern for %s\n", arg);
return (EINVAL);
}
return (0);
}
int
set_load_params(cmd_args_t *args, char *optarg)
{
char *param, *search, *searchdup, comma[] = ",";
int rc = 0;
search = strdup(optarg);
if (search == NULL)
return (ENOMEM);
searchdup = search;
while ((param = strtok(search, comma)) != NULL) {
search = NULL;
if (strcmp("fpp", param) == 0) {
args->flags |= DMU_FPP; /* File Per Process/Thread */
} else if (strcmp("ssf", param) == 0) {
args->flags &= ~DMU_FPP; /* Single Shared File */
} else if (strcmp("dmuio", param) == 0) {
args->io_type |= DMU_IO;
args->flags |= (DMU_WRITE | DMU_READ);
} else {
fprintf(stderr, "Invalid load: %s\n", param);
rc = EINVAL;
}
}
free(searchdup);
return (rc);
}
/*
* Checks the low, high, increment values against the single value for
* mutual exclusion, for e.g threadcount is mutually exclusive to
* threadcount_low, ..._high, ..._incr
*/
int
check_mutual_exclusive_command_lines(uint32_t flag, char *arg)
{
if ((flag & FLAG_SET) && (flag & (FLAG_LOW | FLAG_HIGH | FLAG_INCR))) {
fprintf(stderr, "Error: --%s can not be given with --%s_low, "
"--%s_high or --%s_incr.\n", arg, arg, arg, arg);
return (0);
}
if ((flag & (FLAG_LOW | FLAG_HIGH | FLAG_INCR)) && !(flag & FLAG_SET)) {
if (flag != (FLAG_LOW | FLAG_HIGH | FLAG_INCR)) {
fprintf(stderr, "Error: One or more values missing "
"from --%s_low, --%s_high, --%s_incr.\n",
arg, arg, arg);
return (0);
}
}
return (1);
}
void
print_stats_header(cmd_args_t *args)
{
if (args->verbose) {
printf(
"status name id\tth-cnt\trg-cnt\trg-sz\t"
"ch-sz\toffset\trg-no\tch-no\tth-dly\tflags\tblksz\ttime\t"
"cr-time\trm-time\twr-time\trd-time\twr-data\twr-ch\t"
"wr-bw\trd-data\trd-ch\trd-bw\n");
printf(
"-------------------------------------------------"
"-------------------------------------------------"
"-------------------------------------------------"
"--------------------------------------------------\n");
} else {
printf(
"status name id\t"
"wr-data\twr-ch\twr-bw\t"
"rd-data\trd-ch\trd-bw\n");
printf(
"-----------------------------------------"
"--------------------------------------\n");
}
}
static void
print_stats_human_readable(cmd_args_t *args, zpios_cmd_t *cmd)
{
zpios_stats_t *summary_stats;
double t_time, wr_time, rd_time, cr_time, rm_time;
char str[KMGT_SIZE];
if (args->rc)
printf("FAIL: %3d ", args->rc);
else
printf("PASS: ");
printf("%-12s", args->name ? args->name : ZPIOS_NAME);
printf("%2u\t", cmd->cmd_id);
if (args->verbose) {
printf("%u\t", cmd->cmd_thread_count);
printf("%u\t", cmd->cmd_region_count);
printf("%s\t", uint64_to_kmgt(str, cmd->cmd_region_size));
printf("%s\t", uint64_to_kmgt(str, cmd->cmd_chunk_size));
printf("%s\t", uint64_to_kmgt(str, cmd->cmd_offset));
printf("%s\t", uint64_to_kmgt(str, cmd->cmd_region_noise));
printf("%s\t", uint64_to_kmgt(str, cmd->cmd_chunk_noise));
printf("%s\t", uint64_to_kmgt(str, cmd->cmd_thread_delay));
printf("%s\t", print_flags(str, cmd->cmd_flags));
printf("%s\t", uint64_to_kmgt(str, cmd->cmd_block_size));
}
if (args->rc) {
printf("\n");
return;
}
summary_stats = (zpios_stats_t *)cmd->cmd_data_str;
t_time = zpios_timespec_to_double(summary_stats->total_time.delta);
wr_time = zpios_timespec_to_double(summary_stats->wr_time.delta);
rd_time = zpios_timespec_to_double(summary_stats->rd_time.delta);
cr_time = zpios_timespec_to_double(summary_stats->cr_time.delta);
rm_time = zpios_timespec_to_double(summary_stats->rm_time.delta);
if (args->verbose) {
printf("%.2f\t", t_time);
printf("%.3f\t", cr_time);
printf("%.3f\t", rm_time);
printf("%.2f\t", wr_time);
printf("%.2f\t", rd_time);
}
printf("%s\t", uint64_to_kmgt(str, summary_stats->wr_data));
printf("%s\t", uint64_to_kmgt(str, summary_stats->wr_chunks));
printf("%s\t", kmgt_per_sec(str, summary_stats->wr_data, wr_time));
printf("%s\t", uint64_to_kmgt(str, summary_stats->rd_data));
printf("%s\t", uint64_to_kmgt(str, summary_stats->rd_chunks));
printf("%s\n", kmgt_per_sec(str, summary_stats->rd_data, rd_time));
fflush(stdout);
}
static void
print_stats_table(cmd_args_t *args, zpios_cmd_t *cmd)
{
zpios_stats_t *summary_stats;
double wr_time, rd_time;
if (args->rc)
printf("FAIL: %3d ", args->rc);
else
printf("PASS: ");
printf("%-12s", args->name ? args->name : ZPIOS_NAME);
printf("%2u\t", cmd->cmd_id);
if (args->verbose) {
printf("%u\t", cmd->cmd_thread_count);
printf("%u\t", cmd->cmd_region_count);
printf("%llu\t", (long long unsigned)cmd->cmd_region_size);
printf("%llu\t", (long long unsigned)cmd->cmd_chunk_size);
printf("%llu\t", (long long unsigned)cmd->cmd_offset);
printf("%u\t", cmd->cmd_region_noise);
printf("%u\t", cmd->cmd_chunk_noise);
printf("%u\t", cmd->cmd_thread_delay);
printf("0x%x\t", cmd->cmd_flags);
printf("%u\t", cmd->cmd_block_size);
}
if (args->rc) {
printf("\n");
return;
}
summary_stats = (zpios_stats_t *)cmd->cmd_data_str;
wr_time = zpios_timespec_to_double(summary_stats->wr_time.delta);
rd_time = zpios_timespec_to_double(summary_stats->rd_time.delta);
if (args->verbose) {
printf("%ld.%02ld\t",
(long)summary_stats->total_time.delta.ts_sec,
(long)summary_stats->total_time.delta.ts_nsec);
printf("%ld.%02ld\t",
(long)summary_stats->cr_time.delta.ts_sec,
(long)summary_stats->cr_time.delta.ts_nsec);
printf("%ld.%02ld\t",
(long)summary_stats->rm_time.delta.ts_sec,
(long)summary_stats->rm_time.delta.ts_nsec);
printf("%ld.%02ld\t",
(long)summary_stats->wr_time.delta.ts_sec,
(long)summary_stats->wr_time.delta.ts_nsec);
printf("%ld.%02ld\t",
(long)summary_stats->rd_time.delta.ts_sec,
(long)summary_stats->rd_time.delta.ts_nsec);
}
printf("%llu\t", (long long unsigned)summary_stats->wr_data);
printf("%llu\t", (long long unsigned)summary_stats->wr_chunks);
printf("%.4f\t", (double)summary_stats->wr_data / wr_time);
printf("%llu\t", (long long unsigned)summary_stats->rd_data);
printf("%llu\t", (long long unsigned)summary_stats->rd_chunks);
printf("%.4f\n", (double)summary_stats->rd_data / rd_time);
fflush(stdout);
}
void
print_stats(cmd_args_t *args, zpios_cmd_t *cmd)
{
if (args->human_readable)
print_stats_human_readable(args, cmd);
else
print_stats_table(args, cmd);
}
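kmgt_to_uint64() in the removed file above converts a number with an optional K, M, G or T suffix into a byte count by shifting left 10, 20, 30 or 40 bits. The sketch below shows the same suffix rule with a few extra checks; it is an illustration under stated assumptions, not the original code. The name parse_size() is hypothetical, and the rejection of negative input, trailing characters after the suffix, and overflow are additions that the original does not make.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse "<number>[kKmMgGtT]" into a 64-bit value.
 * Returns 0 on success, EINVAL for malformed input, ERANGE on overflow.
 */
static int
parse_size(const char *str, uint64_t *val)
{
	char *endptr;
	unsigned int shift = 0;

	/* strtoull() silently negates a leading '-', so reject it here. */
	if (str[0] == '-')
		return (EINVAL);

	errno = 0;
	*val = strtoull(str, &endptr, 0);
	if (endptr == str || errno == ERANGE)
		return (EINVAL);

	switch (endptr[0]) {
	case 'k': case 'K': shift = 10; break;
	case 'm': case 'M': shift = 20; break;
	case 'g': case 'G': shift = 30; break;
	case 't': case 'T': shift = 40; break;
	case '\0': break;
	default:
		return (EINVAL);
	}

	/* Assumption: nothing may follow the suffix (stricter than above). */
	if (shift != 0 && endptr[1] != '\0')
		return (EINVAL);
	/* Refuse values that would not survive the shift. */
	if (shift != 0 && *val > (UINT64_MAX >> shift))
		return (ERANGE);

	*val <<= shift;
	return (0);
}
```

Because the base argument is 0, hexadecimal input such as "0x10" is accepted the same way it is by the original strtoll() call.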
+4 -5
@@ -16,13 +16,12 @@ zpool_SOURCES = \
zpool_LDADD = \
$(top_builddir)/lib/libnvpair/libnvpair.la \
$(top_builddir)/lib/libuutil/libuutil.la \
$(top_builddir)/lib/libzpool/libzpool.la \
$(top_builddir)/lib/libzfs/libzfs.la \
$(top_builddir)/lib/libzfs_core/libzfs_core.la \
-lm $(LIBBLKID)
$(top_builddir)/lib/libzfs/libzfs.la
zpool_LDADD += -lm $(LIBBLKID)
zpoolconfdir = $(sysconfdir)/zfs/zpool.d
zpoolexecdir = $(libexecdir)/zfs/zpool.d
zpoolexecdir = $(zfsexecdir)/zpool.d
EXTRA_DIST = zpool.d/README
+3 -3
@@ -69,7 +69,7 @@ if [ "$1" = "-h" ] ; then
exit
fi
smartctl_path=$(which smartctl)
smartctl_path=$(command -v smartctl)
if [ -b "$VDEV_UPATH" ] && [ -x "$smartctl_path" ] || [ -n "$samples" ] ; then
if [ -n "$samples" ] ; then
@@ -228,7 +228,7 @@ smart_test)
esac
with_vals=$(echo "$out" | grep -E "$scripts")
if [ ! -z "$with_vals" ]; then
if [ -n "$with_vals" ]; then
echo "$with_vals"
without_vals=$(echo "$scripts" | tr "|" "\n" |
grep -v -E "$(echo "$with_vals" |
@@ -237,6 +237,6 @@ else
without_vals=$(echo "$scripts" | tr "|" "\n" | awk '{print $0"="}')
fi
if [ ! -z "$without_vals" ]; then
if [ -n "$without_vals" ]; then
echo "$without_vals"
fi
+9 -20
View File
@@ -33,8 +33,10 @@
#include <stdio.h>
#include <stdlib.h>
#include <strings.h>
#include <thread_pool.h>
#include <libzfs.h>
#include <libzutil.h>
#include <sys/zfs_context.h>
#include <sys/wait.h>
@@ -668,34 +670,21 @@ all_pools_for_each_vdev_gather_cb(zpool_handle_t *zhp, void *cb_vcdl)
static void
all_pools_for_each_vdev_run_vcdl(vdev_cmd_data_list_t *vcdl)
{
taskq_t *t;
int i;
/* 5 * boot_ncpus selfishly chosen since it works best on LLNL's HW */
int max_threads = 5 * boot_ncpus;
tpool_t *t;
/*
* Under Linux we use a taskq to parallelize running a command
* on each vdev. It is therefore necessary to initialize this
* functionality for the duration of the threads.
*/
thread_init();
t = taskq_create("z_pool_cmd", max_threads, defclsyspri, max_threads,
INT_MAX, 0);
t = tpool_create(1, 5 * sysconf(_SC_NPROCESSORS_ONLN), 0, NULL);
if (t == NULL)
return;
/* Spawn off the command for each vdev */
for (i = 0; i < vcdl->count; i++) {
(void) taskq_dispatch(t, vdev_run_cmd_thread,
(void *) &vcdl->data[i], TQ_SLEEP);
for (int i = 0; i < vcdl->count; i++) {
(void) tpool_dispatch(t, vdev_run_cmd_thread,
(void *) &vcdl->data[i]);
}
/* Wait for threads to finish */
taskq_wait(t);
taskq_destroy(t);
thread_fini();
tpool_wait(t);
tpool_destroy(t);
}
/*
+1616 -325
File diff suppressed because it is too large
+26
@@ -111,3 +111,29 @@ isnumber(char *str)
return (1);
}
/*
* Find highest one bit set.
* Returns bit number + 1 of highest bit that is set, otherwise returns 0.
*/
int
highbit64(uint64_t i)
{
if (i == 0)
return (0);
return (NBBY * sizeof (uint64_t) - __builtin_clzll(i));
}
/*
* Find lowest one bit set.
* Returns bit number + 1 of lowest bit that is set, otherwise returns 0.
*/
int
lowbit64(uint64_t i)
{
if (i == 0)
return (0);
return (__builtin_ffsll(i));
}
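The two helpers added above return a 1-based bit position, with 0 reserved for "no bit set". A small standalone sketch of that convention follows. It assumes a GCC- or Clang-compatible compiler for the builtins and uses CHAR_BIT in place of NBBY so it compiles outside the ZFS tree. The is_pow2() helper is hypothetical and only illustrates one way the pair can be combined.

```c
#include <limits.h>
#include <stdint.h>

/* Bit number + 1 of the highest set bit, or 0 when no bit is set. */
static int
highbit64(uint64_t i)
{
	if (i == 0)
		return (0);
	return ((int)(CHAR_BIT * sizeof (uint64_t)) - __builtin_clzll(i));
}

/* Bit number + 1 of the lowest set bit, or 0 when no bit is set. */
static int
lowbit64(uint64_t i)
{
	if (i == 0)
		return (0);
	return (__builtin_ffsll((long long)i));
}

/*
 * Hypothetical helper: a nonzero value is a power of two exactly
 * when its highest and lowest set bits are the same bit.
 */
static int
is_pow2(uint64_t v)
{
	return (v != 0 && highbit64(v) == lowbit64(v));
}
```

For 0x18 (binary 11000) highbit64() gives 5 and lowbit64() gives 4, so is_pow2() reports false; for 4096 both give 13.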
+2
@@ -43,6 +43,8 @@ void zpool_no_memory(void);
uint_t num_logs(nvlist_t *nv);
uint64_t array64_max(uint64_t array[], unsigned int len);
int isnumber(char *str);
int highbit64(uint64_t i);
int lowbit64(uint64_t i);
/*
* Misc utility functions
+114 -13
@@ -21,8 +21,8 @@
/*
* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
* Copyright (c) 2013, 2015 by Delphix. All rights reserved.
* Copyright (c) 2016 Intel Corporation.
* Copyright (c) 2013, 2018 by Delphix. All rights reserved.
* Copyright (c) 2016, 2017 Intel Corporation.
* Copyright 2016 Igor Kozhukhov <ikozhukhov@gmail.com>.
*/
@@ -69,6 +69,7 @@
#include <fcntl.h>
#include <libintl.h>
#include <libnvpair.h>
#include <libzutil.h>
#include <limits.h>
#include <sys/spa.h>
#include <scsi/scsi.h>
@@ -188,6 +189,7 @@ static vdev_disk_db_entry_t vdev_disk_database[] = {
{"ATA INTEL SSDSC2BB60", 4096},
{"ATA INTEL SSDSC2BB80", 4096},
{"ATA INTEL SSDSC2BW24", 4096},
{"ATA INTEL SSDSC2BW48", 4096},
{"ATA INTEL SSDSC2BP24", 4096},
{"ATA INTEL SSDSC2BP48", 4096},
{"NA SmrtStorSDLKAE9W", 4096},
@@ -418,14 +420,20 @@ check_disk(const char *path, blkid_cache cache, int force,
char slice_path[MAXPATHLEN];
int err = 0;
int fd, i;
int flags = O_RDONLY|O_DIRECT;
if (!iswholedisk)
return (check_slice(path, cache, force, isspare));
if ((fd = open(path, O_RDONLY|O_DIRECT|O_EXCL)) < 0) {
/* only spares can be shared, other devices require exclusive access */
if (!isspare)
flags |= O_EXCL;
if ((fd = open(path, flags)) < 0) {
char *value = blkid_get_tag_value(cache, "TYPE", path);
(void) fprintf(stderr, gettext("%s is in use and contains "
"a %s filesystem.\n"), path, value ? value : "unknown");
free(value);
return (-1);
}
@@ -546,7 +554,7 @@ is_spare(nvlist_t *config, const char *path)
uint_t i, nspares;
boolean_t inuse;
if ((fd = open(path, O_RDONLY)) < 0)
if ((fd = open(path, O_RDONLY|O_DIRECT)) < 0)
return (B_FALSE);
if (zpool_in_use(g_zfs, fd, &state, &name, &inuse) != 0 ||
@@ -683,6 +691,9 @@ make_leaf_vdev(nvlist_t *props, const char *arg, uint64_t is_log)
verify(nvlist_add_string(vdev, ZPOOL_CONFIG_PATH, path) == 0);
verify(nvlist_add_string(vdev, ZPOOL_CONFIG_TYPE, type) == 0);
verify(nvlist_add_uint64(vdev, ZPOOL_CONFIG_IS_LOG, is_log) == 0);
if (is_log)
verify(nvlist_add_string(vdev, ZPOOL_CONFIG_ALLOCATION_BIAS,
VDEV_ALLOC_BIAS_LOG) == 0);
if (strcmp(type, VDEV_TYPE_DISK) == 0)
verify(nvlist_add_uint64(vdev, ZPOOL_CONFIG_WHOLE_DISK,
(uint64_t)wholedisk) == 0);
@@ -741,6 +752,9 @@ make_leaf_vdev(nvlist_t *props, const char *arg, uint64_t is_log)
*
* Otherwise, make sure that the current spec (if there is one) and the new
* spec have consistent replication levels.
*
* If there is no current spec (create), make sure new spec has at least
* one general purpose vdev.
*/
typedef struct replication_level {
char *zprl_type;
@@ -964,7 +978,7 @@ get_replication(nvlist_t *nvroot, boolean_t fatal)
/*
* At this point, we have the replication of the last toplevel
* vdev in 'rep'. Compare it to 'lastrep' to see if its
* vdev in 'rep'. Compare it to 'lastrep' to see if it is
* different.
*/
if (lastrep.zprl_type != NULL) {
@@ -1273,7 +1287,7 @@ make_disks(zpool_handle_t *zhp, nvlist_t *nv)
* symbolic link will be removed, partition table created,
* and then block until udev creates the new link.
*/
if (!is_exclusive || !is_spare(NULL, udevpath)) {
if (!is_exclusive && !is_spare(NULL, udevpath)) {
char *devnode = strrchr(devpath, '/') + 1;
ret = strncmp(udevpath, UDISK_ROOT, strlen(UDISK_ROOT));
@@ -1465,6 +1479,13 @@ is_grouping(const char *type, int *mindev, int *maxdev)
return (VDEV_TYPE_LOG);
}
if (strcmp(type, VDEV_ALLOC_BIAS_SPECIAL) == 0 ||
strcmp(type, VDEV_ALLOC_BIAS_DEDUP) == 0) {
if (mindev != NULL)
*mindev = 1;
return (type);
}
if (strcmp(type, "cache") == 0) {
if (mindev != NULL)
*mindev = 1;
@@ -1486,7 +1507,7 @@ construct_spec(nvlist_t *props, int argc, char **argv)
nvlist_t *nvroot, *nv, **top, **spares, **l2cache;
int t, toplevels, mindev, maxdev, nspares, nlogs, nl2cache;
const char *type;
uint64_t is_log;
uint64_t is_log, is_special, is_dedup;
boolean_t seen_logs;
top = NULL;
@@ -1496,7 +1517,7 @@ construct_spec(nvlist_t *props, int argc, char **argv)
nspares = 0;
nlogs = 0;
nl2cache = 0;
is_log = B_FALSE;
is_log = is_special = is_dedup = B_FALSE;
seen_logs = B_FALSE;
nvroot = NULL;
@@ -1519,7 +1540,7 @@ construct_spec(nvlist_t *props, int argc, char **argv)
"specified only once\n"));
goto spec_out;
}
is_log = B_FALSE;
is_log = is_special = is_dedup = B_FALSE;
}
if (strcmp(type, VDEV_TYPE_LOG) == 0) {
@@ -1532,6 +1553,8 @@ construct_spec(nvlist_t *props, int argc, char **argv)
}
seen_logs = B_TRUE;
is_log = B_TRUE;
is_special = B_FALSE;
is_dedup = B_FALSE;
argc--;
argv++;
/*
@@ -1541,6 +1564,24 @@ construct_spec(nvlist_t *props, int argc, char **argv)
continue;
}
if (strcmp(type, VDEV_ALLOC_BIAS_SPECIAL) == 0) {
is_special = B_TRUE;
is_log = B_FALSE;
is_dedup = B_FALSE;
argc--;
argv++;
continue;
}
if (strcmp(type, VDEV_ALLOC_BIAS_DEDUP) == 0) {
is_dedup = B_TRUE;
is_log = B_FALSE;
is_special = B_FALSE;
argc--;
argv++;
continue;
}
if (strcmp(type, VDEV_TYPE_L2CACHE) == 0) {
if (l2cache != NULL) {
(void) fprintf(stderr,
@@ -1549,15 +1590,16 @@ construct_spec(nvlist_t *props, int argc, char **argv)
"specified only once\n"));
goto spec_out;
}
is_log = B_FALSE;
is_log = is_special = is_dedup = B_FALSE;
}
if (is_log) {
if (is_log || is_special || is_dedup) {
if (strcmp(type, VDEV_TYPE_MIRROR) != 0) {
(void) fprintf(stderr,
gettext("invalid vdev "
"specification: unsupported 'log' "
"device: %s\n"), type);
"specification: unsupported '%s' "
"device: %s\n"), is_log ? "log" :
"special", type);
goto spec_out;
}
nlogs++;
@@ -1614,12 +1656,27 @@ construct_spec(nvlist_t *props, int argc, char **argv)
nl2cache = children;
continue;
} else {
/* create a top-level vdev with children */
verify(nvlist_alloc(&nv, NV_UNIQUE_NAME,
0) == 0);
verify(nvlist_add_string(nv, ZPOOL_CONFIG_TYPE,
type) == 0);
verify(nvlist_add_uint64(nv,
ZPOOL_CONFIG_IS_LOG, is_log) == 0);
if (is_log)
verify(nvlist_add_string(nv,
ZPOOL_CONFIG_ALLOCATION_BIAS,
VDEV_ALLOC_BIAS_LOG) == 0);
if (is_special) {
verify(nvlist_add_string(nv,
ZPOOL_CONFIG_ALLOCATION_BIAS,
VDEV_ALLOC_BIAS_SPECIAL) == 0);
}
if (is_dedup) {
verify(nvlist_add_string(nv,
ZPOOL_CONFIG_ALLOCATION_BIAS,
VDEV_ALLOC_BIAS_DEDUP) == 0);
}
if (strcmp(type, VDEV_TYPE_RAIDZ) == 0) {
verify(nvlist_add_uint64(nv,
ZPOOL_CONFIG_NPARITY,
@@ -1644,6 +1701,16 @@ construct_spec(nvlist_t *props, int argc, char **argv)
if (is_log)
nlogs++;
if (is_special) {
verify(nvlist_add_string(nv,
ZPOOL_CONFIG_ALLOCATION_BIAS,
VDEV_ALLOC_BIAS_SPECIAL) == 0);
}
if (is_dedup) {
verify(nvlist_add_string(nv,
ZPOOL_CONFIG_ALLOCATION_BIAS,
VDEV_ALLOC_BIAS_DEDUP) == 0);
}
argc--;
argv++;
}
@@ -1744,6 +1811,30 @@ split_mirror_vdev(zpool_handle_t *zhp, char *newname, nvlist_t *props,
return (newroot);
}
static int
num_normal_vdevs(nvlist_t *nvroot)
{
nvlist_t **top;
uint_t t, toplevels, normal = 0;
verify(nvlist_lookup_nvlist_array(nvroot, ZPOOL_CONFIG_CHILDREN,
&top, &toplevels) == 0);
for (t = 0; t < toplevels; t++) {
uint64_t log = B_FALSE;
(void) nvlist_lookup_uint64(top[t], ZPOOL_CONFIG_IS_LOG, &log);
if (log)
continue;
if (nvlist_exists(top[t], ZPOOL_CONFIG_ALLOCATION_BIAS))
continue;
normal++;
}
return (normal);
}
/*
* Get and validate the contents of the given vdev specification. This ensures
* that the nvlist returned is well-formed, that all the devices exist, and that
@@ -1796,6 +1887,16 @@ make_root_vdev(zpool_handle_t *zhp, nvlist_t *props, int force, int check_rep,
return (NULL);
}
/*
* On pool create the new vdev spec must have one normal vdev.
*/
if (poolconfig == NULL && num_normal_vdevs(newroot) == 0) {
vdev_error(gettext("at least one general top-level vdev must "
"be specified\n"));
nvlist_free(newroot);
return (NULL);
}
/*
* Run through the vdev specification and label any whole disks found.
*/
+1 -4
@@ -11,7 +11,4 @@ zstreamdump_SOURCES = \
zstreamdump_LDADD = \
$(top_builddir)/lib/libnvpair/libnvpair.la \
$(top_builddir)/lib/libuutil/libuutil.la \
$(top_builddir)/lib/libzpool/libzpool.la \
$(top_builddir)/lib/libzfs/libzfs.la \
$(top_builddir)/lib/libzfs_core/libzfs_core.la
$(top_builddir)/lib/libzfs/libzfs.la
+174 -51
@@ -53,7 +53,6 @@
*/
#define DUMP_GROUPING 4
uint64_t total_write_size = 0;
uint64_t total_stream_len = 0;
FILE *send_stream = 0;
boolean_t do_byteswap = B_FALSE;
@@ -197,12 +196,36 @@ print_block(char *buf, int length)
}
}
/*
* Format an array of bytes into str as hexadecimal characters. str must
* have buf_len * 2 + 1 bytes of space.
*/
static void
sprintf_bytes(char *str, uint8_t *buf, uint_t buf_len)
{
int i, n;
for (i = 0; i < buf_len; i++) {
n = sprintf(str, "%02x", buf[i] & 0xff);
str += n;
}
str[0] = '\0';
}
int
main(int argc, char *argv[])
{
char *buf = safe_malloc(SPA_MAXBLOCKSIZE);
uint64_t drr_record_count[DRR_NUMTYPES] = { 0 };
uint64_t total_payload_size = 0;
uint64_t total_overhead_size = 0;
uint64_t drr_byte_count[DRR_NUMTYPES] = { 0 };
char salt[ZIO_DATA_SALT_LEN * 2 + 1];
char iv[ZIO_DATA_IV_LEN * 2 + 1];
char mac[ZIO_DATA_MAC_LEN * 2 + 1];
uint64_t total_records = 0;
uint64_t payload_size;
dmu_replay_record_t thedrr;
dmu_replay_record_t *drr = &thedrr;
struct drr_begin *drrb = &thedrr.drr_u.drr_begin;
@@ -214,8 +237,9 @@ main(int argc, char *argv[])
struct drr_free *drrf = &thedrr.drr_u.drr_free;
struct drr_spill *drrs = &thedrr.drr_u.drr_spill;
struct drr_write_embedded *drrwe = &thedrr.drr_u.drr_write_embedded;
struct drr_object_range *drror = &thedrr.drr_u.drr_object_range;
struct drr_checksum *drrc = &thedrr.drr_u.drr_checksum;
char c;
int c;
boolean_t verbose = B_FALSE;
boolean_t very_verbose = B_FALSE;
boolean_t first = B_TRUE;
@@ -314,7 +338,9 @@ main(int argc, char *argv[])
}
drr_record_count[drr->drr_type]++;
total_overhead_size += sizeof (*drr);
total_records++;
payload_size = 0;
switch (drr->drr_type) {
case DRR_BEGIN:
@@ -362,10 +388,13 @@ main(int argc, char *argv[])
if (ferror(send_stream))
perror("fread");
err = nvlist_unpack(buf, sz, &nv, 0);
if (err)
if (err) {
perror(strerror(err));
nvlist_print(stdout, nv);
nvlist_free(nv);
} else {
nvlist_print(stdout, nv);
nvlist_free(nv);
}
payload_size = sz;
}
break;
@@ -418,26 +447,39 @@ main(int argc, char *argv[])
drro->drr_blksz = BSWAP_32(drro->drr_blksz);
drro->drr_bonuslen =
BSWAP_32(drro->drr_bonuslen);
drro->drr_raw_bonuslen =
BSWAP_32(drro->drr_raw_bonuslen);
drro->drr_toguid = BSWAP_64(drro->drr_toguid);
drro->drr_maxblkid =
BSWAP_64(drro->drr_maxblkid);
}
payload_size = DRR_OBJECT_PAYLOAD_SIZE(drro);
if (verbose) {
(void) printf("OBJECT object = %llu type = %u "
"bonustype = %u blksz = %u bonuslen = %u "
"dn_slots = %u\n",
"dn_slots = %u raw_bonuslen = %u "
"flags = %u maxblkid = %llu "
"indblkshift = %u nlevels = %u "
"nblkptr = %u\n",
(u_longlong_t)drro->drr_object,
drro->drr_type,
drro->drr_bonustype,
drro->drr_blksz,
drro->drr_bonuslen,
drro->drr_dn_slots);
drro->drr_dn_slots,
drro->drr_raw_bonuslen,
drro->drr_flags,
(u_longlong_t)drro->drr_maxblkid,
drro->drr_indblkshift,
drro->drr_nlevels,
drro->drr_nblkptr);
}
if (drro->drr_bonuslen > 0) {
(void) ssread(buf,
P2ROUNDUP(drro->drr_bonuslen, 8), &zc);
if (dump) {
print_block(buf,
P2ROUNDUP(drro->drr_bonuslen, 8));
}
(void) ssread(buf, payload_size, &zc);
if (dump)
print_block(buf, payload_size);
}
break;
@@ -471,28 +513,40 @@ main(int argc, char *argv[])
BSWAP_64(drrw->drr_compressed_size);
}
uint64_t payload_size = DRR_WRITE_PAYLOAD_SIZE(drrw);
payload_size = DRR_WRITE_PAYLOAD_SIZE(drrw);
/*
* If this is verbose and/or dump output,
* print info on the modified block
*/
if (verbose) {
sprintf_bytes(salt, drrw->drr_salt,
ZIO_DATA_SALT_LEN);
sprintf_bytes(iv, drrw->drr_iv,
ZIO_DATA_IV_LEN);
sprintf_bytes(mac, drrw->drr_mac,
ZIO_DATA_MAC_LEN);
(void) printf("WRITE object = %llu type = %u "
"checksum type = %u compression type = %u\n"
" offset = %llu logical_size = %llu "
"checksum type = %u compression type = %u "
"flags = %u offset = %llu "
"logical_size = %llu "
"compressed_size = %llu "
"payload_size = %llu "
"props = %llx\n",
"payload_size = %llu props = %llx "
"salt = %s iv = %s mac = %s\n",
(u_longlong_t)drrw->drr_object,
drrw->drr_type,
drrw->drr_checksumtype,
drrw->drr_compressiontype,
drrw->drr_flags,
(u_longlong_t)drrw->drr_offset,
(u_longlong_t)drrw->drr_logical_size,
(u_longlong_t)drrw->drr_compressed_size,
(u_longlong_t)payload_size,
(u_longlong_t)drrw->drr_key.ddk_prop);
(u_longlong_t)drrw->drr_key.ddk_prop,
salt,
iv,
mac);
}
/*
@@ -505,7 +559,6 @@ main(int argc, char *argv[])
if (dump) {
print_block(buf, payload_size);
}
total_write_size += payload_size;
break;
case DRR_WRITE_BYREF:
@@ -529,10 +582,10 @@ main(int argc, char *argv[])
}
if (verbose) {
(void) printf("WRITE_BYREF object = %llu "
"checksum type = %u props = %llx\n"
" offset = %llu length = %llu\n"
"toguid = %llx refguid = %llx\n"
" refobject = %llu refoffset = %llu\n",
"checksum type = %u props = %llx "
"offset = %llu length = %llu "
"toguid = %llx refguid = %llx "
"refobject = %llu refoffset = %llu\n",
(u_longlong_t)drrwbr->drr_object,
drrwbr->drr_checksumtype,
(u_longlong_t)drrwbr->drr_key.ddk_prop,
@@ -563,16 +616,40 @@ main(int argc, char *argv[])
if (do_byteswap) {
drrs->drr_object = BSWAP_64(drrs->drr_object);
drrs->drr_length = BSWAP_64(drrs->drr_length);
drrs->drr_compressed_size =
BSWAP_64(drrs->drr_compressed_size);
drrs->drr_type = BSWAP_32(drrs->drr_type);
}
payload_size = DRR_SPILL_PAYLOAD_SIZE(drrs);
if (verbose) {
sprintf_bytes(salt, drrs->drr_salt,
ZIO_DATA_SALT_LEN);
sprintf_bytes(iv, drrs->drr_iv,
ZIO_DATA_IV_LEN);
sprintf_bytes(mac, drrs->drr_mac,
ZIO_DATA_MAC_LEN);
(void) printf("SPILL block for object = %llu "
"length = %llu\n",
(long long unsigned int)drrs->drr_object,
(long long unsigned int)drrs->drr_length);
"length = %llu flags = %u "
"compression type = %u "
"compressed_size = %llu "
"payload_size = %llu "
"salt = %s iv = %s mac = %s\n",
(u_longlong_t)drrs->drr_object,
(u_longlong_t)drrs->drr_length,
drrs->drr_flags,
drrs->drr_compressiontype,
(u_longlong_t)drrs->drr_compressed_size,
(u_longlong_t)payload_size,
salt,
iv,
mac);
}
(void) ssread(buf, drrs->drr_length, &zc);
(void) ssread(buf, payload_size, &zc);
if (dump) {
print_block(buf, drrs->drr_length);
print_block(buf, payload_size);
}
break;
case DRR_WRITE_EMBEDDED:
@@ -592,8 +669,8 @@ main(int argc, char *argv[])
}
if (verbose) {
(void) printf("WRITE_EMBEDDED object = %llu "
"offset = %llu length = %llu\n"
" toguid = %llx comp = %u etype = %u "
"offset = %llu length = %llu "
"toguid = %llx comp = %u etype = %u "
"lsize = %u psize = %u\n",
(u_longlong_t)drrwe->drr_object,
(u_longlong_t)drrwe->drr_offset,
@@ -606,6 +683,38 @@ main(int argc, char *argv[])
}
(void) ssread(buf,
P2ROUNDUP(drrwe->drr_psize, 8), &zc);
if (dump) {
print_block(buf,
P2ROUNDUP(drrwe->drr_psize, 8));
}
payload_size = P2ROUNDUP(drrwe->drr_psize, 8);
break;
case DRR_OBJECT_RANGE:
if (do_byteswap) {
drror->drr_firstobj =
BSWAP_64(drror->drr_firstobj);
drror->drr_numslots =
BSWAP_64(drror->drr_numslots);
drror->drr_toguid = BSWAP_64(drror->drr_toguid);
}
if (verbose) {
sprintf_bytes(salt, drror->drr_salt,
ZIO_DATA_SALT_LEN);
sprintf_bytes(iv, drror->drr_iv,
ZIO_DATA_IV_LEN);
sprintf_bytes(mac, drror->drr_mac,
ZIO_DATA_MAC_LEN);
(void) printf("OBJECT_RANGE firstobj = %llu "
"numslots = %llu flags = %u "
"salt = %s iv = %s mac = %s\n",
(u_longlong_t)drror->drr_firstobj,
(u_longlong_t)drror->drr_numslots,
drror->drr_flags,
salt,
iv,
mac);
}
break;
case DRR_NUMTYPES:
/* should never be reached */
@@ -619,6 +728,8 @@ main(int argc, char *argv[])
(longlong_t)drrc->drr_checksum.zc_word[3]);
}
pcksum = zc;
drr_byte_count[drr->drr_type] += payload_size;
total_payload_size += payload_size;
}
free(buf);
fletcher_4_fini();
@@ -626,28 +737,40 @@ main(int argc, char *argv[])
/* Print final summary */
(void) printf("SUMMARY:\n");
(void) printf("\tTotal DRR_BEGIN records = %lld\n",
(u_longlong_t)drr_record_count[DRR_BEGIN]);
(void) printf("\tTotal DRR_END records = %lld\n",
(u_longlong_t)drr_record_count[DRR_END]);
(void) printf("\tTotal DRR_OBJECT records = %lld\n",
(u_longlong_t)drr_record_count[DRR_OBJECT]);
(void) printf("\tTotal DRR_FREEOBJECTS records = %lld\n",
(u_longlong_t)drr_record_count[DRR_FREEOBJECTS]);
(void) printf("\tTotal DRR_WRITE records = %lld\n",
(u_longlong_t)drr_record_count[DRR_WRITE]);
(void) printf("\tTotal DRR_WRITE_BYREF records = %lld\n",
(u_longlong_t)drr_record_count[DRR_WRITE_BYREF]);
(void) printf("\tTotal DRR_WRITE_EMBEDDED records = %lld\n",
(u_longlong_t)drr_record_count[DRR_WRITE_EMBEDDED]);
(void) printf("\tTotal DRR_FREE records = %lld\n",
(u_longlong_t)drr_record_count[DRR_FREE]);
(void) printf("\tTotal DRR_SPILL records = %lld\n",
(u_longlong_t)drr_record_count[DRR_SPILL]);
(void) printf("\tTotal DRR_BEGIN records = %lld (%llu bytes)\n",
(u_longlong_t)drr_record_count[DRR_BEGIN],
(u_longlong_t)drr_byte_count[DRR_BEGIN]);
(void) printf("\tTotal DRR_END records = %lld (%llu bytes)\n",
(u_longlong_t)drr_record_count[DRR_END],
(u_longlong_t)drr_byte_count[DRR_END]);
(void) printf("\tTotal DRR_OBJECT records = %lld (%llu bytes)\n",
(u_longlong_t)drr_record_count[DRR_OBJECT],
(u_longlong_t)drr_byte_count[DRR_OBJECT]);
(void) printf("\tTotal DRR_FREEOBJECTS records = %lld (%llu bytes)\n",
(u_longlong_t)drr_record_count[DRR_FREEOBJECTS],
(u_longlong_t)drr_byte_count[DRR_FREEOBJECTS]);
(void) printf("\tTotal DRR_WRITE records = %lld (%llu bytes)\n",
(u_longlong_t)drr_record_count[DRR_WRITE],
(u_longlong_t)drr_byte_count[DRR_WRITE]);
(void) printf("\tTotal DRR_WRITE_BYREF records = %lld (%llu bytes)\n",
(u_longlong_t)drr_record_count[DRR_WRITE_BYREF],
(u_longlong_t)drr_byte_count[DRR_WRITE_BYREF]);
(void) printf("\tTotal DRR_WRITE_EMBEDDED records = %lld (%llu "
"bytes)\n", (u_longlong_t)drr_record_count[DRR_WRITE_EMBEDDED],
(u_longlong_t)drr_byte_count[DRR_WRITE_EMBEDDED]);
(void) printf("\tTotal DRR_FREE records = %lld (%llu bytes)\n",
(u_longlong_t)drr_record_count[DRR_FREE],
(u_longlong_t)drr_byte_count[DRR_FREE]);
(void) printf("\tTotal DRR_SPILL records = %lld (%llu bytes)\n",
(u_longlong_t)drr_record_count[DRR_SPILL],
(u_longlong_t)drr_byte_count[DRR_SPILL]);
(void) printf("\tTotal records = %lld\n",
(u_longlong_t)total_records);
(void) printf("\tTotal write size = %lld (0x%llx)\n",
(u_longlong_t)total_write_size, (u_longlong_t)total_write_size);
(void) printf("\tTotal payload size = %lld (0x%llx)\n",
(u_longlong_t)total_payload_size, (u_longlong_t)total_payload_size);
(void) printf("\tTotal header overhead = %lld (0x%llx)\n",
(u_longlong_t)total_overhead_size,
(u_longlong_t)total_overhead_size);
(void) printf("\tTotal stream length = %lld (0x%llx)\n",
(u_longlong_t)total_stream_len, (u_longlong_t)total_stream_len);
return (0);
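Reviewer note: the WRITE_EMBEDDED case pads `drr_psize` with `P2ROUNDUP(..., 8)` before the payload is read and added to the per-record byte counts. A minimal sketch of that rounding in shell arithmetic follows. It assumes `P2ROUNDUP` has the usual power-of-two round-up meaning, since the macro's definition is not part of this diff, and `roundup8` is a hypothetical name used only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: round $1 up to the next multiple of 8.
# Assumes P2ROUNDUP(x, 8) has the usual power-of-two round-up meaning;
# the macro's definition is not shown in this diff.
roundup8() {
	echo $(( ($1 + 7) & ~7 ))
}

r13=$(roundup8 13)
r16=$(roundup8 16)
r0=$(roundup8 0)
echo "13 -> $r13, 16 -> $r16, 0 -> $r0"
```

Under that assumption, a 13-byte embedded payload is accounted as 16 bytes, so the per-type byte totals in the summary can exceed the sum of the raw `drr_psize` values.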
+10 -8
@@ -1,9 +1,13 @@
include $(top_srcdir)/config/Rules.am
# -Wnoformat-truncation to get rid of compiler warning for unchecked
# truncating snprintfs on gcc 7.1.1.
AM_CFLAGS += $(DEBUG_STACKFLAGS) $(FRAME_LARGER_THAN) $(NO_FORMAT_TRUNCATION)
AM_CPPFLAGS += -DDEBUG
# Get rid of compiler warning for unchecked truncating snprintfs on gcc 7.1.1
AM_CFLAGS += $(NO_FORMAT_TRUNCATION)
# Includes kernel code, generate warnings for large stack frames
AM_CFLAGS += $(FRAME_LARGER_THAN)
# Unconditionally enable ASSERTs
AM_CPPFLAGS += -DDEBUG -UNDEBUG
DEFAULT_INCLUDES += \
-I$(top_srcdir)/include \
@@ -16,9 +20,7 @@ ztest_SOURCES = \
ztest_LDADD = \
$(top_builddir)/lib/libnvpair/libnvpair.la \
$(top_builddir)/lib/libuutil/libuutil.la \
$(top_builddir)/lib/libzpool/libzpool.la \
$(top_builddir)/lib/libzfs/libzfs.la \
$(top_builddir)/lib/libzfs_core/libzfs_core.la
$(top_builddir)/lib/libzpool/libzpool.la
ztest_LDADD += -lm
ztest_LDFLAGS = -pthread
+1268 -494
File diff suppressed because it is too large
+1
@@ -0,0 +1 @@
dist_bin_SCRIPTS = zvol_wait
+112
@@ -0,0 +1,112 @@
#!/bin/sh
count_zvols() {
if [ -z "$zvols" ]; then
echo 0
else
echo "$zvols" | wc -l
fi
}
filter_out_zvols_with_links() {
while read -r zvol; do
if [ ! -L "/dev/zvol/$zvol" ]; then
echo "$zvol"
fi
done
}
filter_out_deleted_zvols() {
while read -r zvol; do
if zfs list "$zvol" >/dev/null 2>&1; then
echo "$zvol"
fi
done
}
list_zvols() {
zfs list -t volume -H -o name,volmode,receive_resume_token |
while read -r zvol_line; do
name=$(echo "$zvol_line" | awk '{print $1}')
volmode=$(echo "$zvol_line" | awk '{print $2}')
token=$(echo "$zvol_line" | awk '{print $3}')
#
# /dev links are not created for zvols with volmode = "none".
#
[ "$volmode" = "none" ] && continue
#
# We also ignore partially received zvols unless the receive is
# incremental, as those won't even have a block device minor node
# created yet.
#
if [ "$token" != "-" ]; then
#
# Incremental receives create an invisible clone that
# is not automatically displayed by zfs list.
#
if ! zfs list "$name/%recv" >/dev/null 2>&1; then
continue
fi
fi
echo "$name"
done
}
zvols=$(list_zvols)
zvols_count=$(count_zvols)
if [ "$zvols_count" -eq 0 ]; then
echo "No zvols found, nothing to do."
exit 0
fi
echo "Testing $zvols_count zvol links"
outer_loop=0
while [ "$outer_loop" -lt 20 ]; do
outer_loop=$((outer_loop + 1))
old_zvols_count=$(count_zvols)
inner_loop=0
while [ "$inner_loop" -lt 30 ]; do
inner_loop=$((inner_loop + 1))
zvols="$(echo "$zvols" | filter_out_zvols_with_links)"
zvols_count=$(count_zvols)
if [ "$zvols_count" -eq 0 ]; then
echo "All zvol links are now present."
exit 0
fi
sleep 1
done
echo "Still waiting on $zvols_count zvol links ..."
#
# Although zvols should normally not be deleted at boot time,
# if that is the case then their links will be missing and
# we would stall.
#
if [ "$old_zvols_count" -eq "$zvols_count" ]; then
echo "No progress since last loop."
echo "Checking if any zvols were deleted."
zvols=$(echo "$zvols" | filter_out_deleted_zvols)
zvols_count=$(count_zvols)
if [ "$old_zvols_count" -ne "$zvols_count" ]; then
echo "$((old_zvols_count - zvols_count)) zvol(s) deleted."
fi
if [ "$zvols_count" -ne 0 ]; then
echo "Remaining zvols:"
echo "$zvols"
else
echo "All zvol links are now present."
exit 0
fi
fi
done
echo "Timed out waiting on zvol links"
exit 1
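Reviewer note: `count_zvols` special-cases the empty list because piping an empty string through `echo ... | wc -l` still reports one line. The sketch below reproduces that behavior in isolation. `count_lines` is a hypothetical stand-in that takes the list as an argument rather than reading the global `$zvols`, and the `tr -d ' '` is an added assumption to strip the padding some `wc` implementations emit.

```shell
#!/bin/sh
# Hypothetical stand-in for count_zvols: count the entries in a
# newline-separated list passed as $1.
count_lines() {
	if [ -z "$1" ]; then
		echo 0
	else
		# tr strips the leading padding some wc implementations print
		echo "$1" | wc -l | tr -d ' '
	fi
}

empty_count=$(count_lines "")
two_count=$(count_lines "pool/a
pool/b")
# Without the -z guard, an empty list is miscounted as one entry.
naive_count=$(echo "" | wc -l | tr -d ' ')
echo "empty=$empty_count two=$two_count naive=$naive_count"
```

Without the guard, the script would never see a count of zero and so would never report that all links are present.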
+20 -8
@@ -1,18 +1,30 @@
#
# Default build rules for all user space components, every Makefile.am
# should include these rules and override or extend them as needed.
#
DEFAULT_INCLUDES = -include ${top_builddir}/zfs_config.h
AM_LIBTOOLFLAGS = --silent
AM_CFLAGS = ${DEBUG_CFLAGS} -Wall -Wstrict-prototypes
AM_CFLAGS += ${NO_UNUSED_BUT_SET_VARIABLE}
AM_CFLAGS += ${NO_BOOL_COMPARE}
AM_CFLAGS += -fno-strict-aliasing
AM_CFLAGS += -std=gnu99
AM_CFLAGS = -std=gnu99 -Wall -Wstrict-prototypes -fno-strict-aliasing
AM_CFLAGS += $(NO_OMIT_FRAME_POINTER)
AM_CFLAGS += $(DEBUG_CFLAGS)
AM_CFLAGS += $(ASAN_CFLAGS)
AM_CFLAGS += $(CODE_COVERAGE_CFLAGS)
AM_CPPFLAGS = -D_GNU_SOURCE -D__EXTENSIONS__ -D_REENTRANT
AM_CPPFLAGS += -D_POSIX_PTHREAD_SEMANTICS -D_FILE_OFFSET_BITS=64
AM_CPPFLAGS += -D_LARGEFILE64_SOURCE -DHAVE_LARGE_STACKS=1
AM_CPPFLAGS = -D_GNU_SOURCE
AM_CPPFLAGS += -D_REENTRANT
AM_CPPFLAGS += -D_FILE_OFFSET_BITS=64
AM_CPPFLAGS += -D_LARGEFILE64_SOURCE
AM_CPPFLAGS += -DHAVE_LARGE_STACKS=1
AM_CPPFLAGS += -DTEXT_DOMAIN=\"zfs-linux-user\"
AM_CPPFLAGS += -DLIBEXECDIR=\"$(libexecdir)\"
AM_CPPFLAGS += -DRUNSTATEDIR=\"$(runstatedir)\"
AM_CPPFLAGS += -DSBINDIR=\"$(sbindir)\"
AM_CPPFLAGS += -DSYSCONFDIR=\"$(sysconfdir)\"
AM_CPPFLAGS += $(DEBUG_CPPFLAGS)
AM_CPPFLAGS += $(CODE_COVERAGE_CPPFLAGS)
AM_LDFLAGS = $(DEBUG_LDFLAGS)
AM_LDFLAGS += $(ASAN_LDFLAGS)
+162
@@ -0,0 +1,162 @@
dnl #
dnl # Enable -fsanitize=address if supported by gcc.
dnl #
dnl # LDFLAGS needs -fsanitize=address at all times so libraries compiled with
dnl # it will be linked successfully. CFLAGS will vary by binary being built.
dnl #
dnl # The ASAN_OPTIONS environment variable can be used to further control
dnl # the behavior of binaries and libraries built with -fsanitize=address.
dnl #
AC_DEFUN([ZFS_AC_CONFIG_ALWAYS_CC_ASAN], [
AC_MSG_CHECKING([whether to build with -fsanitize=address support])
AC_ARG_ENABLE([asan],
[AS_HELP_STRING([--enable-asan],
[Enable -fsanitize=address support @<:@default=no@:>@])],
[],
[enable_asan=no])
AM_CONDITIONAL([ASAN_ENABLED], [test x$enable_asan = xyes])
AC_SUBST([ASAN_ENABLED], [$enable_asan])
AC_MSG_RESULT($enable_asan)
AS_IF([ test "$enable_asan" = "yes" ], [
AC_MSG_CHECKING([whether $CC supports -fsanitize=address])
saved_cflags="$CFLAGS"
CFLAGS="$CFLAGS -fsanitize=address"
AC_LINK_IFELSE([
AC_LANG_SOURCE([[ int main() { return 0; } ]])
], [
ASAN_CFLAGS="-fsanitize=address"
ASAN_LDFLAGS="-fsanitize=address"
ASAN_ZFS="_with_asan"
AC_MSG_RESULT([yes])
], [
AC_MSG_ERROR([$CC does not support -fsanitize=address])
])
CFLAGS="$saved_cflags"
], [
ASAN_CFLAGS=""
ASAN_LDFLAGS=""
ASAN_ZFS="_without_asan"
])
AC_SUBST([ASAN_CFLAGS])
AC_SUBST([ASAN_LDFLAGS])
AC_SUBST([ASAN_ZFS])
])
dnl #
dnl # Check if gcc supports -Wframe-larger-than=<size> option.
dnl #
AC_DEFUN([ZFS_AC_CONFIG_ALWAYS_CC_FRAME_LARGER_THAN], [
AC_MSG_CHECKING([whether $CC supports -Wframe-larger-than=<size>])
saved_flags="$CFLAGS"
CFLAGS="$CFLAGS -Wframe-larger-than=4096"
AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], [])], [
FRAME_LARGER_THAN="-Wframe-larger-than=4096"
AC_MSG_RESULT([yes])
], [
FRAME_LARGER_THAN=""
AC_MSG_RESULT([no])
])
CFLAGS="$saved_flags"
AC_SUBST([FRAME_LARGER_THAN])
])
dnl #
dnl # Check if gcc supports -Wno-format-truncation option.
dnl #
AC_DEFUN([ZFS_AC_CONFIG_ALWAYS_CC_NO_FORMAT_TRUNCATION], [
AC_MSG_CHECKING([whether $CC supports -Wno-format-truncation])
saved_flags="$CFLAGS"
CFLAGS="$CFLAGS -Wno-format-truncation"
AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], [])], [
NO_FORMAT_TRUNCATION=-Wno-format-truncation
AC_MSG_RESULT([yes])
], [
NO_FORMAT_TRUNCATION=
AC_MSG_RESULT([no])
])
CFLAGS="$saved_flags"
AC_SUBST([NO_FORMAT_TRUNCATION])
])
dnl #
dnl # Check if gcc supports -Wno-bool-compare option.
dnl #
dnl # We actually invoke gcc with the -Wbool-compare option
dnl # and infer whether the 'no-' version exists based upon
dnl # the results. This is required because when checking any of
dnl # the no- prefixed options gcc always returns success.
dnl #
AC_DEFUN([ZFS_AC_CONFIG_ALWAYS_CC_NO_BOOL_COMPARE], [
AC_MSG_CHECKING([whether $CC supports -Wno-bool-compare])
saved_flags="$CFLAGS"
CFLAGS="$CFLAGS -Wbool-compare"
AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], [])], [
NO_BOOL_COMPARE=-Wno-bool-compare
AC_MSG_RESULT([yes])
], [
NO_BOOL_COMPARE=
AC_MSG_RESULT([no])
])
CFLAGS="$saved_flags"
AC_SUBST([NO_BOOL_COMPARE])
])
dnl #
dnl # Check if gcc supports -Wno-unused-but-set-variable option.
dnl #
dnl # We actually invoke gcc with the -Wunused-but-set-variable option
dnl # and infer whether the 'no-' version exists based upon
dnl # the results. This is required because when checking any of
dnl # the no- prefixed options gcc always returns success.
dnl #
AC_DEFUN([ZFS_AC_CONFIG_ALWAYS_CC_NO_UNUSED_BUT_SET_VARIABLE], [
AC_MSG_CHECKING([whether $CC supports -Wno-unused-but-set-variable])
saved_flags="$CFLAGS"
CFLAGS="$CFLAGS -Wunused-but-set-variable"
AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], [])], [
NO_UNUSED_BUT_SET_VARIABLE=-Wno-unused-but-set-variable
AC_MSG_RESULT([yes])
], [
NO_UNUSED_BUT_SET_VARIABLE=
AC_MSG_RESULT([no])
])
CFLAGS="$saved_flags"
AC_SUBST([NO_UNUSED_BUT_SET_VARIABLE])
])
dnl #
dnl # Check if gcc supports -fno-omit-frame-pointer option.
dnl #
AC_DEFUN([ZFS_AC_CONFIG_ALWAYS_CC_NO_OMIT_FRAME_POINTER], [
AC_MSG_CHECKING([whether $CC supports -fno-omit-frame-pointer])
saved_flags="$CFLAGS"
CFLAGS="$CFLAGS -fno-omit-frame-pointer"
AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], [])], [
NO_OMIT_FRAME_POINTER=-fno-omit-frame-pointer
AC_MSG_RESULT([yes])
], [
NO_OMIT_FRAME_POINTER=
AC_MSG_RESULT([no])
])
CFLAGS="$saved_flags"
AC_SUBST([NO_OMIT_FRAME_POINTER])
])
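Reviewer note: the `-Wno-bool-compare` and `-Wno-unused-but-set-variable` checks probe the positive spelling of the warning because, as the comments state, gcc always reports success for `no-` prefixed options. The following is a rough shell sketch of the same idea outside of autoconf. `negate_warning` and `probe_no_warning` are hypothetical names, and the `-Werror` is an added assumption (not in the macros) so that compilers which only warn about unknown options also fail the probe.

```shell
#!/bin/sh
# Hypothetical sketch: emit the -Wno- spelling only if the positive
# form of the warning is accepted by the compiler.
CC=${CC:-cc}

# Map -Wfoo to -Wno-foo (pure string manipulation).
negate_warning() {
	printf '%s\n' "$1" | sed 's/^-W/-Wno-/'
}

# Print the negated flag if $CC accepts the positive one, else nothing.
probe_no_warning() {
	src=$(mktemp "${TMPDIR:-/tmp}/probe.XXXXXX") || return 1
	echo 'int main(void) { return 0; }' > "$src"
	# -Werror is an assumption added here; the m4 macros do not use it.
	if "$CC" -Werror "$1" -x c -c "$src" -o /dev/null >/dev/null 2>&1; then
		negate_warning "$1"
	fi
	rm -f "$src"
}

negated=$(negate_warning -Wbool-compare)
echo "candidate: $negated"
echo "NO_BOOL_COMPARE=$(probe_no_warning -Wbool-compare)"
```

The last line prints an empty value when the compiler is missing or rejects the positive flag, which corresponds to the `NO_BOOL_COMPARE=` branch of the macro.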
-27
@@ -1,27 +0,0 @@
dnl #
dnl # Check if gcc supports -Wno-bool-compare option.
dnl #
dnl # We actually invoke gcc with the -Wbool-compare option
dnl # and infer the 'no-' version does or doesn't exist based upon
dnl # the results. This is required because when checking any of
dnl # no- prefixed options gcc always returns success.
dnl #
AC_DEFUN([ZFS_AC_CONFIG_ALWAYS_NO_BOOL_COMPARE], [
AC_MSG_CHECKING([for -Wno-bool-compare support])
saved_flags="$CFLAGS"
CFLAGS="$CFLAGS -Wbool-compare"
AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], [])],
[
NO_BOOL_COMPARE=-Wno-bool-compare
AC_MSG_RESULT([yes])
],
[
NO_BOOL_COMPARE=
AC_MSG_RESULT([no])
])
CFLAGS="$saved_flags"
AC_SUBST([NO_BOOL_COMPARE])
])
@@ -1,27 +0,0 @@
dnl #
dnl # Check if gcc supports -Wno-unused-but-set-variable option.
dnl #
dnl # We actually invoke gcc with the -Wunused-but-set-variable option
dnl # and infer the 'no-' version does or doesn't exist based upon
dnl # the results. This is required because when checking any of
dnl # no- prefixed options gcc always returns success.
dnl #
AC_DEFUN([ZFS_AC_CONFIG_ALWAYS_NO_UNUSED_BUT_SET_VARIABLE], [
AC_MSG_CHECKING([for -Wno-unused-but-set-variable support])
saved_flags="$CFLAGS"
CFLAGS="$CFLAGS -Wunused-but-set-variable"
AC_COMPILE_IFELSE([AC_LANG_PROGRAM([], [])],
[
NO_UNUSED_BUT_SET_VARIABLE=-Wno-unused-but-set-variable
AC_MSG_RESULT([yes])
],
[
NO_UNUSED_BUT_SET_VARIABLE=
AC_MSG_RESULT([no])
])
CFLAGS="$saved_flags"
AC_SUBST([NO_UNUSED_BUT_SET_VARIABLE])
])
+66
@@ -0,0 +1,66 @@
dnl #
dnl # The majority of the python scripts are written to be compatible
dnl # with Python 2.6 and Python 3.4. Therefore, they may be installed
dnl # and used with either interpreter. This option is intended to
dnl # provide a method to specify the default system version, and
dnl # set the PYTHON environment variable accordingly.
dnl #
AC_DEFUN([ZFS_AC_CONFIG_ALWAYS_PYTHON], [
AC_ARG_WITH([python],
AC_HELP_STRING([--with-python[=VERSION]],
[default system python version @<:@default=check@:>@]),
[with_python=$withval],
[with_python=check])
AS_CASE([$with_python],
[check], [AC_CHECK_PROGS([PYTHON], [python3 python2], [:])],
[2*], [PYTHON="python${with_python}"],
[*python2*], [PYTHON="${with_python}"],
[3*], [PYTHON="python${with_python}"],
[*python3*], [PYTHON="${with_python}"],
[no], [PYTHON=":"],
[AC_MSG_ERROR([Unknown --with-python value '$with_python'])]
)
dnl #
dnl # Minimum supported Python versions for utilities:
dnl # Python 2.6 or Python 3.4
dnl #
AM_PATH_PYTHON([], [], [:])
AS_IF([test -z "$PYTHON_VERSION"], [
PYTHON_VERSION=$(basename $PYTHON | tr -cd 0-9.)
])
PYTHON_MINOR=${PYTHON_VERSION#*\.}
AS_CASE([$PYTHON_VERSION],
[2.*], [
AS_IF([test $PYTHON_MINOR -lt 6],
[AC_MSG_ERROR("Python >= 2.6 is required")])
],
[3.*], [
AS_IF([test $PYTHON_MINOR -lt 4],
[AC_MSG_ERROR("Python >= 3.4 is required")])
],
[:|2|3], [],
[PYTHON_VERSION=3]
)
AM_CONDITIONAL([USING_PYTHON], [test "$PYTHON" != :])
AM_CONDITIONAL([USING_PYTHON_2], [test "x${PYTHON_VERSION%%\.*}" = x2])
AM_CONDITIONAL([USING_PYTHON_3], [test "x${PYTHON_VERSION%%\.*}" = x3])
dnl #
dnl # Request that packages be built for a specific Python version.
dnl #
AS_IF([test "x$with_python" != xcheck], [
PYTHON_PKG_VERSION=$(echo $PYTHON_VERSION | tr -d .)
DEFINE_PYTHON_PKG_VERSION='--define "__use_python_pkg_version '${PYTHON_PKG_VERSION}'"'
DEFINE_PYTHON_VERSION='--define "__use_python '${PYTHON}'"'
], [
DEFINE_PYTHON_VERSION=''
DEFINE_PYTHON_PKG_VERSION=''
])
AC_SUBST(DEFINE_PYTHON_VERSION)
AC_SUBST(DEFINE_PYTHON_PKG_VERSION)
])
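Reviewer note: the version handling in `ZFS_AC_CONFIG_ALWAYS_PYTHON` relies on two POSIX parameter expansions plus a `basename | tr` fallback. A small sketch of what each one yields is shown below. The `PYTHON_VERSION` value and the interpreter path are hypothetical inputs, and the sketch assumes a `major.minor` version string, which is the form the macro's minor-version test needs.

```shell
#!/bin/sh
# Hypothetical input; configure normally obtains this from AM_PATH_PYTHON.
PYTHON_VERSION=3.6

minor=${PYTHON_VERSION#*.}     # drop the shortest prefix ending in a dot
major=${PYTHON_VERSION%%.*}    # drop the longest suffix starting at a dot

# Fallback used when PYTHON_VERSION is empty: keep only digits and dots
# from the interpreter's file name (hypothetical path).
fallback=$(basename /usr/bin/python3.6 | tr -cd '0-9.')

echo "major=$major minor=$minor fallback=$fallback"
if [ "$major" = 3 ] && [ "$minor" -lt 4 ]; then
	echo "Python >= 3.4 is required" >&2
fi
```

A three-component string such as `3.6.8` would leave `6.8` in `minor` and make the numeric `-lt` test fail, so the check depends on the two-component form.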
+105
@@ -0,0 +1,105 @@
dnl #
dnl # ZFS_AC_PYTHON_MODULE(module_name, [action-if-true], [action-if-false])
dnl #
dnl # Checks for Python module. Freely inspired by AX_PYTHON_MODULE
dnl # https://www.gnu.org/software/autoconf-archive/ax_python_module.html
dnl # Required by ZFS_AC_CONFIG_ALWAYS_PYZFS.
dnl #
AC_DEFUN([ZFS_AC_PYTHON_MODULE], [
PYTHON_NAME=$(basename $PYTHON)
AC_MSG_CHECKING([for $PYTHON_NAME module: $1])
AS_IF([$PYTHON -c "import $1" 2>/dev/null], [
AC_MSG_RESULT(yes)
m4_ifvaln([$2], [$2])
], [
AC_MSG_RESULT(no)
m4_ifvaln([$3], [$3])
])
])
dnl #
dnl # Determines if pyzfs can be built; requires Python 2.7 or later.
dnl #
AC_DEFUN([ZFS_AC_CONFIG_ALWAYS_PYZFS], [
AC_ARG_ENABLE([pyzfs],
AC_HELP_STRING([--enable-pyzfs],
[install libzfs_core python bindings @<:@default=check@:>@]),
[enable_pyzfs=$enableval],
[enable_pyzfs=check])
dnl #
dnl # Packages for pyzfs specifically enabled/disabled.
dnl #
AS_IF([test "x$enable_pyzfs" != xcheck], [
AS_IF([test "x$enable_pyzfs" = xyes], [
DEFINE_PYZFS='--with pyzfs'
], [
DEFINE_PYZFS='--without pyzfs'
])
], [
AS_IF([test "$PYTHON" != :], [
DEFINE_PYZFS=''
], [
enable_pyzfs=no
DEFINE_PYZFS='--without pyzfs'
])
])
AC_SUBST(DEFINE_PYZFS)
dnl #
dnl # Require python-devel libraries
dnl #
AS_IF([test "x$enable_pyzfs" = xcheck -o "x$enable_pyzfs" = xyes], [
AS_CASE([$PYTHON_VERSION],
[3.*], [PYTHON_REQUIRED_VERSION=">= '3.4.0'"],
[2.*], [PYTHON_REQUIRED_VERSION=">= '2.7.0'"],
[AC_MSG_ERROR("Python $PYTHON_VERSION unknown")]
)
AX_PYTHON_DEVEL([$PYTHON_REQUIRED_VERSION], [
AS_IF([test "x$enable_pyzfs" = xyes], [
AC_MSG_ERROR("Python $PYTHON_REQUIRED_VERSION development library is not installed")
], [test "x$enable_pyzfs" != xno], [
enable_pyzfs=no
])
])
])
dnl #
dnl # Python "setuptools" module is required to build and install pyzfs
dnl #
AS_IF([test "x$enable_pyzfs" = xcheck -o "x$enable_pyzfs" = xyes], [
ZFS_AC_PYTHON_MODULE([setuptools], [], [
AS_IF([test "x$enable_pyzfs" = xyes], [
AC_MSG_ERROR("Python $PYTHON_VERSION setuptools is not installed")
], [test "x$enable_pyzfs" != xno], [
enable_pyzfs=no
])
])
])
dnl #
dnl # Python "cffi" module is required to run pyzfs
dnl #
AS_IF([test "x$enable_pyzfs" = xcheck -o "x$enable_pyzfs" = xyes], [
ZFS_AC_PYTHON_MODULE([cffi], [], [
AS_IF([test "x$enable_pyzfs" = xyes], [
AC_MSG_ERROR("Python $PYTHON_VERSION cffi is not installed")
], [test "x$enable_pyzfs" != xno], [
enable_pyzfs=no
])
])
])
dnl #
dnl # Set enable_pyzfs to 'yes' if every check passed
dnl #
AS_IF([test "x$enable_pyzfs" = xcheck], [enable_pyzfs=yes])
AM_CONDITIONAL([PYZFS_ENABLED], [test "x$enable_pyzfs" = xyes])
AC_SUBST([PYZFS_ENABLED], [$enable_pyzfs])
AC_SUBST(pythonsitedir, [$PYTHON_SITE_PKG])
AC_MSG_CHECKING([whether to enable pyzfs: ])
AC_MSG_RESULT($enable_pyzfs)
])
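Reviewer note: `--enable-pyzfs` is a three-state option. With `yes`, a missing dependency is a hard error. With `check` (the default), a missing dependency quietly downgrades to `no`, and `check` is promoted to `yes` only if every probe passed. The sketch below condenses that decision table to a single dependency. `resolve_feature` and its `found`/`missing` argument are hypothetical, whereas the real macro repeats the pattern for the Python development files, `setuptools`, and `cffi`.

```shell
#!/bin/sh
# Hypothetical helper: resolve a check/yes/no request against one probe.
# $1 = requested state (check|yes|no), $2 = probe result (found|missing)
resolve_feature() {
	case "$1:$2" in
	no:*)          echo no ;;
	yes:found)     echo yes ;;
	yes:missing)   echo error ;;  # configure would stop with AC_MSG_ERROR
	check:found)   echo yes ;;
	check:missing) echo no ;;
	*)             echo error ;;  # unrecognized input
	esac
}

auto_missing=$(resolve_feature check missing)
forced_missing=$(resolve_feature yes missing)
auto_found=$(resolve_feature check found)
echo "check+missing=$auto_missing yes+missing=$forced_missing check+found=$auto_found"
```

This keeps default builds working on hosts without the Python tooling while still failing loudly when pyzfs was explicitly requested.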
+345
@@ -0,0 +1,345 @@
# ===========================================================================
# https://www.gnu.org/software/autoconf-archive/ax_python_devel.html
# ===========================================================================
#
# SYNOPSIS
#
# AX_PYTHON_DEVEL([version], [action-if-not-found])
#
# DESCRIPTION
#
# Note: Defines as a precious variable "PYTHON_VERSION". Don't override it
# in your configure.ac.
#
# Note: this is a slightly modified version of the original AX_PYTHON_DEVEL
# macro which accepts an additional [action-if-not-found] argument. This
# allows detecting whether Python development files are available without
# aborting the configure phase with a hard error if they are not.
#
# This macro checks for Python and tries to get the include path to
# 'Python.h'. It provides the $(PYTHON_CPPFLAGS) and $(PYTHON_LIBS) output
# variables. It also exports $(PYTHON_EXTRA_LIBS) and
# $(PYTHON_EXTRA_LDFLAGS) for embedding Python in your code.
#
# You can search for some particular version of Python by passing a
# parameter to this macro, for example ">= '2.3.1'", or "== '2.4'". Please
# note that you *have* to pass also an operator along with the version to
# match, and pay special attention to the single quotes surrounding the
# version number. Don't use "PYTHON_VERSION" for this: that environment
# variable is declared as precious and thus reserved for the end-user.
#
# This macro should work for all versions of Python >= 2.1.0. As an end
# user, you can disable the check for the python version by setting the
# PYTHON_NOVERSIONCHECK environment variable to something other than the
# empty string.
#
# If you need to use this macro for an older Python version, please
# contact the authors. We're always open for feedback.
#
# LICENSE
#
# Copyright (c) 2009 Sebastian Huber <sebastian-huber@web.de>
# Copyright (c) 2009 Alan W. Irwin
# Copyright (c) 2009 Rafael Laboissiere <rafael@laboissiere.net>
# Copyright (c) 2009 Andrew Collier
# Copyright (c) 2009 Matteo Settenvini <matteo@member.fsf.org>
# Copyright (c) 2009 Horst Knorr <hk_classes@knoda.org>
# Copyright (c) 2013 Daniel Mullner <muellner@math.stanford.edu>
# Copyright (c) 2018 loli10K <ezomori.nozomu@gmail.com>
#
# This program is free software: you can redistribute it and/or modify it
# under the terms of the GNU General Public License as published by the
# Free Software Foundation, either version 3 of the License, or (at your
# option) any later version.
#
# This program is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
# Public License for more details.
#
# You should have received a copy of the GNU General Public License along
# with this program. If not, see <https://www.gnu.org/licenses/>.
#
# As a special exception, the respective Autoconf Macro's copyright owner
# gives unlimited permission to copy, distribute and modify the configure
# scripts that are the output of Autoconf when processing the Macro. You
# need not follow the terms of the GNU General Public License when using
# or distributing such scripts, even though portions of the text of the
# Macro appear in them. The GNU General Public License (GPL) does govern
# all other use of the material that constitutes the Autoconf Macro.
#
# This special exception to the GPL applies to versions of the Autoconf
# Macro released by the Autoconf Archive. When you make and distribute a
# modified version of the Autoconf Macro, you may extend this special
# exception to the GPL to apply to your modified version as well.
#serial 21
AU_ALIAS([AC_PYTHON_DEVEL], [AX_PYTHON_DEVEL])
AC_DEFUN([AX_PYTHON_DEVEL],[
#
# Allow the use of a (user set) custom python version
#
AC_ARG_VAR([PYTHON_VERSION],[The installed Python
version to use, for example '2.3'. This string
will be appended to the Python interpreter
canonical name.])
AC_PATH_PROG([PYTHON],[python[$PYTHON_VERSION]])
if test -z "$PYTHON"; then
m4_ifvaln([$2],[$2],[
AC_MSG_ERROR([Cannot find python$PYTHON_VERSION in your system path])
PYTHON_VERSION=""
])
fi
#
# Check for a version of Python >= 2.1.0
#
AC_MSG_CHECKING([for a version of Python >= '2.1.0'])
ac_supports_python_ver=`$PYTHON -c "import sys; \
ver = sys.version.split ()[[0]]; \
print (ver >= '2.1.0')"`
if test "$ac_supports_python_ver" != "True"; then
if test -z "$PYTHON_NOVERSIONCHECK"; then
AC_MSG_RESULT([no])
m4_ifvaln([$2],[$2],[
AC_MSG_FAILURE([
This version of the AC@&t@_PYTHON_DEVEL macro
doesn't work properly with versions of Python before
2.1.0. You may need to re-run configure, setting the
variables PYTHON_CPPFLAGS, PYTHON_LIBS, PYTHON_SITE_PKG,
PYTHON_EXTRA_LIBS and PYTHON_EXTRA_LDFLAGS by hand.
Moreover, to disable this check, set PYTHON_NOVERSIONCHECK
to something else than an empty string.
])
])
else
AC_MSG_RESULT([skip at user request])
fi
else
AC_MSG_RESULT([yes])
fi
#
# if the macro parameter ``version'' is set, honour it
#
if test -n "$1"; then
AC_MSG_CHECKING([for a version of Python $1])
ac_supports_python_ver=`$PYTHON -c "import sys; \
ver = sys.version.split ()[[0]]; \
print (ver $1)"`
if test "$ac_supports_python_ver" = "True"; then
AC_MSG_RESULT([yes])
else
AC_MSG_RESULT([no])
m4_ifvaln([$2],[$2],[
AC_MSG_ERROR([this package requires Python $1.
If you have it installed, but it isn't the default Python
interpreter in your system path, please pass the PYTHON_VERSION
variable to configure. See ``configure --help'' for reference.
])
PYTHON_VERSION=""
])
fi
fi
#
# Check if you have distutils, else fail
#
AC_MSG_CHECKING([for the distutils Python package])
ac_distutils_result=`$PYTHON -c "import distutils" 2>&1`
if test $? -eq 0; then
AC_MSG_RESULT([yes])
else
AC_MSG_RESULT([no])
m4_ifvaln([$2],[$2],[
AC_MSG_ERROR([cannot import Python module "distutils".
Please check your Python installation. The error was:
$ac_distutils_result])
PYTHON_VERSION=""
])
fi
#
# Check for Python include path
#
AC_MSG_CHECKING([for Python include path])
if test -z "$PYTHON_CPPFLAGS"; then
python_path=`$PYTHON -c "import distutils.sysconfig; \
print (distutils.sysconfig.get_python_inc ());"`
plat_python_path=`$PYTHON -c "import distutils.sysconfig; \
print (distutils.sysconfig.get_python_inc (plat_specific=1));"`
if test -n "${python_path}"; then
if test "${plat_python_path}" != "${python_path}"; then
python_path="-I$python_path -I$plat_python_path"
else
python_path="-I$python_path"
fi
fi
PYTHON_CPPFLAGS=$python_path
fi
AC_MSG_RESULT([$PYTHON_CPPFLAGS])
AC_SUBST([PYTHON_CPPFLAGS])
#
# Check for Python library path
#
AC_MSG_CHECKING([for Python library path])
if test -z "$PYTHON_LIBS"; then
# (makes two attempts to ensure we've got a version number
# from the interpreter)
ac_python_version=`cat<<EOD | $PYTHON -
# join all versioning strings, on some systems
# major/minor numbers could be in different list elements
from distutils.sysconfig import *
e = get_config_var('VERSION')
if e is not None:
print(e)
EOD`
if test -z "$ac_python_version"; then
if test -n "$PYTHON_VERSION"; then
ac_python_version=$PYTHON_VERSION
else
ac_python_version=`$PYTHON -c "import sys; \
print (sys.version[[:3]])"`
fi
fi
# Make the versioning information available to the compiler
AC_DEFINE_UNQUOTED([HAVE_PYTHON], ["$ac_python_version"],
[If available, contains the Python version number currently in use.])
# First, the library directory:
ac_python_libdir=`cat<<EOD | $PYTHON -
# There should be only one
import distutils.sysconfig
e = distutils.sysconfig.get_config_var('LIBDIR')
if e is not None:
print (e)
EOD`
# Now, for the library:
ac_python_library=`cat<<EOD | $PYTHON -
import distutils.sysconfig
c = distutils.sysconfig.get_config_vars()
if 'LDVERSION' in c:
print ('python'+c[['LDVERSION']])
else:
print ('python'+c[['VERSION']])
EOD`
# This small piece shamelessly adapted from PostgreSQL python macro;
# credits goes to momjian, I think. I'd like to put the right name
# in the credits, if someone can point me in the right direction... ?
#
if test -n "$ac_python_libdir" -a -n "$ac_python_library"
then
# use the official shared library
ac_python_library=`echo "$ac_python_library" | sed "s/^lib//"`
PYTHON_LIBS="-L$ac_python_libdir -l$ac_python_library"
else
# old way: use libpython from python_configdir
ac_python_libdir=`$PYTHON -c \
"from distutils.sysconfig import get_python_lib as f; \
import os; \
print (os.path.join(f(plat_specific=1, standard_lib=1), 'config'));"`
PYTHON_LIBS="-L$ac_python_libdir -lpython$ac_python_version"
fi
if test -z "$PYTHON_LIBS"; then
m4_ifvaln([$2],[$2],[
AC_MSG_ERROR([
Cannot determine location of your Python DSO. Please check it was installed with
dynamic libraries enabled, or try setting PYTHON_LIBS by hand.
])
])
fi
fi
AC_MSG_RESULT([$PYTHON_LIBS])
AC_SUBST([PYTHON_LIBS])
#
# Check for site packages
#
AC_MSG_CHECKING([for Python site-packages path])
if test -z "$PYTHON_SITE_PKG"; then
PYTHON_SITE_PKG=`$PYTHON -c "import distutils.sysconfig; \
print (distutils.sysconfig.get_python_lib(0,0));"`
fi
AC_MSG_RESULT([$PYTHON_SITE_PKG])
AC_SUBST([PYTHON_SITE_PKG])
#
# libraries which must be linked in when embedding
#
AC_MSG_CHECKING(python extra libraries)
if test -z "$PYTHON_EXTRA_LIBS"; then
PYTHON_EXTRA_LIBS=`$PYTHON -c "import distutils.sysconfig; \
conf = distutils.sysconfig.get_config_var; \
print (conf('LIBS') + ' ' + conf('SYSLIBS'))"`
fi
AC_MSG_RESULT([$PYTHON_EXTRA_LIBS])
AC_SUBST(PYTHON_EXTRA_LIBS)
#
# linking flags needed when embedding
#
AC_MSG_CHECKING(python extra linking flags)
if test -z "$PYTHON_EXTRA_LDFLAGS"; then
PYTHON_EXTRA_LDFLAGS=`$PYTHON -c "import distutils.sysconfig; \
conf = distutils.sysconfig.get_config_var; \
print (conf('LINKFORSHARED'))"`
fi
AC_MSG_RESULT([$PYTHON_EXTRA_LDFLAGS])
AC_SUBST(PYTHON_EXTRA_LDFLAGS)
#
# final check to see if everything compiles alright
#
AC_MSG_CHECKING([consistency of all components of python development environment])
# save current global flags
ac_save_LIBS="$LIBS"
ac_save_LDFLAGS="$LDFLAGS"
ac_save_CPPFLAGS="$CPPFLAGS"
LIBS="$ac_save_LIBS $PYTHON_LIBS $PYTHON_EXTRA_LIBS"
LDFLAGS="$ac_save_LDFLAGS $PYTHON_EXTRA_LDFLAGS"
CPPFLAGS="$ac_save_CPPFLAGS $PYTHON_CPPFLAGS"
AC_LANG_PUSH([C])
AC_LINK_IFELSE([
AC_LANG_PROGRAM([[#include <Python.h>]],
[[Py_Initialize();]])
],[pythonexists=yes],[pythonexists=no])
AC_LANG_POP([C])
# turn back to default flags
CPPFLAGS="$ac_save_CPPFLAGS"
LIBS="$ac_save_LIBS"
LDFLAGS="$ac_save_LDFLAGS"
AC_MSG_RESULT([$pythonexists])
if test ! "x$pythonexists" = "xyes"; then
m4_ifvaln([$2],[$2],[
AC_MSG_FAILURE([
Could not link test program to Python. Maybe the main Python library has been
installed in some non-standard library path. If so, pass it to configure,
via the LIBS environment variable.
Example: ./configure LIBS="-L/usr/non-standard-path/python/lib"
============================================================================
ERROR!
You probably have to install the development version of the Python package
for your distribution. The exact name of this package varies among them.
============================================================================
])
PYTHON_VERSION=""
])
fi
#
# all done!
#
])
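Several checks in the macro above use `test -z` on a shell variable to decide whether a value still has to be computed or reported as missing. A minimal shell sketch of why the `$` matters in such a test; the empty `PYTHON_LIBS` value is an assumption made only for illustration:

```shell
# Hypothetical state: the library probe produced nothing.
PYTHON_LIBS=""

# A literal, non-empty string is never "zero length".
if test -z "PYTHON_LIBS"; then literal=yes; else literal=no; fi

# The expanded variable is what the check is meant to look at.
if test -z "$PYTHON_LIBS"; then expanded=yes; else expanded=no; fi

echo "literal=$literal expanded=$expanded"
# prints "literal=no expanded=yes"
```

Only the second form can reach the error branch when the variable is empty.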
+31
@@ -0,0 +1,31 @@
# ===========================================================================
# http://www.gnu.org/software/autoconf-archive/ax_restore_flags.html
# ===========================================================================
#
# SYNOPSIS
#
# AX_RESTORE_FLAGS()
#
# DESCRIPTION
#
# Restore common compilation flags from temporary variables
#
# LICENSE
#
# Copyright (c) 2009 Filippo Giunchedi <filippo@esaurito.net>
#
# Copying and distribution of this file, with or without modification, are
# permitted in any medium without royalty provided the copyright notice
# and this notice are preserved. This file is offered as-is, without any
# warranty.
#serial 3
AC_DEFUN([AX_RESTORE_FLAGS], [
CPPFLAGS="${CPPFLAGS_save}"
CFLAGS="${CFLAGS_save}"
CXXFLAGS="${CXXFLAGS_save}"
OBJCFLAGS="${OBJCFLAGS_save}"
LDFLAGS="${LDFLAGS_save}"
LIBS="${LIBS_save}"
])
+31
@@ -0,0 +1,31 @@
# ===========================================================================
# http://www.gnu.org/software/autoconf-archive/ax_save_flags.html
# ===========================================================================
#
# SYNOPSIS
#
# AX_SAVE_FLAGS()
#
# DESCRIPTION
#
# Save common compilation flags into temporary variables
#
# LICENSE
#
# Copyright (c) 2009 Filippo Giunchedi <filippo@esaurito.net>
#
# Copying and distribution of this file, with or without modification, are
# permitted in any medium without royalty provided the copyright notice
# and this notice are preserved. This file is offered as-is, without any
# warranty.
#serial 3
AC_DEFUN([AX_SAVE_FLAGS], [
CPPFLAGS_save="${CPPFLAGS}"
CFLAGS_save="${CFLAGS}"
CXXFLAGS_save="${CXXFLAGS}"
OBJCFLAGS_save="${OBJCFLAGS}"
LDFLAGS_save="${LDFLAGS}"
LIBS_save="${LIBS}"
])
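AX_SAVE_FLAGS and AX_RESTORE_FLAGS are intended to bracket a temporary change to the compiler and linker variables. A minimal shell sketch of that pairing, using only two of the six variables; the starting values and the `-lfoo` library are hypothetical:

```shell
# Hypothetical starting values, for illustration only.
CPPFLAGS="-I/opt/include"
LIBS="-lm"

# What AX_SAVE_FLAGS expands to (subset of the variables).
CPPFLAGS_save="${CPPFLAGS}"
LIBS_save="${LIBS}"

# Temporary changes made while probing for a library.
CPPFLAGS="$CPPFLAGS -I/usr/local/include"
LIBS="$LIBS -lfoo"

# What AX_RESTORE_FLAGS expands to (same subset).
CPPFLAGS="${CPPFLAGS_save}"
LIBS="${LIBS_save}"

echo "CPPFLAGS=$CPPFLAGS LIBS=$LIBS"
# prints "CPPFLAGS=-I/opt/include LIBS=-lm"
```

Because the saved copies live in fixed `*_save` names, nested save/restore pairs would overwrite each other; the sketch assumes a single, non-nested use.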
+1
@@ -0,0 +1 @@
# `make distclean` deletes files with size 0. This text is to avoid that.
+6 -5
@@ -20,7 +20,7 @@ deb-kmod: deb-local rpm-kmod
arch=`$(RPM) -qp $${name}-kmod-$${version}.src.rpm --qf %{arch} | tail -1`; \
debarch=`$(DPKG) --print-architecture`; \
pkg1=kmod-$${name}*$${version}.$${arch}.rpm; \
-fakeroot $(ALIEN) --bump=0 --scripts --to-deb --target=$$debarch $$pkg1; \
+fakeroot $(ALIEN) --bump=0 --scripts --to-deb --target=$$debarch $$pkg1 || exit 1; \
$(RM) $$pkg1
@@ -30,7 +30,7 @@ deb-dkms: deb-local rpm-dkms
arch=`$(RPM) -qp $${name}-dkms-$${version}.src.rpm --qf %{arch} | tail -1`; \
debarch=`$(DPKG) --print-architecture`; \
pkg1=$${name}-dkms-$${version}.$${arch}.rpm; \
-fakeroot $(ALIEN) --bump=0 --scripts --to-deb --target=$$debarch $$pkg1; \
+fakeroot $(ALIEN) --bump=0 --scripts --to-deb --target=$$debarch $$pkg1 || exit 1; \
$(RM) $$pkg1
deb-utils: deb-local rpm-utils
@@ -45,8 +45,9 @@ deb-utils: deb-local rpm-utils
pkg5=libzpool2-$${version}.$${arch}.rpm; \
pkg6=libzfs2-devel-$${version}.$${arch}.rpm; \
pkg7=$${name}-test-$${version}.$${arch}.rpm; \
-pkg8=$${name}-dracut-$${version}.$${arch}.rpm; \
+pkg8=$${name}-dracut-$${version}.noarch.rpm; \
pkg9=$${name}-initramfs-$${version}.$${arch}.rpm; \
+pkg10=`ls python*-pyzfs-$${version}* | tail -1`; \
## Arguments need to be passed to dh_shlibdeps. Alien provides no mechanism
## to do this, so we install a shim onto the path which calls the real
## dh_shlibdeps with the required arguments.
@@ -62,10 +63,10 @@ deb-utils: deb-local rpm-utils
env PATH=$${path_prepend}:$${PATH} \
fakeroot $(ALIEN) --bump=0 --scripts --to-deb --target=$$debarch \
$$pkg1 $$pkg2 $$pkg3 $$pkg4 $$pkg5 $$pkg6 $$pkg7 \
-$$pkg8 $$pkg9; \
+$$pkg8 $$pkg9 $$pkg10 || exit 1; \
$(RM) $${path_prepend}/dh_shlibdeps; \
rmdir $${path_prepend}; \
$(RM) $$pkg1 $$pkg2 $$pkg3 $$pkg4 $$pkg5 $$pkg6 $$pkg7 \
-$$pkg8 $$pkg9;
+$$pkg8 $$pkg9 $$pkg10;
deb: deb-kmod deb-dkms deb-utils
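The hunks above append `|| exit 1` to the alien invocations. A minimal sketch of what that guard changes, assuming a recipe whose commands are joined with `;` and run in one shell; `false` stands in for a failing `alien` call and `true` for the cleanup step, neither of which is taken from the source:

```shell
# Without the guard, the failure is masked by the command that follows it.
unguarded=0
sh -c 'false; true' || unguarded=$?

# With the guard, the shell stops at the failing command and reports it.
guarded=0
sh -c 'false || exit 1; true' || guarded=$?

echo "unguarded=$unguarded guarded=$guarded"
# prints "unguarded=0 guarded=1"
```

In the recipe this means `make` sees a non-zero status and the trailing `$(RM)` of the source packages is skipped when the conversion fails.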
+73
@@ -0,0 +1,73 @@
# find_system_lib.m4 - Macros to search for a system library. -*- Autoconf -*-
dnl requires pkg.m4 from pkg-config
dnl requires ax_save_flags.m4 from autoconf-archive
dnl requires ax_restore_flags.m4 from autoconf-archive
dnl FIND_SYSTEM_LIBRARY(VARIABLE-PREFIX, MODULE, HEADER, HEADER-PREFIXES, LIBRARY, FUNCTIONS, [ACTION-IF-FOUND], [ACTION-IF-NOT-FOUND])
AC_DEFUN([FIND_SYSTEM_LIBRARY], [
AC_REQUIRE([PKG_PROG_PKG_CONFIG])
_library_found=
PKG_CHECK_MODULES([$1], [$2], [_library_found=1], [
AS_IF([test -f /usr/include/[$3]], [
AC_SUBST([$1][_CFLAGS], [])
AC_SUBST([$1][_LIBS], ["-l[$5]"])
_library_found=1
],[ AS_IF([test -f /usr/local/include/[$3]], [
AC_SUBST([$1][_CFLAGS], ["-I/usr/local/include"])
AC_SUBST([$1][_LIBS], ["-L/usr/local -l[$5]"])
_library_found=1
],[dnl ELSE
m4_foreach([prefix], [$4], [
AS_IF([test "x$_library_found" != "x1"], [
AS_IF([test -f [/usr/include/]prefix[/][$3]], [
AC_SUBST([$1][_CFLAGS], ["[-I/usr/include/]prefix["]])
AC_SUBST([$1][_LIBS], ["-l[$5]"])
_library_found=1
],[ AS_IF([test -f [/usr/local/include/]prefix[/][$3]], [
AC_SUBST([$1][_CFLAGS], ["[-I/usr/local/include/]prefix["]])
AC_SUBST([$1][_LIBS], ["-L/usr/local -l[$5]"])
_library_found=1
])])
])
])
])])
AS_IF([test -z "$_library_found"], [
AC_MSG_WARN([cannot find [$2] via pkg-config or in the standard locations])
])
])
dnl do some further sanity checks
AS_IF([test -n "$_library_found"], [
AX_SAVE_FLAGS
CPPFLAGS="$CPPFLAGS $(echo $[$1][_CFLAGS] | sed 's/-include */-include-/g; s/^/ /; s/ [^-][^ ]*//g; s/ -[^Ii][^ ]*//g; s/-include-/-include /g; s/^ //;')"
CFLAGS="$CFLAGS $[$1][_CFLAGS]"
LDFLAGS="$LDFLAGS $[$1][_LIBS]"
AC_CHECK_HEADER([$3], [], [
AC_MSG_WARN([header [$3] for library [$2] is not usable])
_library_found=
])
m4_foreach([func], [$6], [
AC_CHECK_LIB([$5], func, [], [
AC_MSG_WARN([cannot find ]func[ in library [$5]])
_library_found=
])
])
AX_RESTORE_FLAGS
])
AS_IF([test -n "$_library_found"], [
:;$7
],[dnl ELSE
:;$8
])
])
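The sanity-check step in FIND_SYSTEM_LIBRARY derives CPPFLAGS from the library's CFLAGS with a sed pipeline that keeps only the `-I` and `-include` options. A minimal sketch of that filter as it reads in the source, run directly in a shell (outside m4, so m4 bracket quoting plays no part); the function name `include_flags_only` and the sample flags are hypothetical:

```shell
# Keep only -I<dir> and -include <file> options from a flag string.
include_flags_only() {
    printf '%s\n' "$1" | sed 's/-include */-include-/g; s/^/ /; s/ [^-][^ ]*//g; s/ -[^Ii][^ ]*//g; s/-include-/-include /g; s/^ //;'
}

result=$(include_flags_only "-I/usr/include/foo -DBAR -pthread -include cfg.h")
echo "$result"
# prints "-I/usr/include/foo -include cfg.h"
```

The first and fifth substitutions temporarily glue `-include` to its argument so that the argument is not removed as a stray non-option word.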
+386
@@ -0,0 +1,386 @@
# gettext.m4 serial 70 (gettext-0.20)
dnl Copyright (C) 1995-2014, 2016, 2018 Free Software Foundation, Inc.
dnl This file is free software; the Free Software Foundation
dnl gives unlimited permission to copy and/or distribute it,
dnl with or without modifications, as long as this notice is preserved.
dnl
dnl This file can be used in projects which are not available under
dnl the GNU General Public License or the GNU Library General Public
dnl License but which still want to provide support for the GNU gettext
dnl functionality.
dnl Please note that the actual code of the GNU gettext library is covered
dnl by the GNU Library General Public License, and the rest of the GNU
dnl gettext package is covered by the GNU General Public License.
dnl They are *not* in the public domain.
dnl Authors:
dnl Ulrich Drepper <drepper@cygnus.com>, 1995-2000.
dnl Bruno Haible <haible@clisp.cons.org>, 2000-2006, 2008-2010.
dnl Macro to add for using GNU gettext.
dnl Usage: AM_GNU_GETTEXT([INTLSYMBOL], [NEEDSYMBOL], [INTLDIR]).
dnl INTLSYMBOL must be one of 'external', 'use-libtool'.
dnl INTLSYMBOL should be 'external' for packages other than GNU gettext, and
dnl 'use-libtool' for the packages 'gettext-runtime' and 'gettext-tools'.
dnl If INTLSYMBOL is 'use-libtool', then a libtool library
dnl $(top_builddir)/intl/libintl.la will be created (shared and/or static,
dnl depending on --{enable,disable}-{shared,static} and on the presence of
dnl AM-DISABLE-SHARED).
dnl If NEEDSYMBOL is specified and is 'need-ngettext', then GNU gettext
dnl implementations (in libc or libintl) without the ngettext() function
dnl will be ignored. If NEEDSYMBOL is specified and is
dnl 'need-formatstring-macros', then GNU gettext implementations that don't
dnl support the ISO C 99 <inttypes.h> formatstring macros will be ignored.
dnl INTLDIR is used to find the intl libraries. If empty,
dnl the value '$(top_builddir)/intl/' is used.
dnl
dnl The result of the configuration is one of three cases:
dnl 1) GNU gettext, as included in the intl subdirectory, will be compiled
dnl and used.
dnl Catalog format: GNU --> install in $(datadir)
dnl Catalog extension: .mo after installation, .gmo in source tree
dnl 2) GNU gettext has been found in the system's C library.
dnl Catalog format: GNU --> install in $(datadir)
dnl Catalog extension: .mo after installation, .gmo in source tree
dnl 3) No internationalization, always use English msgid.
dnl Catalog format: none
dnl Catalog extension: none
dnl If INTLSYMBOL is 'external', only cases 2 and 3 can occur.
dnl The use of .gmo is historical (it was needed to avoid overwriting the
dnl GNU format catalogs when building on a platform with an X/Open gettext),
dnl but we keep it in order not to force irrelevant filename changes on the
dnl maintainers.
dnl
AC_DEFUN([AM_GNU_GETTEXT],
[
dnl Argument checking.
ifelse([$1], [], , [ifelse([$1], [external], , [ifelse([$1], [use-libtool], ,
[errprint([ERROR: invalid first argument to AM_GNU_GETTEXT
])])])])
ifelse(ifelse([$1], [], [old])[]ifelse([$1], [no-libtool], [old]), [old],
[errprint([ERROR: Use of AM_GNU_GETTEXT without [external] argument is no longer supported.
])])
ifelse([$2], [], , [ifelse([$2], [need-ngettext], , [ifelse([$2], [need-formatstring-macros], ,
[errprint([ERROR: invalid second argument to AM_GNU_GETTEXT
])])])])
define([gt_included_intl],
ifelse([$1], [external], [no], [yes]))
gt_NEEDS_INIT
AM_GNU_GETTEXT_NEED([$2])
AC_REQUIRE([AM_PO_SUBDIRS])dnl
ifelse(gt_included_intl, yes, [
AC_REQUIRE([AM_INTL_SUBDIR])dnl
])
dnl Prerequisites of AC_LIB_LINKFLAGS_BODY.
AC_REQUIRE([AC_LIB_PREPARE_PREFIX])
AC_REQUIRE([AC_LIB_RPATH])
dnl Sometimes libintl requires libiconv, so first search for libiconv.
dnl Ideally we would do this search only after the
dnl if test "$USE_NLS" = "yes"; then
dnl if { eval "gt_val=\$$gt_func_gnugettext_libc"; test "$gt_val" != "yes"; }; then
dnl tests. But if configure.in invokes AM_ICONV after AM_GNU_GETTEXT
dnl the configure script would need to contain the same shell code
dnl again, outside any 'if'. There are two solutions:
dnl - Invoke AM_ICONV_LINKFLAGS_BODY here, outside any 'if'.
dnl - Control the expansions in more detail using AC_PROVIDE_IFELSE.
dnl Since AC_PROVIDE_IFELSE is not documented, we avoid it.
ifelse(gt_included_intl, yes, , [
AC_REQUIRE([AM_ICONV_LINKFLAGS_BODY])
])
dnl Sometimes, on Mac OS X, libintl requires linking with CoreFoundation.
gt_INTL_MACOSX
dnl Set USE_NLS.
AC_REQUIRE([AM_NLS])
ifelse(gt_included_intl, yes, [
BUILD_INCLUDED_LIBINTL=no
USE_INCLUDED_LIBINTL=no
])
LIBINTL=
LTLIBINTL=
POSUB=
dnl Add a version number to the cache macros.
case " $gt_needs " in
*" need-formatstring-macros "*) gt_api_version=3 ;;
*" need-ngettext "*) gt_api_version=2 ;;
*) gt_api_version=1 ;;
esac
gt_func_gnugettext_libc="gt_cv_func_gnugettext${gt_api_version}_libc"
gt_func_gnugettext_libintl="gt_cv_func_gnugettext${gt_api_version}_libintl"
dnl If we use NLS figure out what method
if test "$USE_NLS" = "yes"; then
gt_use_preinstalled_gnugettext=no
ifelse(gt_included_intl, yes, [
AC_MSG_CHECKING([whether included gettext is requested])
AC_ARG_WITH([included-gettext],
[ --with-included-gettext use the GNU gettext library included here],
nls_cv_force_use_gnu_gettext=$withval,
nls_cv_force_use_gnu_gettext=no)
AC_MSG_RESULT([$nls_cv_force_use_gnu_gettext])
nls_cv_use_gnu_gettext="$nls_cv_force_use_gnu_gettext"
if test "$nls_cv_force_use_gnu_gettext" != "yes"; then
])
dnl User does not insist on using GNU NLS library. Figure out what
dnl to use. If GNU gettext is available we use this. Else we have
dnl to fall back to GNU NLS library.
if test $gt_api_version -ge 3; then
gt_revision_test_code='
#ifndef __GNU_GETTEXT_SUPPORTED_REVISION
#define __GNU_GETTEXT_SUPPORTED_REVISION(major) ((major) == 0 ? 0 : -1)
#endif
changequote(,)dnl
typedef int array [2 * (__GNU_GETTEXT_SUPPORTED_REVISION(0) >= 1) - 1];
changequote([,])dnl
'
else
gt_revision_test_code=
fi
if test $gt_api_version -ge 2; then
gt_expression_test_code=' + * ngettext ("", "", 0)'
else
gt_expression_test_code=
fi
AC_CACHE_CHECK([for GNU gettext in libc], [$gt_func_gnugettext_libc],
[AC_LINK_IFELSE(
[AC_LANG_PROGRAM(
[[
#include <libintl.h>
#ifndef __GNU_GETTEXT_SUPPORTED_REVISION
extern int _nl_msg_cat_cntr;
extern int *_nl_domain_bindings;
#define __GNU_GETTEXT_SYMBOL_EXPRESSION (_nl_msg_cat_cntr + *_nl_domain_bindings)
#else
#define __GNU_GETTEXT_SYMBOL_EXPRESSION 0
#endif
$gt_revision_test_code
]],
[[
bindtextdomain ("", "");
return * gettext ("")$gt_expression_test_code + __GNU_GETTEXT_SYMBOL_EXPRESSION
]])],
[eval "$gt_func_gnugettext_libc=yes"],
[eval "$gt_func_gnugettext_libc=no"])])
if { eval "gt_val=\$$gt_func_gnugettext_libc"; test "$gt_val" != "yes"; }; then
dnl Sometimes libintl requires libiconv, so first search for libiconv.
ifelse(gt_included_intl, yes, , [
AM_ICONV_LINK
])
dnl Search for libintl and define LIBINTL, LTLIBINTL and INCINTL
dnl accordingly. Don't use AC_LIB_LINKFLAGS_BODY([intl],[iconv])
dnl because that would add "-liconv" to LIBINTL and LTLIBINTL
dnl even if libiconv doesn't exist.
AC_LIB_LINKFLAGS_BODY([intl])
AC_CACHE_CHECK([for GNU gettext in libintl],
[$gt_func_gnugettext_libintl],
[gt_save_CPPFLAGS="$CPPFLAGS"
CPPFLAGS="$CPPFLAGS $INCINTL"
gt_save_LIBS="$LIBS"
LIBS="$LIBS $LIBINTL"
dnl Now see whether libintl exists and does not depend on libiconv.
AC_LINK_IFELSE(
[AC_LANG_PROGRAM(
[[
#include <libintl.h>
#ifndef __GNU_GETTEXT_SUPPORTED_REVISION
extern int _nl_msg_cat_cntr;
extern
#ifdef __cplusplus
"C"
#endif
const char *_nl_expand_alias (const char *);
#define __GNU_GETTEXT_SYMBOL_EXPRESSION (_nl_msg_cat_cntr + *_nl_expand_alias (""))
#else
#define __GNU_GETTEXT_SYMBOL_EXPRESSION 0
#endif
$gt_revision_test_code
]],
[[
bindtextdomain ("", "");
return * gettext ("")$gt_expression_test_code + __GNU_GETTEXT_SYMBOL_EXPRESSION
]])],
[eval "$gt_func_gnugettext_libintl=yes"],
[eval "$gt_func_gnugettext_libintl=no"])
dnl Now see whether libintl exists and depends on libiconv.
if { eval "gt_val=\$$gt_func_gnugettext_libintl"; test "$gt_val" != yes; } && test -n "$LIBICONV"; then
LIBS="$LIBS $LIBICONV"
AC_LINK_IFELSE(
[AC_LANG_PROGRAM(
[[
#include <libintl.h>
#ifndef __GNU_GETTEXT_SUPPORTED_REVISION
extern int _nl_msg_cat_cntr;
extern
#ifdef __cplusplus
"C"
#endif
const char *_nl_expand_alias (const char *);
#define __GNU_GETTEXT_SYMBOL_EXPRESSION (_nl_msg_cat_cntr + *_nl_expand_alias (""))
#else
#define __GNU_GETTEXT_SYMBOL_EXPRESSION 0
#endif
$gt_revision_test_code
]],
[[
bindtextdomain ("", "");
return * gettext ("")$gt_expression_test_code + __GNU_GETTEXT_SYMBOL_EXPRESSION
]])],
[LIBINTL="$LIBINTL $LIBICONV"
LTLIBINTL="$LTLIBINTL $LTLIBICONV"
eval "$gt_func_gnugettext_libintl=yes"
])
fi
CPPFLAGS="$gt_save_CPPFLAGS"
LIBS="$gt_save_LIBS"])
fi
dnl If an already present or preinstalled GNU gettext() is found,
dnl use it. But if this macro is used in GNU gettext, and GNU
dnl gettext is already preinstalled in libintl, we update this
dnl libintl. (Cf. the install rule in intl/Makefile.in.)
if { eval "gt_val=\$$gt_func_gnugettext_libc"; test "$gt_val" = "yes"; } \
|| { { eval "gt_val=\$$gt_func_gnugettext_libintl"; test "$gt_val" = "yes"; } \
&& test "$PACKAGE" != gettext-runtime \
&& test "$PACKAGE" != gettext-tools; }; then
gt_use_preinstalled_gnugettext=yes
else
dnl Reset the values set by searching for libintl.
LIBINTL=
LTLIBINTL=
INCINTL=
fi
ifelse(gt_included_intl, yes, [
if test "$gt_use_preinstalled_gnugettext" != "yes"; then
dnl GNU gettext is not found in the C library.
dnl Fall back on included GNU gettext library.
nls_cv_use_gnu_gettext=yes
fi
fi
if test "$nls_cv_use_gnu_gettext" = "yes"; then
dnl Mark actions used to generate GNU NLS library.
BUILD_INCLUDED_LIBINTL=yes
USE_INCLUDED_LIBINTL=yes
LIBINTL="ifelse([$3],[],\${top_builddir}/intl,[$3])/libintl.la $LIBICONV $LIBTHREAD"
LTLIBINTL="ifelse([$3],[],\${top_builddir}/intl,[$3])/libintl.la $LTLIBICONV $LTLIBTHREAD"
LIBS=`echo " $LIBS " | sed -e 's/ -lintl / /' -e 's/^ //' -e 's/ $//'`
fi
CATOBJEXT=
if test "$gt_use_preinstalled_gnugettext" = "yes" \
|| test "$nls_cv_use_gnu_gettext" = "yes"; then
dnl Mark actions to use GNU gettext tools.
CATOBJEXT=.gmo
fi
])
if test -n "$INTL_MACOSX_LIBS"; then
if test "$gt_use_preinstalled_gnugettext" = "yes" \
|| test "$nls_cv_use_gnu_gettext" = "yes"; then
dnl Some extra flags are needed during linking.
LIBINTL="$LIBINTL $INTL_MACOSX_LIBS"
LTLIBINTL="$LTLIBINTL $INTL_MACOSX_LIBS"
fi
fi
if test "$gt_use_preinstalled_gnugettext" = "yes" \
|| test "$nls_cv_use_gnu_gettext" = "yes"; then
AC_DEFINE([ENABLE_NLS], [1],
[Define to 1 if translation of program messages to the user's native language
is requested.])
else
USE_NLS=no
fi
fi
AC_MSG_CHECKING([whether to use NLS])
AC_MSG_RESULT([$USE_NLS])
if test "$USE_NLS" = "yes"; then
AC_MSG_CHECKING([where the gettext function comes from])
if test "$gt_use_preinstalled_gnugettext" = "yes"; then
if { eval "gt_val=\$$gt_func_gnugettext_libintl"; test "$gt_val" = "yes"; }; then
gt_source="external libintl"
else
gt_source="libc"
fi
else
gt_source="included intl directory"
fi
AC_MSG_RESULT([$gt_source])
fi
if test "$USE_NLS" = "yes"; then
if test "$gt_use_preinstalled_gnugettext" = "yes"; then
if { eval "gt_val=\$$gt_func_gnugettext_libintl"; test "$gt_val" = "yes"; }; then
AC_MSG_CHECKING([how to link with libintl])
AC_MSG_RESULT([$LIBINTL])
AC_LIB_APPENDTOVAR([CPPFLAGS], [$INCINTL])
fi
dnl For backward compatibility. Some packages may be using this.
AC_DEFINE([HAVE_GETTEXT], [1],
[Define if the GNU gettext() function is already present or preinstalled.])
AC_DEFINE([HAVE_DCGETTEXT], [1],
[Define if the GNU dcgettext() function is already present or preinstalled.])
fi
dnl We need to process the po/ directory.
POSUB=po
fi
ifelse(gt_included_intl, yes, [
dnl In GNU gettext we have to set BUILD_INCLUDED_LIBINTL to 'yes'
dnl because some of the testsuite requires it.
BUILD_INCLUDED_LIBINTL=yes
dnl Make all variables we use known to autoconf.
AC_SUBST([BUILD_INCLUDED_LIBINTL])
AC_SUBST([USE_INCLUDED_LIBINTL])
AC_SUBST([CATOBJEXT])
])
dnl For backward compatibility. Some Makefiles may be using this.
INTLLIBS="$LIBINTL"
AC_SUBST([INTLLIBS])
dnl Make all documented variables known to autoconf.
AC_SUBST([LIBINTL])
AC_SUBST([LTLIBINTL])
AC_SUBST([POSUB])
])
dnl gt_NEEDS_INIT ensures that the gt_needs variable is initialized.
m4_define([gt_NEEDS_INIT],
[
m4_divert_text([DEFAULTS], [gt_needs=])
m4_define([gt_NEEDS_INIT], [])
])
dnl Usage: AM_GNU_GETTEXT_NEED([NEEDSYMBOL])
AC_DEFUN([AM_GNU_GETTEXT_NEED],
[
m4_divert_text([INIT_PREPARE], [gt_needs="$gt_needs $1"])
])
dnl Usage: AM_GNU_GETTEXT_VERSION([gettext-version])
AC_DEFUN([AM_GNU_GETTEXT_VERSION], [])
dnl Usage: AM_GNU_GETTEXT_REQUIRE_VERSION([gettext-version])
AC_DEFUN([AM_GNU_GETTEXT_REQUIRE_VERSION], [])
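AM_GNU_GETTEXT picks a cache-variable version number from the accumulated `gt_needs` list, with `need-formatstring-macros` taking precedence over `need-ngettext`. A minimal sketch of that selection, lifted out of the macro into a shell function; the function name `gt_api_version_for` is hypothetical:

```shell
# Map a space-separated list of NEEDSYMBOL values to the API version number.
gt_api_version_for() {
    case " $1 " in
        *" need-formatstring-macros "*) echo 3 ;;
        *" need-ngettext "*) echo 2 ;;
        *) echo 1 ;;
    esac
}

gt_api_version_for ""                                         # prints "1"
gt_api_version_for "need-ngettext"                            # prints "2"
gt_api_version_for "need-ngettext need-formatstring-macros"   # prints "3"
```

The surrounding spaces in `" $1 "` let each pattern match a whole word, wherever it sits in the list.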
+644
@@ -0,0 +1,644 @@
# host-cpu-c-abi.m4 serial 11
dnl Copyright (C) 2002-2019 Free Software Foundation, Inc.
dnl This file is free software; the Free Software Foundation
dnl gives unlimited permission to copy and/or distribute it,
dnl with or without modifications, as long as this notice is preserved.
dnl From Bruno Haible and Sam Steingold.
dnl Sets the HOST_CPU variable to the canonical name of the CPU.
dnl Sets the HOST_CPU_C_ABI variable to the canonical name of the CPU with its
dnl C language ABI (application binary interface).
dnl Also defines __${HOST_CPU}__ and __${HOST_CPU_C_ABI}__ as C macros in
dnl config.h.
dnl
dnl This canonical name can be used to select a particular assembly language
dnl source file that will interoperate with C code on the given host.
dnl
dnl For example:
dnl * 'i386' and 'sparc' are different canonical names, because code for i386
dnl will not run on SPARC CPUs and vice versa. They have different
dnl instruction sets.
dnl * 'sparc' and 'sparc64' are different canonical names, because code for
dnl 'sparc' and code for 'sparc64' cannot be linked together: 'sparc' code
dnl contains 32-bit instructions, whereas 'sparc64' code contains 64-bit
dnl instructions. A process on a SPARC CPU can be in 32-bit mode or in 64-bit
dnl mode, but not both.
dnl * 'mips' and 'mipsn32' are different canonical names, because they use
dnl different argument passing and return conventions for C functions, even
dnl though the instruction set of 'mips' is a large subset of the
dnl instruction set of 'mipsn32'.
dnl * 'mipsn32' and 'mips64' are different canonical names, because they use
dnl different sizes for the C types like 'int' and 'void *', even though
dnl the instruction sets of 'mipsn32' and 'mips64' are the same.
dnl * The same canonical name is used for different endiannesses. You can
dnl determine the endianness through preprocessor symbols:
dnl - 'arm': test __ARMEL__.
dnl - 'mips', 'mipsn32', 'mips64': test _MIPSEB vs. _MIPSEL.
dnl - 'powerpc64': test _BIG_ENDIAN vs. _LITTLE_ENDIAN.
dnl * The same name 'i386' is used for CPUs of type i386, i486, i586
dnl (Pentium), AMD K7, Pentium II, Pentium IV, etc., because
dnl - Instructions that do not exist on all of these CPUs (cmpxchg,
dnl MMX, SSE, SSE2, 3DNow! etc.) are not frequently used. If your
dnl assembly language source files use such instructions, you will
dnl need to make the distinction.
dnl - Speed of execution of the common instruction set is reasonable across
dnl the entire family of CPUs. If you have assembly language source files
dnl that are optimized for particular CPU types (like GNU gmp has), you
dnl will need to make the distinction.
dnl See <https://en.wikipedia.org/wiki/X86_instruction_listings>.
AC_DEFUN([gl_HOST_CPU_C_ABI],
[
AC_REQUIRE([AC_CANONICAL_HOST])
AC_REQUIRE([gl_C_ASM])
AC_CACHE_CHECK([host CPU and C ABI], [gl_cv_host_cpu_c_abi],
[case "$host_cpu" in
changequote(,)dnl
i[4567]86 )
changequote([,])dnl
gl_cv_host_cpu_c_abi=i386
;;
x86_64 )
# On x86_64 systems, the C compiler may be generating code in one of
# these ABIs:
# - 64-bit instruction set, 64-bit pointers, 64-bit 'long': x86_64.
# - 64-bit instruction set, 64-bit pointers, 32-bit 'long': x86_64
# with native Windows (mingw, MSVC).
# - 64-bit instruction set, 32-bit pointers, 32-bit 'long': x86_64-x32.
# - 32-bit instruction set, 32-bit pointers, 32-bit 'long': i386.
AC_COMPILE_IFELSE(
[AC_LANG_SOURCE(
[[#if (defined __x86_64__ || defined __amd64__ \
|| defined _M_X64 || defined _M_AMD64)
int ok;
#else
error fail
#endif
]])],
[AC_COMPILE_IFELSE(
[AC_LANG_SOURCE(
[[#if defined __ILP32__ || defined _ILP32
int ok;
#else
error fail
#endif
]])],
[gl_cv_host_cpu_c_abi=x86_64-x32],
[gl_cv_host_cpu_c_abi=x86_64])],
[gl_cv_host_cpu_c_abi=i386])
;;
changequote(,)dnl
alphaev[4-8] | alphaev56 | alphapca5[67] | alphaev6[78] )
changequote([,])dnl
gl_cv_host_cpu_c_abi=alpha
;;
arm* | aarch64 )
# Assume arm with EABI.
# On arm64 systems, the C compiler may be generating code in one of
# these ABIs:
# - aarch64 instruction set, 64-bit pointers, 64-bit 'long': arm64.
# - aarch64 instruction set, 32-bit pointers, 32-bit 'long': arm64-ilp32.
# - 32-bit instruction set, 32-bit pointers, 32-bit 'long': arm or armhf.
AC_COMPILE_IFELSE(
[AC_LANG_SOURCE(
[[#ifdef __aarch64__
int ok;
#else
error fail
#endif
]])],
[AC_COMPILE_IFELSE(
[AC_LANG_SOURCE(
[[#if defined __ILP32__ || defined _ILP32
int ok;
#else
error fail
#endif
]])],
[gl_cv_host_cpu_c_abi=arm64-ilp32],
[gl_cv_host_cpu_c_abi=arm64])],
[# Don't distinguish little-endian and big-endian arm, since they
# don't require different machine code for simple operations and
# since the user can distinguish them through the preprocessor
# defines __ARMEL__ vs. __ARMEB__.
# But distinguish arm which passes floating-point arguments and
# return values in integer registers (r0, r1, ...) - this is
# gcc -mfloat-abi=soft or gcc -mfloat-abi=softfp - from arm which
# passes them in float registers (s0, s1, ...) and double registers
# (d0, d1, ...) - this is gcc -mfloat-abi=hard. GCC 4.6 or newer
# sets the preprocessor defines __ARM_PCS (for the first case) and
# __ARM_PCS_VFP (for the second case), but older GCC does not.
echo 'double ddd; void func (double dd) { ddd = dd; }' > conftest.c
# Look for a reference to the register d0 in the .s file.
AC_TRY_COMMAND(${CC-cc} $CFLAGS $CPPFLAGS $gl_c_asm_opt conftest.c) >/dev/null 2>&1
if LC_ALL=C grep 'd0,' conftest.$gl_asmext >/dev/null; then
gl_cv_host_cpu_c_abi=armhf
else
gl_cv_host_cpu_c_abi=arm
fi
rm -f conftest*
])
;;
hppa1.0 | hppa1.1 | hppa2.0* | hppa64 )
# On hppa, the C compiler may be generating 32-bit code or 64-bit
# code. In the latter case, it defines _LP64 and __LP64__.
AC_COMPILE_IFELSE(
[AC_LANG_SOURCE(
[[#ifdef __LP64__
int ok;
#else
error fail
#endif
]])],
[gl_cv_host_cpu_c_abi=hppa64],
[gl_cv_host_cpu_c_abi=hppa])
;;
ia64* )
# On ia64 on HP-UX, the C compiler may be generating 64-bit code or
# 32-bit code. In the latter case, it defines _ILP32.
AC_COMPILE_IFELSE(
[AC_LANG_SOURCE(
[[#ifdef _ILP32
int ok;
#else
error fail
#endif
]])],
[gl_cv_host_cpu_c_abi=ia64-ilp32],
[gl_cv_host_cpu_c_abi=ia64])
;;
mips* )
# We should also check for (_MIPS_SZPTR == 64), but gcc keeps this
# at 32.
AC_COMPILE_IFELSE(
[AC_LANG_SOURCE(
[[#if defined _MIPS_SZLONG && (_MIPS_SZLONG == 64)
int ok;
#else
error fail
#endif
]])],
[gl_cv_host_cpu_c_abi=mips64],
[# In the n32 ABI, _ABIN32 is defined, _ABIO32 is not defined (but
# may later get defined by <sgidefs.h>), and _MIPS_SIM == _ABIN32.
# In the 32 ABI, _ABIO32 is defined, _ABIN32 is not defined (but
# may later get defined by <sgidefs.h>), and _MIPS_SIM == _ABIO32.
AC_COMPILE_IFELSE(
[AC_LANG_SOURCE(
[[#if (_MIPS_SIM == _ABIN32)
int ok;
#else
error fail
#endif
]])],
[gl_cv_host_cpu_c_abi=mipsn32],
[gl_cv_host_cpu_c_abi=mips])])
;;
powerpc* )
# Different ABIs are in use on AIX vs. Mac OS X vs. Linux,*BSD.
# No need to distinguish them here; the caller may distinguish
# them based on the OS.
# On powerpc64 systems, the C compiler may still be generating
# 32-bit code. And on powerpc-ibm-aix systems, the C compiler may
# be generating 64-bit code.
AC_COMPILE_IFELSE(
[AC_LANG_SOURCE(
[[#if defined __powerpc64__ || defined _ARCH_PPC64
int ok;
#else
error fail
#endif
]])],
[# On powerpc64, there are two ABIs on Linux: The AIX compatible
# one and the ELFv2 one. The latter defines _CALL_ELF=2.
AC_COMPILE_IFELSE(
[AC_LANG_SOURCE(
[[#if defined _CALL_ELF && _CALL_ELF == 2
int ok;
#else
error fail
#endif
]])],
[gl_cv_host_cpu_c_abi=powerpc64-elfv2],
[gl_cv_host_cpu_c_abi=powerpc64])
],
[gl_cv_host_cpu_c_abi=powerpc])
;;
rs6000 )
gl_cv_host_cpu_c_abi=powerpc
;;
riscv32 | riscv64 )
# There are 2 architectures (with variants): rv32* and rv64*.
AC_COMPILE_IFELSE(
[AC_LANG_SOURCE(
[[#if __riscv_xlen == 64
int ok;
#else
error fail
#endif
]])],
[cpu=riscv64],
[cpu=riscv32])
# There are 6 ABIs: ilp32, ilp32f, ilp32d, lp64, lp64f, lp64d.
# Size of 'long' and 'void *':
AC_COMPILE_IFELSE(
[AC_LANG_SOURCE(
[[#if defined __LP64__
int ok;
#else
error fail
#endif
]])],
[main_abi=lp64],
[main_abi=ilp32])
# Float ABIs:
# __riscv_float_abi_double:
# 'float' and 'double' are passed in floating-point registers.
# __riscv_float_abi_single:
# 'float' values are passed in floating-point registers.
# __riscv_float_abi_soft:
# No values are passed in floating-point registers.
AC_COMPILE_IFELSE(
[AC_LANG_SOURCE(
[[#if defined __riscv_float_abi_double
int ok;
#else
error fail
#endif
]])],
[float_abi=d],
[AC_COMPILE_IFELSE(
[AC_LANG_SOURCE(
[[#if defined __riscv_float_abi_single
int ok;
#else
error fail
#endif
]])],
[float_abi=f],
[float_abi=''])
])
gl_cv_host_cpu_c_abi="${cpu}-${main_abi}${float_abi}"
;;
s390* )
# On s390x, the C compiler may be generating 64-bit (= s390x) code
# or 31-bit (= s390) code.
AC_COMPILE_IFELSE(
[AC_LANG_SOURCE(
[[#if defined __LP64__ || defined __s390x__
int ok;
#else
error fail
#endif
]])],
[gl_cv_host_cpu_c_abi=s390x],
[gl_cv_host_cpu_c_abi=s390])
;;
sparc | sparc64 )
# UltraSPARCs running Linux have `uname -m` = "sparc64", but the
# C compiler still generates 32-bit code.
AC_COMPILE_IFELSE(
[AC_LANG_SOURCE(
[[#if defined __sparcv9 || defined __arch64__
int ok;
#else
error fail
#endif
]])],
[gl_cv_host_cpu_c_abi=sparc64],
[gl_cv_host_cpu_c_abi=sparc])
;;
*)
gl_cv_host_cpu_c_abi="$host_cpu"
;;
esac
])
dnl In most cases, $HOST_CPU and $HOST_CPU_C_ABI are the same.
HOST_CPU=`echo "$gl_cv_host_cpu_c_abi" | sed -e 's/-.*//'`
HOST_CPU_C_ABI="$gl_cv_host_cpu_c_abi"
AC_SUBST([HOST_CPU])
AC_SUBST([HOST_CPU_C_ABI])
# This was
# AC_DEFINE_UNQUOTED([__${HOST_CPU}__])
# AC_DEFINE_UNQUOTED([__${HOST_CPU_C_ABI}__])
# earlier, but KAI C++ 3.2d doesn't like this.
sed -e 's/-/_/g' >> confdefs.h <<EOF
#ifndef __${HOST_CPU}__
#define __${HOST_CPU}__ 1
#endif
#ifndef __${HOST_CPU_C_ABI}__
#define __${HOST_CPU_C_ABI}__ 1
#endif
EOF
AH_TOP([/* CPU and C ABI indicator */
#ifndef __i386__
#undef __i386__
#endif
#ifndef __x86_64_x32__
#undef __x86_64_x32__
#endif
#ifndef __x86_64__
#undef __x86_64__
#endif
#ifndef __alpha__
#undef __alpha__
#endif
#ifndef __arm__
#undef __arm__
#endif
#ifndef __armhf__
#undef __armhf__
#endif
#ifndef __arm64_ilp32__
#undef __arm64_ilp32__
#endif
#ifndef __arm64__
#undef __arm64__
#endif
#ifndef __hppa__
#undef __hppa__
#endif
#ifndef __hppa64__
#undef __hppa64__
#endif
#ifndef __ia64_ilp32__
#undef __ia64_ilp32__
#endif
#ifndef __ia64__
#undef __ia64__
#endif
#ifndef __m68k__
#undef __m68k__
#endif
#ifndef __mips__
#undef __mips__
#endif
#ifndef __mipsn32__
#undef __mipsn32__
#endif
#ifndef __mips64__
#undef __mips64__
#endif
#ifndef __powerpc__
#undef __powerpc__
#endif
#ifndef __powerpc64__
#undef __powerpc64__
#endif
#ifndef __powerpc64_elfv2__
#undef __powerpc64_elfv2__
#endif
#ifndef __riscv32__
#undef __riscv32__
#endif
#ifndef __riscv64__
#undef __riscv64__
#endif
#ifndef __riscv32_ilp32__
#undef __riscv32_ilp32__
#endif
#ifndef __riscv32_ilp32f__
#undef __riscv32_ilp32f__
#endif
#ifndef __riscv32_ilp32d__
#undef __riscv32_ilp32d__
#endif
#ifndef __riscv64_ilp32__
#undef __riscv64_ilp32__
#endif
#ifndef __riscv64_ilp32f__
#undef __riscv64_ilp32f__
#endif
#ifndef __riscv64_ilp32d__
#undef __riscv64_ilp32d__
#endif
#ifndef __riscv64_lp64__
#undef __riscv64_lp64__
#endif
#ifndef __riscv64_lp64f__
#undef __riscv64_lp64f__
#endif
#ifndef __riscv64_lp64d__
#undef __riscv64_lp64d__
#endif
#ifndef __s390__
#undef __s390__
#endif
#ifndef __s390x__
#undef __s390x__
#endif
#ifndef __sh__
#undef __sh__
#endif
#ifndef __sparc__
#undef __sparc__
#endif
#ifndef __sparc64__
#undef __sparc64__
#endif
])
])
dnl Sets the HOST_CPU_C_ABI_32BIT variable to 'yes' if the C language ABI
dnl (application binary interface) is a 32-bit one, or to 'no' otherwise.
dnl This is a simplified variant of gl_HOST_CPU_C_ABI.
AC_DEFUN([gl_HOST_CPU_C_ABI_32BIT],
[
AC_REQUIRE([AC_CANONICAL_HOST])
AC_CACHE_CHECK([32-bit host C ABI], [gl_cv_host_cpu_c_abi_32bit],
[if test -n "$gl_cv_host_cpu_c_abi"; then
case "$gl_cv_host_cpu_c_abi" in
i386 | x86_64-x32 | arm | armhf | arm64-ilp32 | hppa | ia64-ilp32 | mips | mipsn32 | powerpc | riscv*-ilp32* | s390 | sparc)
gl_cv_host_cpu_c_abi_32bit=yes ;;
*)
gl_cv_host_cpu_c_abi_32bit=no ;;
esac
else
case "$host_cpu" in
changequote(,)dnl
i[4567]86 )
changequote([,])dnl
gl_cv_host_cpu_c_abi_32bit=yes
;;
x86_64 )
# On x86_64 systems, the C compiler may be generating code in one of
# these ABIs:
# - 64-bit instruction set, 64-bit pointers, 64-bit 'long': x86_64.
# - 64-bit instruction set, 64-bit pointers, 32-bit 'long': x86_64
# with native Windows (mingw, MSVC).
# - 64-bit instruction set, 32-bit pointers, 32-bit 'long': x86_64-x32.
# - 32-bit instruction set, 32-bit pointers, 32-bit 'long': i386.
AC_COMPILE_IFELSE(
[AC_LANG_SOURCE(
[[#if (defined __x86_64__ || defined __amd64__ \
|| defined _M_X64 || defined _M_AMD64) \
&& !(defined __ILP32__ || defined _ILP32)
int ok;
#else
error fail
#endif
]])],
[gl_cv_host_cpu_c_abi_32bit=no],
[gl_cv_host_cpu_c_abi_32bit=yes])
;;
arm* | aarch64 )
# Assume arm with EABI.
# On arm64 systems, the C compiler may be generating code in one of
# these ABIs:
# - aarch64 instruction set, 64-bit pointers, 64-bit 'long': arm64.
# - aarch64 instruction set, 32-bit pointers, 32-bit 'long': arm64-ilp32.
# - 32-bit instruction set, 32-bit pointers, 32-bit 'long': arm or armhf.
AC_COMPILE_IFELSE(
[AC_LANG_SOURCE(
[[#if defined __aarch64__ && !(defined __ILP32__ || defined _ILP32)
int ok;
#else
error fail
#endif
]])],
[gl_cv_host_cpu_c_abi_32bit=no],
[gl_cv_host_cpu_c_abi_32bit=yes])
;;
hppa1.0 | hppa1.1 | hppa2.0* | hppa64 )
# On hppa, the C compiler may be generating 32-bit code or 64-bit
# code. In the latter case, it defines _LP64 and __LP64__.
AC_COMPILE_IFELSE(
[AC_LANG_SOURCE(
[[#ifdef __LP64__
int ok;
#else
error fail
#endif
]])],
[gl_cv_host_cpu_c_abi_32bit=no],
[gl_cv_host_cpu_c_abi_32bit=yes])
;;
ia64* )
# On ia64 on HP-UX, the C compiler may be generating 64-bit code or
# 32-bit code. In the latter case, it defines _ILP32.
AC_COMPILE_IFELSE(
[AC_LANG_SOURCE(
[[#ifdef _ILP32
int ok;
#else
error fail
#endif
]])],
[gl_cv_host_cpu_c_abi_32bit=yes],
[gl_cv_host_cpu_c_abi_32bit=no])
;;
mips* )
# We should also check for (_MIPS_SZPTR == 64), but gcc keeps this
# at 32.
AC_COMPILE_IFELSE(
[AC_LANG_SOURCE(
[[#if defined _MIPS_SZLONG && (_MIPS_SZLONG == 64)
int ok;
#else
error fail
#endif
]])],
[gl_cv_host_cpu_c_abi_32bit=no],
[gl_cv_host_cpu_c_abi_32bit=yes])
;;
powerpc* )
# Different ABIs are in use on AIX vs. Mac OS X vs. Linux,*BSD.
# No need to distinguish them here; the caller may distinguish
# them based on the OS.
# On powerpc64 systems, the C compiler may still be generating
# 32-bit code. And on powerpc-ibm-aix systems, the C compiler may
# be generating 64-bit code.
AC_COMPILE_IFELSE(
[AC_LANG_SOURCE(
[[#if defined __powerpc64__ || defined _ARCH_PPC64
int ok;
#else
error fail
#endif
]])],
[gl_cv_host_cpu_c_abi_32bit=no],
[gl_cv_host_cpu_c_abi_32bit=yes])
;;
rs6000 )
gl_cv_host_cpu_c_abi_32bit=yes
;;
riscv32 | riscv64 )
# There are 6 ABIs: ilp32, ilp32f, ilp32d, lp64, lp64f, lp64d.
# Size of 'long' and 'void *':
AC_COMPILE_IFELSE(
[AC_LANG_SOURCE(
[[#if defined __LP64__
int ok;
#else
error fail
#endif
]])],
[gl_cv_host_cpu_c_abi_32bit=no],
[gl_cv_host_cpu_c_abi_32bit=yes])
;;
s390* )
# On s390x, the C compiler may be generating 64-bit (= s390x) code
# or 31-bit (= s390) code.
AC_COMPILE_IFELSE(
[AC_LANG_SOURCE(
[[#if defined __LP64__ || defined __s390x__
int ok;
#else
error fail
#endif
]])],
[gl_cv_host_cpu_c_abi_32bit=no],
[gl_cv_host_cpu_c_abi_32bit=yes])
;;
sparc | sparc64 )
# UltraSPARCs running Linux have `uname -m` = "sparc64", but the
# C compiler still generates 32-bit code.
AC_COMPILE_IFELSE(
[AC_LANG_SOURCE(
[[#if defined __sparcv9 || defined __arch64__
int ok;
#else
error fail
#endif
]])],
[gl_cv_host_cpu_c_abi_32bit=no],
[gl_cv_host_cpu_c_abi_32bit=yes])
;;
*)
gl_cv_host_cpu_c_abi_32bit=no
;;
esac
fi
])
HOST_CPU_C_ABI_32BIT="$gl_cv_host_cpu_c_abi_32bit"
])
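
The probes in gl_HOST_CPU_C_ABI_32BIT are all compile-time tests of predefined macros. A minimal C sketch of the same idea for a few of the architectures follows. The helper names `abi_32bit_by_macros` and `abi_self_check` are hypothetical and not part of gnulib, and the `__i386__`, `__arm__` and `__riscv` tests are assumptions added for illustration; the macro does not use them and goes by `$host_cpu` instead.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of gnulib): returns 1 if the predefined
 * macros indicate a 32-bit C ABI, 0 if they indicate a 64-bit one, and
 * -1 if the architecture is not covered by this sketch.
 */
static int
abi_32bit_by_macros(void)
{
#if defined __x86_64__ || defined __amd64__ \
    || defined _M_X64 || defined _M_AMD64 || defined __aarch64__
	/* 64-bit instruction set: 32-bit ABI only for x32 / arm64-ilp32. */
# if defined __ILP32__ || defined _ILP32
	return 1;
# else
	return 0;
# endif
#elif defined __riscv
	/* ilp32* ABIs leave __LP64__ undefined; lp64* ABIs define it. */
# if defined __LP64__
	return 0;
# else
	return 1;
# endif
#elif defined __i386__ || defined __arm__
	return 1;
#else
	return -1;
#endif
}

/* Cross-check the macro-based answer against the actual pointer size. */
static void
abi_self_check(void)
{
	int by_macros = abi_32bit_by_macros();

	if (by_macros >= 0)
		assert(by_macros == (sizeof (void *) == 4));
}
```

Calling `abi_self_check()` early in a test program is one way to confirm that the macro tests and the compiler's actual pointer width agree on a given toolchain.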
+288
@@ -0,0 +1,288 @@
# iconv.m4 serial 21
dnl Copyright (C) 2000-2002, 2007-2014, 2016-2019 Free Software Foundation,
dnl Inc.
dnl This file is free software; the Free Software Foundation
dnl gives unlimited permission to copy and/or distribute it,
dnl with or without modifications, as long as this notice is preserved.
dnl From Bruno Haible.
AC_DEFUN([AM_ICONV_LINKFLAGS_BODY],
[
dnl Prerequisites of AC_LIB_LINKFLAGS_BODY.
AC_REQUIRE([AC_LIB_PREPARE_PREFIX])
AC_REQUIRE([AC_LIB_RPATH])
dnl Search for libiconv and define LIBICONV, LTLIBICONV and INCICONV
dnl accordingly.
AC_LIB_LINKFLAGS_BODY([iconv])
])
AC_DEFUN([AM_ICONV_LINK],
[
dnl Some systems have iconv in libc, some have it in libiconv (OSF/1 and
dnl those with the standalone portable GNU libiconv installed).
AC_REQUIRE([AC_CANONICAL_HOST]) dnl for cross-compiles
dnl Search for libiconv and define LIBICONV, LTLIBICONV and INCICONV
dnl accordingly.
AC_REQUIRE([AM_ICONV_LINKFLAGS_BODY])
dnl Add $INCICONV to CPPFLAGS before performing the following checks,
dnl because if the user has installed libiconv and not disabled its use
dnl via --without-libiconv-prefix, he wants to use it. The first
dnl AC_LINK_IFELSE will then fail, the second AC_LINK_IFELSE will succeed.
am_save_CPPFLAGS="$CPPFLAGS"
AC_LIB_APPENDTOVAR([CPPFLAGS], [$INCICONV])
AC_CACHE_CHECK([for iconv], [am_cv_func_iconv], [
am_cv_func_iconv="no, consider installing GNU libiconv"
am_cv_lib_iconv=no
AC_LINK_IFELSE(
[AC_LANG_PROGRAM(
[[
#include <stdlib.h>
#include <iconv.h>
]],
[[iconv_t cd = iconv_open("","");
iconv(cd,NULL,NULL,NULL,NULL);
iconv_close(cd);]])],
[am_cv_func_iconv=yes])
if test "$am_cv_func_iconv" != yes; then
am_save_LIBS="$LIBS"
LIBS="$LIBS $LIBICONV"
AC_LINK_IFELSE(
[AC_LANG_PROGRAM(
[[
#include <stdlib.h>
#include <iconv.h>
]],
[[iconv_t cd = iconv_open("","");
iconv(cd,NULL,NULL,NULL,NULL);
iconv_close(cd);]])],
[am_cv_lib_iconv=yes]
[am_cv_func_iconv=yes])
LIBS="$am_save_LIBS"
fi
])
if test "$am_cv_func_iconv" = yes; then
AC_CACHE_CHECK([for working iconv], [am_cv_func_iconv_works], [
dnl This tests against bugs in AIX 5.1, AIX 6.1..7.1, HP-UX 11.11,
dnl Solaris 10.
am_save_LIBS="$LIBS"
if test $am_cv_lib_iconv = yes; then
LIBS="$LIBS $LIBICONV"
fi
am_cv_func_iconv_works=no
for ac_iconv_const in '' 'const'; do
AC_RUN_IFELSE(
[AC_LANG_PROGRAM(
[[
#include <iconv.h>
#include <string.h>
#ifndef ICONV_CONST
# define ICONV_CONST $ac_iconv_const
#endif
]],
[[int result = 0;
/* Test against AIX 5.1 bug: Failures are not distinguishable from successful
returns. */
{
iconv_t cd_utf8_to_88591 = iconv_open ("ISO8859-1", "UTF-8");
if (cd_utf8_to_88591 != (iconv_t)(-1))
{
static ICONV_CONST char input[] = "\342\202\254"; /* EURO SIGN */
char buf[10];
ICONV_CONST char *inptr = input;
size_t inbytesleft = strlen (input);
char *outptr = buf;
size_t outbytesleft = sizeof (buf);
size_t res = iconv (cd_utf8_to_88591,
&inptr, &inbytesleft,
&outptr, &outbytesleft);
if (res == 0)
result |= 1;
iconv_close (cd_utf8_to_88591);
}
}
/* Test against Solaris 10 bug: Failures are not distinguishable from
successful returns. */
{
iconv_t cd_ascii_to_88591 = iconv_open ("ISO8859-1", "646");
if (cd_ascii_to_88591 != (iconv_t)(-1))
{
static ICONV_CONST char input[] = "\263";
char buf[10];
ICONV_CONST char *inptr = input;
size_t inbytesleft = strlen (input);
char *outptr = buf;
size_t outbytesleft = sizeof (buf);
size_t res = iconv (cd_ascii_to_88591,
&inptr, &inbytesleft,
&outptr, &outbytesleft);
if (res == 0)
result |= 2;
iconv_close (cd_ascii_to_88591);
}
}
/* Test against AIX 6.1..7.1 bug: Buffer overrun. */
{
iconv_t cd_88591_to_utf8 = iconv_open ("UTF-8", "ISO-8859-1");
if (cd_88591_to_utf8 != (iconv_t)(-1))
{
static ICONV_CONST char input[] = "\304";
static char buf[2] = { (char)0xDE, (char)0xAD };
ICONV_CONST char *inptr = input;
size_t inbytesleft = 1;
char *outptr = buf;
size_t outbytesleft = 1;
size_t res = iconv (cd_88591_to_utf8,
&inptr, &inbytesleft,
&outptr, &outbytesleft);
if (res != (size_t)(-1) || outptr - buf > 1 || buf[1] != (char)0xAD)
result |= 4;
iconv_close (cd_88591_to_utf8);
}
}
#if 0 /* This bug could be worked around by the caller. */
/* Test against HP-UX 11.11 bug: Positive return value instead of 0. */
{
iconv_t cd_88591_to_utf8 = iconv_open ("utf8", "iso88591");
if (cd_88591_to_utf8 != (iconv_t)(-1))
{
static ICONV_CONST char input[] = "\304rger mit b\366sen B\374bchen ohne Augenma\337";
char buf[50];
ICONV_CONST char *inptr = input;
size_t inbytesleft = strlen (input);
char *outptr = buf;
size_t outbytesleft = sizeof (buf);
size_t res = iconv (cd_88591_to_utf8,
&inptr, &inbytesleft,
&outptr, &outbytesleft);
if ((int)res > 0)
result |= 8;
iconv_close (cd_88591_to_utf8);
}
}
#endif
/* Test against HP-UX 11.11 bug: No converter from EUC-JP to UTF-8 is
provided. */
{
/* Try standardized names. */
iconv_t cd1 = iconv_open ("UTF-8", "EUC-JP");
/* Try IRIX, OSF/1 names. */
iconv_t cd2 = iconv_open ("UTF-8", "eucJP");
/* Try AIX names. */
iconv_t cd3 = iconv_open ("UTF-8", "IBM-eucJP");
/* Try HP-UX names. */
iconv_t cd4 = iconv_open ("utf8", "eucJP");
if (cd1 == (iconv_t)(-1) && cd2 == (iconv_t)(-1)
&& cd3 == (iconv_t)(-1) && cd4 == (iconv_t)(-1))
result |= 16;
if (cd1 != (iconv_t)(-1))
iconv_close (cd1);
if (cd2 != (iconv_t)(-1))
iconv_close (cd2);
if (cd3 != (iconv_t)(-1))
iconv_close (cd3);
if (cd4 != (iconv_t)(-1))
iconv_close (cd4);
}
return result;
]])],
[am_cv_func_iconv_works=yes], ,
[case "$host_os" in
aix* | hpux*) am_cv_func_iconv_works="guessing no" ;;
*) am_cv_func_iconv_works="guessing yes" ;;
esac])
test "$am_cv_func_iconv_works" = no || break
done
LIBS="$am_save_LIBS"
])
case "$am_cv_func_iconv_works" in
*no) am_func_iconv=no am_cv_lib_iconv=no ;;
*) am_func_iconv=yes ;;
esac
else
am_func_iconv=no am_cv_lib_iconv=no
fi
if test "$am_func_iconv" = yes; then
AC_DEFINE([HAVE_ICONV], [1],
[Define if you have the iconv() function and it works.])
fi
if test "$am_cv_lib_iconv" = yes; then
AC_MSG_CHECKING([how to link with libiconv])
AC_MSG_RESULT([$LIBICONV])
else
dnl If $LIBICONV didn't lead to a usable library, we don't need $INCICONV
dnl either.
CPPFLAGS="$am_save_CPPFLAGS"
LIBICONV=
LTLIBICONV=
fi
AC_SUBST([LIBICONV])
AC_SUBST([LTLIBICONV])
])
dnl Define AM_ICONV using AC_DEFUN_ONCE for Autoconf >= 2.64, in order to
dnl avoid warnings like
dnl "warning: AC_REQUIRE: `AM_ICONV' was expanded before it was required".
dnl This is tricky because of the way 'aclocal' is implemented:
dnl - It requires defining an auxiliary macro whose name ends in AC_DEFUN.
dnl Otherwise aclocal's initial scan pass would miss the macro definition.
dnl - It requires a line break inside the AC_DEFUN_ONCE and AC_DEFUN expansions.
dnl Otherwise aclocal would emit many "Use of uninitialized value $1"
dnl warnings.
m4_define([gl_iconv_AC_DEFUN],
m4_version_prereq([2.64],
[[AC_DEFUN_ONCE(
[$1], [$2])]],
[m4_ifdef([gl_00GNULIB],
[[AC_DEFUN_ONCE(
[$1], [$2])]],
[[AC_DEFUN(
[$1], [$2])]])]))
gl_iconv_AC_DEFUN([AM_ICONV],
[
AM_ICONV_LINK
if test "$am_cv_func_iconv" = yes; then
AC_MSG_CHECKING([for iconv declaration])
AC_CACHE_VAL([am_cv_proto_iconv], [
AC_COMPILE_IFELSE(
[AC_LANG_PROGRAM(
[[
#include <stdlib.h>
#include <iconv.h>
extern
#ifdef __cplusplus
"C"
#endif
#if defined(__STDC__) || defined(_MSC_VER) || defined(__cplusplus)
size_t iconv (iconv_t cd, char * *inbuf, size_t *inbytesleft, char * *outbuf, size_t *outbytesleft);
#else
size_t iconv();
#endif
]],
[[]])],
[am_cv_proto_iconv_arg1=""],
[am_cv_proto_iconv_arg1="const"])
am_cv_proto_iconv="extern size_t iconv (iconv_t cd, $am_cv_proto_iconv_arg1 char * *inbuf, size_t *inbytesleft, char * *outbuf, size_t *outbytesleft);"])
am_cv_proto_iconv=`echo "[$]am_cv_proto_iconv" | tr -s ' ' | sed -e 's/( /(/'`
AC_MSG_RESULT([
$am_cv_proto_iconv])
else
dnl When compiling GNU libiconv on a system that does not have iconv yet,
dnl pick the POSIX compliant declaration without 'const'.
am_cv_proto_iconv_arg1=""
fi
AC_DEFINE_UNQUOTED([ICONV_CONST], [$am_cv_proto_iconv_arg1],
[Define as const if the declaration of iconv() needs const.])
dnl Also substitute ICONV_CONST in the gnulib generated <iconv.h>.
m4_ifdef([gl_ICONV_H_DEFAULTS],
[AC_REQUIRE([gl_ICONV_H_DEFAULTS])
if test -n "$am_cv_proto_iconv_arg1"; then
ICONV_CONST="const"
fi
])
])
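
AM_ICONV defines ICONV_CONST so that callers can declare the input pointer with the qualifier the system's iconv() prototype expects. The following is a minimal sketch of a caller, assuming iconv() is available from libc or libiconv and that both encoding names are supported. The helper `latin1_to_utf8` is hypothetical, and the fallback `#define` stands in for what config.h would normally provide.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* config.h would normally define this; default to the POSIX prototype. */
#ifndef ICONV_CONST
# define ICONV_CONST
#endif

/*
 * Hypothetical helper: convert a NUL-terminated ISO-8859-1 string to UTF-8.
 * Returns the number of bytes written (excluding the terminating NUL),
 * or -1 on failure, including when the output buffer is too small.
 */
static long
latin1_to_utf8(const char *in, char *out, size_t outsize)
{
	iconv_t cd;
	ICONV_CONST char *inptr = (ICONV_CONST char *) in;
	size_t inleft;
	char *outptr = out;
	size_t outleft;
	size_t res;

	assert(in != NULL && out != NULL);
	if (outsize == 0)
		return -1;
	inleft = strlen(in);
	outleft = outsize - 1;	/* reserve room for the NUL */

	cd = iconv_open("UTF-8", "ISO-8859-1");
	if (cd == (iconv_t)(-1))
		return -1;

	res = iconv(cd, &inptr, &inleft, &outptr, &outleft);
	iconv_close(cd);
	if (res == (size_t)(-1))
		return -1;

	*outptr = '\0';
	return (long)(outptr - out);
}
```

Declaring `inptr` through ICONV_CONST keeps the `&inptr` argument type-correct for both prototype variants, which is the reason the macro exists.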
+72
@@ -0,0 +1,72 @@
# intlmacosx.m4 serial 6 (gettext-0.20)
dnl Copyright (C) 2004-2014, 2016, 2019 Free Software Foundation, Inc.
dnl This file is free software; the Free Software Foundation
dnl gives unlimited permission to copy and/or distribute it,
dnl with or without modifications, as long as this notice is preserved.
dnl
dnl This file can be used in projects which are not available under
dnl the GNU General Public License or the GNU Library General Public
dnl License but which still want to provide support for the GNU gettext
dnl functionality.
dnl Please note that the actual code of the GNU gettext library is covered
dnl by the GNU Library General Public License, and the rest of the GNU
dnl gettext package is covered by the GNU General Public License.
dnl They are *not* in the public domain.
dnl Checks for special options needed on Mac OS X.
dnl Defines INTL_MACOSX_LIBS.
AC_DEFUN([gt_INTL_MACOSX],
[
dnl Check for API introduced in Mac OS X 10.4.
AC_CACHE_CHECK([for CFPreferencesCopyAppValue],
[gt_cv_func_CFPreferencesCopyAppValue],
[gt_save_LIBS="$LIBS"
LIBS="$LIBS -Wl,-framework -Wl,CoreFoundation"
AC_LINK_IFELSE(
[AC_LANG_PROGRAM(
[[#include <CoreFoundation/CFPreferences.h>]],
[[CFPreferencesCopyAppValue(NULL, NULL)]])],
[gt_cv_func_CFPreferencesCopyAppValue=yes],
[gt_cv_func_CFPreferencesCopyAppValue=no])
LIBS="$gt_save_LIBS"])
if test $gt_cv_func_CFPreferencesCopyAppValue = yes; then
AC_DEFINE([HAVE_CFPREFERENCESCOPYAPPVALUE], [1],
[Define to 1 if you have the Mac OS X function CFPreferencesCopyAppValue in the CoreFoundation framework.])
fi
dnl Check for API introduced in Mac OS X 10.5.
AC_CACHE_CHECK([for CFLocaleCopyCurrent], [gt_cv_func_CFLocaleCopyCurrent],
[gt_save_LIBS="$LIBS"
LIBS="$LIBS -Wl,-framework -Wl,CoreFoundation"
AC_LINK_IFELSE(
[AC_LANG_PROGRAM(
[[#include <CoreFoundation/CFLocale.h>]],
[[CFLocaleCopyCurrent();]])],
[gt_cv_func_CFLocaleCopyCurrent=yes],
[gt_cv_func_CFLocaleCopyCurrent=no])
LIBS="$gt_save_LIBS"])
if test $gt_cv_func_CFLocaleCopyCurrent = yes; then
AC_DEFINE([HAVE_CFLOCALECOPYCURRENT], [1],
[Define to 1 if you have the Mac OS X function CFLocaleCopyCurrent in the CoreFoundation framework.])
fi
AC_CACHE_CHECK([for CFLocaleCopyPreferredLanguages], [gt_cv_func_CFLocaleCopyPreferredLanguages],
[gt_save_LIBS="$LIBS"
LIBS="$LIBS -Wl,-framework -Wl,CoreFoundation"
AC_LINK_IFELSE(
[AC_LANG_PROGRAM(
[[#include <CoreFoundation/CFLocale.h>]],
[[CFLocaleCopyPreferredLanguages();]])],
[gt_cv_func_CFLocaleCopyPreferredLanguages=yes],
[gt_cv_func_CFLocaleCopyPreferredLanguages=no])
LIBS="$gt_save_LIBS"])
if test $gt_cv_func_CFLocaleCopyPreferredLanguages = yes; then
AC_DEFINE([HAVE_CFLOCALECOPYPREFERREDLANGUAGES], [1],
[Define to 1 if you have the Mac OS X function CFLocaleCopyPreferredLanguages in the CoreFoundation framework.])
fi
INTL_MACOSX_LIBS=
if test $gt_cv_func_CFPreferencesCopyAppValue = yes \
|| test $gt_cv_func_CFLocaleCopyCurrent = yes \
|| test $gt_cv_func_CFLocaleCopyPreferredLanguages = yes; then
INTL_MACOSX_LIBS="-Wl,-framework -Wl,CoreFoundation"
fi
AC_SUBST([INTL_MACOSX_LIBS])
])
+65
@@ -0,0 +1,65 @@
dnl #
dnl # 2.6.32 - 4.x API,
dnl # blk_queue_discard()
dnl #
AC_DEFUN([ZFS_AC_KERNEL_BLK_QUEUE_DISCARD], [
AC_MSG_CHECKING([whether blk_queue_discard() is available])
ZFS_LINUX_TRY_COMPILE([
#include <linux/blkdev.h>
],[
struct request_queue *q __attribute__ ((unused)) = NULL;
int value __attribute__ ((unused));
value = blk_queue_discard(q);
],[
AC_MSG_RESULT(yes)
AC_DEFINE(HAVE_BLK_QUEUE_DISCARD, 1,
[blk_queue_discard() is available])
],[
AC_MSG_RESULT(no)
])
])
dnl #
dnl # 4.8 - 4.x API,
dnl # blk_queue_secure_erase()
dnl #
dnl # 2.6.36 - 4.7 API,
dnl # blk_queue_secdiscard()
dnl #
dnl # 2.6.x - 2.6.35 API,
dnl # Unsupported by kernel
dnl #
AC_DEFUN([ZFS_AC_KERNEL_BLK_QUEUE_SECURE_ERASE], [
AC_MSG_CHECKING([whether blk_queue_secure_erase() is available])
ZFS_LINUX_TRY_COMPILE([
#include <linux/blkdev.h>
],[
struct request_queue *q __attribute__ ((unused)) = NULL;
int value __attribute__ ((unused));
value = blk_queue_secure_erase(q);
],[
AC_MSG_RESULT(yes)
AC_DEFINE(HAVE_BLK_QUEUE_SECURE_ERASE, 1,
[blk_queue_secure_erase() is available])
],[
AC_MSG_RESULT(no)
AC_MSG_CHECKING([whether blk_queue_secdiscard() is available])
ZFS_LINUX_TRY_COMPILE([
#include <linux/blkdev.h>
],[
struct request_queue *q __attribute__ ((unused)) = NULL;
int value __attribute__ ((unused));
value = blk_queue_secdiscard(q);
],[
AC_MSG_RESULT(yes)
AC_DEFINE(HAVE_BLK_QUEUE_SECDISCARD, 1,
[blk_queue_secdiscard() is available])
],[
AC_MSG_RESULT(no)
])
])
])
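
The two checks above produce at most one of HAVE_BLK_QUEUE_SECURE_ERASE or HAVE_BLK_QUEUE_SECDISCARD. A consumer could fold them into a single wrapper along the following lines. This is a sketch only: the name `compat_queue_secure_erase` is hypothetical, and the forward declaration of `struct request_queue` is there so the fragment also compiles outside a kernel build, where neither macro is defined.

```c
#include <stddef.h>

/* Opaque here; a kernel build gets the definition from <linux/blkdev.h>. */
struct request_queue;

/*
 * Hypothetical wrapper: non-zero if the queue supports secure erase,
 * using whichever kernel interface the configure checks detected.
 */
static int
compat_queue_secure_erase(struct request_queue *q)
{
#if defined(HAVE_BLK_QUEUE_SECURE_ERASE)
	return (blk_queue_secure_erase(q));	/* 4.8 and later */
#elif defined(HAVE_BLK_QUEUE_SECDISCARD)
	return (blk_queue_secdiscard(q));	/* 2.6.36 - 4.7 */
#else
	(void) q;
	return (0);				/* unsupported by the kernel */
#endif
}
```

Keeping the `#if` chain in one wrapper means callers never need to test the HAVE_* macros themselves.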
-19
@@ -1,19 +0,0 @@
dnl #
dnl # 2.6.37 API change
dnl # Added 3rd argument for the active holder, previously this was
dnl # hardcoded to NULL.
dnl #
AC_DEFUN([ZFS_AC_KERNEL_3ARG_BLKDEV_GET], [
AC_MSG_CHECKING([whether blkdev_get() wants 3 args])
ZFS_LINUX_TRY_COMPILE([
#include <linux/fs.h>
],[
struct block_device *bdev = NULL;
(void) blkdev_get(bdev, 0, NULL);
],[
AC_MSG_RESULT(yes)
AC_DEFINE(HAVE_3ARG_BLKDEV_GET, 1, [blkdev_get() wants 3 args])
],[
AC_MSG_RESULT(no)
])
])
+21
@@ -0,0 +1,21 @@
dnl #
dnl # 4.1 API, exported blkdev_reread_part() symbol, backported to the
dnl # 3.10.0 CentOS 7.x enterprise kernels.
dnl #
AC_DEFUN([ZFS_AC_KERNEL_BLKDEV_REREAD_PART], [
AC_MSG_CHECKING([whether blkdev_reread_part() is available])
ZFS_LINUX_TRY_COMPILE([
#include <linux/fs.h>
], [
struct block_device *bdev = NULL;
int error;
error = blkdev_reread_part(bdev);
], [
AC_MSG_RESULT(yes)
AC_DEFINE(HAVE_BLKDEV_REREAD_PART, 1,
[blkdev_reread_part() is available])
], [
AC_MSG_RESULT(no)
])
])
+18
@@ -0,0 +1,18 @@
dnl #
dnl # 2.6.33 API change,
dnl # Removed .ctl_name from struct ctl_table.
dnl #
AC_DEFUN([ZFS_AC_KERNEL_CTL_NAME], [
AC_MSG_CHECKING([whether struct ctl_table has ctl_name])
ZFS_LINUX_TRY_COMPILE([
#include <linux/sysctl.h>
],[
struct ctl_table ctl __attribute__ ((unused));
ctl.ctl_name = 0;
],[
AC_MSG_RESULT(yes)
AC_DEFINE(HAVE_CTL_NAME, 1, [struct ctl_table has ctl_name])
],[
AC_MSG_RESULT(no)
])
])
+19
@@ -0,0 +1,19 @@
dnl #
dnl # PaX Linux 2.6.38 - 3.x API
dnl #
AC_DEFUN([ZFS_AC_PAX_KERNEL_FILE_FALLOCATE], [
AC_MSG_CHECKING([whether fops->fallocate() exists])
ZFS_LINUX_TRY_COMPILE([
#include <linux/fs.h>
],[
long (*fallocate) (struct file *, int, loff_t, loff_t) = NULL;
struct file_operations_no_const fops __attribute__ ((unused)) = {
.fallocate = fallocate,
};
],[
AC_MSG_RESULT(yes)
AC_DEFINE(HAVE_FILE_FALLOCATE, 1, [fops->fallocate() exists])
],[
AC_MSG_RESULT(no)
])
])
+34 -7
@@ -12,25 +12,52 @@ dnl # Pre-4.2: Use kernel_fpu_{begin,end}()
dnl # HAVE_KERNEL_FPU & KERNEL_EXPORTS_X86_FPU
dnl #
AC_DEFUN([ZFS_AC_KERNEL_FPU], [
AC_MSG_CHECKING([which kernel_fpu function to use])
AC_MSG_CHECKING([which kernel_fpu header to use])
ZFS_LINUX_TRY_COMPILE([
#include <linux/module.h>
#include <asm/fpu/api.h>
],[
],[
AC_DEFINE(HAVE_KERNEL_FPU_API_HEADER, 1,
[kernel has asm/fpu/api.h])
AC_MSG_RESULT(asm/fpu/api.h)
],[
AC_MSG_RESULT(i387.h & xcr.h)
])
AC_MSG_CHECKING([which kernel_fpu function to use])
ZFS_LINUX_TRY_COMPILE_SYMBOL([
#include <linux/module.h>
#ifdef HAVE_KERNEL_FPU_API_HEADER
#include <asm/fpu/api.h>
#else
#include <asm/i387.h>
#include <asm/xcr.h>
#endif
MODULE_LICENSE("$ZFS_META_LICENSE");
],[
kernel_fpu_begin();
kernel_fpu_end();
],[
], [kernel_fpu_begin], [arch/x86/kernel/fpu/core.c], [
AC_MSG_RESULT(kernel_fpu_*)
AC_DEFINE(HAVE_KERNEL_FPU, 1, [kernel has kernel_fpu_* functions])
AC_DEFINE(KERNEL_EXPORTS_X86_FPU, 1, [kernel exports FPU functions])
AC_DEFINE(HAVE_KERNEL_FPU, 1,
[kernel has kernel_fpu_* functions])
AC_DEFINE(KERNEL_EXPORTS_X86_FPU, 1,
[kernel exports FPU functions])
],[
ZFS_LINUX_TRY_COMPILE([
#include <linux/kernel.h>
ZFS_LINUX_TRY_COMPILE_SYMBOL([
#include <linux/module.h>
#ifdef HAVE_KERNEL_FPU_API_HEADER
#include <asm/fpu/api.h>
#else
#include <asm/i387.h>
#include <asm/xcr.h>
#endif
MODULE_LICENSE("$ZFS_META_LICENSE");
],[
__kernel_fpu_begin();
__kernel_fpu_end();
],[
], [__kernel_fpu_begin], [arch/x86/kernel/fpu/core.c arch/x86/kernel/i387.c], [
AC_MSG_RESULT(__kernel_fpu_*)
AC_DEFINE(HAVE_UNDERSCORE_KERNEL_FPU, 1, [kernel has __kernel_fpu_* functions])
AC_DEFINE(KERNEL_EXPORTS_X86_FPU, 1, [kernel exports FPU functions])
-17
@@ -1,17 +0,0 @@
dnl #
dnl # 2.6.34 API change
dnl # Verify the get_gendisk() symbol is available.
dnl #
AC_DEFUN([ZFS_AC_KERNEL_GET_GENDISK],
[AC_MSG_CHECKING([whether get_gendisk() is available])
ZFS_LINUX_TRY_COMPILE_SYMBOL([
#include <linux/genhd.h>
], [
get_gendisk(0, NULL);
], [get_gendisk], [block/genhd.c], [
AC_MSG_RESULT(yes)
AC_DEFINE(HAVE_GET_GENDISK, 1, [get_gendisk() is available])
], [
AC_MSG_RESULT(no)
])
])
+21
@@ -0,0 +1,21 @@
dnl #
dnl # 4.9 API change
dnl # group_info changed from 2d array via ->blocks to 1d array via ->gid
dnl #
AC_DEFUN([ZFS_AC_KERNEL_GROUP_INFO_GID], [
AC_MSG_CHECKING([whether group_info->gid exists])
tmp_flags="$EXTRA_KCFLAGS"
EXTRA_KCFLAGS="-Werror"
ZFS_LINUX_TRY_COMPILE([
#include <linux/cred.h>
],[
struct group_info *gi = groups_alloc(1);
gi->gid[0] = KGIDT_INIT(0);
],[
AC_MSG_RESULT(yes)
AC_DEFINE(HAVE_GROUP_INFO_GID, 1, [group_info->gid exists])
],[
AC_MSG_RESULT(no)
])
EXTRA_KCFLAGS="$tmp_flags"
])
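
The layout change this check detects can be modelled in user space. The sketch below is an illustration only, not kernel code: `model_gid_t`, `MODEL_PER_BLOCK` and both accessors are hypothetical names, and the block size of 4 is chosen for readability.

```c
#include <stddef.h>

/* Hypothetical block size for this model only. */
#define MODEL_PER_BLOCK 4

typedef unsigned int model_gid_t;

/* Older layout: entry i is found at blocks[i / N][i % N]. */
static model_gid_t
model_gid_2d(model_gid_t *const *blocks, size_t i)
{
	return blocks[i / MODEL_PER_BLOCK][i % MODEL_PER_BLOCK];
}

/* 4.9 and later layout: entry i is found at gid[i]. */
static model_gid_t
model_gid_1d(const model_gid_t *gid, size_t i)
{
	return gid[i];
}
```

Code that must build against both kernel generations would select one accessor or the other based on HAVE_GROUP_INFO_GID.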
+23
@@ -0,0 +1,23 @@
dnl #
dnl # 4.7 API change
dnl # i_mutex is changed to i_rwsem. Instead of directly using
dnl # i_mutex/i_rwsem, we should use inode_lock() and inode_lock_shared()
dnl # We test inode_lock_shared because inode_lock is introduced earlier.
dnl #
AC_DEFUN([ZFS_AC_KERNEL_INODE_LOCK], [
AC_MSG_CHECKING([whether inode_lock_shared() exists])
tmp_flags="$EXTRA_KCFLAGS"
EXTRA_KCFLAGS="-Werror"
ZFS_LINUX_TRY_COMPILE([
#include <linux/fs.h>
],[
struct inode *inode = NULL;
inode_lock_shared(inode);
],[
AC_MSG_RESULT(yes)
AC_DEFINE(HAVE_INODE_LOCK_SHARED, 1, [yes])
],[
AC_MSG_RESULT(no)
])
EXTRA_KCFLAGS="$tmp_flags"
])

Some files were not shown because too many files have changed in this diff