If a vnode is released asynchronously through areleasef(), it is
possible for the user process to reuse the file descriptor before
areleasef is called. When this happens, getf() will return a stale
reference, any operations in the kernel on that file descriptor will
fail (as it is closed) and the operations meant for that fd will
never occur from userspace's perspective.
We correct this by detecting this condition in getf(), doing a putf
on the old file handle, updating the file descriptor and proceeding
as if everything was fine. When the areleasef() is done, it will
harmlessly decrement the reference counter on the Illumos file handle.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #492
Currently, the SET_ERROR tracepoint triggers regardless of whether there
is an error or not. On Illumos, SET_ERROR only triggers on an actual
error, which avoids irrelevant noise. Linux 2.6.38 added support for
conditional tracepoints, so we modify SET_ERROR to use them when they
are available, matching the Illumos behavior.
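The intended semantics can be sketched in user space as follows. This is
only an illustration, not the kernel tracepoint machinery: trace_set_error()
and trace_hits are hypothetical stand-ins for the tracepoint, and
set_error_impl() is an assumed helper name.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for the tracepoint: counts how often it fires. */
static int trace_hits;

static void
trace_set_error(const char *file, int line, int err)
{
	(void) file;
	(void) line;
	(void) err;
	trace_hits++;
}

/* Fire the trace hook only for a nonzero error, then yield the value. */
static int
set_error_impl(const char *file, int line, int err)
{
	if (err != 0)
		trace_set_error(file, line, err);
	return (err);
}

#define	SET_ERROR(err)	set_error_impl(__FILE__, __LINE__, (err))
```

With this shape, SET_ERROR(0) passes the value through silently, while
SET_ERROR(EIO) records one trace event and still evaluates to EIO.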
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4043
When dumping a block on a little endian system the data must be
byte swapped to display correctly. Example incorrect output:
$ echo 0123456789abcdef > aaa
$ zdb -eR pp 3:1ee00:200
3:1ee00:200
0 1 2 3 4 5 6 7 8 9 a b c d e f 0123456789abcdef
000000: 3736353433323130 6665646362613938 0123456789abcdef
000010: 000000000000000a 0000000000000000 ................
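The first word in the output above can be reproduced with a small sketch.
It assumes nothing about zdb itself: load_le64() and bswap64() are
hypothetical helpers that model how a little endian host loads eight bytes
and what a byte swap does to the result.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the byte order of a 64-bit word. */
static uint64_t
bswap64(uint64_t v)
{
	v = ((v & 0x00ff00ff00ff00ffULL) << 8) |
	    ((v >> 8) & 0x00ff00ff00ff00ffULL);
	v = ((v & 0x0000ffff0000ffffULL) << 16) |
	    ((v >> 16) & 0x0000ffff0000ffffULL);
	return ((v << 32) | (v >> 32));
}

/* Interpret 8 bytes the way a little endian host would load them. */
static uint64_t
load_le64(const unsigned char *p)
{
	uint64_t v = 0;

	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return (v);
}
```

For the bytes "01234567", load_le64() yields 0x3736353433323130, the
reversed word shown in the dump. Swapping it gives 0x3031323334353637,
which prints in on-disk byte order.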
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4020
The current zdb calling behaviour is really fragile, and is guaranteed to
segfault if ztest is not installed in either /sbin or /usr/sbin. With this
patch, ztest will try to locate zdb in the following order:
1. Use the environment variable ZDB_PATH if provided.
2. If ztest resides in the build tree, guess the in-tree zdb path.
3. Otherwise just pass zdb to popen and let it search PATH.
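The lookup order above can be sketched as a small helper. This is a
simplified illustration, not the ztest code: choose_zdb() and its
in_tree_zdb parameter are hypothetical, and treating an empty ZDB_PATH as
unset is an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Pick the zdb command to run. `in_tree_zdb` is a hypothetical parameter:
 * the guessed build-tree path, or NULL when ztest is not run from a build
 * tree.
 */
static const char *
choose_zdb(const char *in_tree_zdb)
{
	const char *env = getenv("ZDB_PATH");

	/* Assumption: an empty ZDB_PATH is treated the same as unset. */
	if (env != NULL && env[0] != '\0')
		return (env);		/* 1. explicit override */
	if (in_tree_zdb != NULL)
		return (in_tree_zdb);	/* 2. in-tree guess */
	return ("zdb");			/* 3. let popen() search PATH */
}
```

Returning the bare name in the last case leaves the search to the shell
that popen() starts, so no fixed install directory is assumed.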
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3126
This was originally in fe0ed8f910, but was somehow changed and
no longer works. It causes the following error:
modprobe: ERROR: ../libkmod/libkmod.c:506 lookup_builtin_file() could not open builtin file '/lib/modules/4.2.0-18-generic/modules.builtin.bin'
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4027
This was originally in e80cd06b8e, but was somehow changed and
no longer works. It causes the following error:
modprobe: ERROR: ../libkmod/libkmod.c:506 lookup_builtin_file() could not open builtin file '/lib/modules/4.2.0-18-generic/modules.builtin.bin'
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #501
Useful when looking for the info on ZFS/SPL related memory consumption.
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #460
Replace DKIOCTRIM with DKIOCFREE and add additional support required
for Nexenta's TRIM support.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #469
This is needed for architectures that do not have a builtin prefetchw().
Signed-off-by: Dimitri John Ledkov <xnox@ubuntu.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #502
Adding VPATH support, commit 47a4a6f, required that a `src`
and `obj` line be added to the top of the Makefiles. They
must be removed from the Makefiles when builtin.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue zfsonlinux/spl#481
Issue zfsonlinux/spl#498
Adding VPATH support, commit 37d7cd9, required that a `src`
and `obj` line be added to the top of the Makefiles. They
must be removed from the Makefiles when builtin.
The code which adds the `spl/` directory to the top level
Makefile was failing due to the addition of the `certs/` path.
The search pattern has been adjusted to be more tolerant.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #481
Issue #498
The xattr_handler->{list,get,set} callbacks were changed to take a
xattr_handler, and the handler_flags argument was removed; the flags
should now be accessed through handler->flags.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4021
As part of block polling support in Linux 4.4, make_request_fn should
return a cookie value of type blk_qc_t. For now, we make zvol_request
always return BLK_QC_T_NONE until we assess whether and how we want
to support block polling.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4021
On 32-bit systems, the calculation of zfs_dirty_data_max from physmem
overflows, making it smaller than zfs_dirty_data_sync. As a result, txgs are
delayed while nothing is written to disk, and write speed is horrendous.
On a 32-bit VM with 4G of RAM, a simple dd reached only ~7MB/s before this
patch. With it, throughput is on par with a 64-bit VM.
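The overflow can be illustrated with a reduced model. This is a sketch
under assumptions, not the ARC code: the 4096-byte page size, the 10 percent
figure used in the usage note, and both function names are illustrative, and
widening to 64 bits is shown only as one way to avoid the wrap.

```c
#include <assert.h>
#include <stdint.h>

#define	PAGE_SIZE_BYTES	4096u	/* assumed page size */

/* 32-bit arithmetic: pages * page size wraps once RAM reaches 4 GiB. */
static uint32_t
dirty_max_32(uint32_t physmem_pages, uint32_t percent)
{
	return (physmem_pages * PAGE_SIZE_BYTES * percent / 100);
}

/* Widen to 64 bits before multiplying so the product cannot wrap. */
static uint64_t
dirty_max_64(uint32_t physmem_pages, uint32_t percent)
{
	return ((uint64_t)physmem_pages * PAGE_SIZE_BYTES * percent / 100);
}
```

With 1048576 pages (4 GiB) and 10 percent, the 32-bit version wraps to 0,
while the 64-bit version yields 429496729 bytes.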
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3973
On 32-bit systems, zio_buf_cache is limited to 1M; entries for larger sizes
are all NULL, so we need to avoid reaping them.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3973
When concurrent threads access the snapdir, one will succeed in the user
helper mount while the others get EBUSY. However, the original code treats
those EBUSY threads as successes and goes on to call zfsctl_snapshot_add,
which causes repeated avl_add calls and thus a panic.
Also, if the snapshot is already mounted somewhere else, a thread accessing
the snapdir will also get EBUSY from the user helper mount. This causes
strange behavior: follow_down_one fails, and then follow_up jumps up to the
mountpoint of the filesystem and confuses the hell out of the VFS.
The patch fixes both behaviors by returning 0 immediately for the EBUSY
threads. Note that this has a side effect in the second case: the VFS will
retry several times before returning ELOOP.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4018
The original P2ROUNDUP and P2ROUNDUP_TYPED macros contain -x which
triggers PaX's integer overflow detection for unsigned integers.
Replace the macros with an equivalent version that does not trigger
the overflow.
Axioms:
A. (-(x)) === (~((x) - 1)) === (~(x) + 1) under two's complement.
B. ~(x & y) === ((~(x)) | (~(y))) under De Morgan's law.
C. ~(~x) === x under the law of excluded middle.
Proof:
0. (-(-(x) & -(align))) original
1. (~(-(x) & -(align)) + 1) by A
2. (((~(-(x))) | (~(-(align)))) + 1) by B
3. (((~(~((x) - 1))) | (~(~((align) - 1)))) + 1) by A
4. (((((x) - 1)) | (((align) - 1))) + 1) by C
Q.E.D.
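The identity can also be spot-checked by brute force. The sketch below
compares the original form with the form from step 4 of the proof; the macro
names P2ROUNDUP_OLD and P2ROUNDUP_NEW and the helper function are
illustrative and are not the names used in the headers.

```c
#include <assert.h>
#include <stdint.h>

/* Original form: relies on negating unsigned values. */
#define	P2ROUNDUP_OLD(x, align)	(-(-(x) & -(align)))
/* Rewritten form from step 4 of the proof: no unary minus. */
#define	P2ROUNDUP_NEW(x, align)	((((x) - 1) | ((align) - 1)) + 1)

/*
 * Return 1 if both forms agree for every x in [0, limit] and every
 * power-of-two alignment up to max_align, otherwise 0.
 */
static int
p2roundup_forms_agree(uint64_t limit, uint64_t max_align)
{
	for (uint64_t align = 1; align <= max_align; align <<= 1)
		for (uint64_t x = 0; x <= limit; x++)
			if (P2ROUNDUP_OLD(x, align) != P2ROUNDUP_NEW(x, align))
				return (0);
	return (1);
}
```

For example, rounding 5 up to a multiple of 8 gives (4 | 7) + 1 = 8 with the
new form, the same value the original form produces.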
Signed-off-by: Jason Zaman <jason@perfinion.com>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3949
Allow the following environment variables to control the build
behavior of the zimport.sh script. This can be useful when you
want a debug build or require specific build options. The
default values are:
CONFIG_OPTIONS=""
MAKE_OPTIONS="-s -j$(nproc)"
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Because errors during module load are so rare it went unnoticed that
it was possible that a positive errno was returned. This would result
in the module being loaded, nothing being initialized, and a system
panic shortly thereafter. This is what was causing the hard failures
in the automated testing.
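Linux expects a module init function to return 0 or a negative errno. A
minimal sketch of normalizing an error code of either sign is shown below;
module_init_rc() is a hypothetical helper used only for illustration and is
not taken from the patch.

```c
#include <assert.h>
#include <errno.h>

/*
 * Convert an errno-style code of either sign into the 0 / negative form
 * that a Linux module init function is expected to return.
 */
static int
module_init_rc(int error)
{
	return (error > 0 ? -error : error);
}
```

A positive ENOMEM becomes -ENOMEM, while 0 and already negative values pass
through unchanged.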
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Limit the maximum object size to 1/128 of total system memory for
the kmem cache tests. Large values can result in out of memory errors
for systems with less than 512M of memory. Additionally, use the
known number of objects per-slab for calculating the number of
objects to use for a test.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When decreasing the maximum ARC size preserve the 3/4 default
ratio for the arc_meta_limit. Otherwise, the arc_meta_limit
may be set the same as arc_max.
Signed-off-by: AndCycle <andcycle@andcycle.idv.tw>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4001
The original P2ROUNDUP and P2ROUNDUP_TYPED macros contain -x which
triggers PaX's integer overflow detection for unsigned integers.
Replace the macros with an equivalent version that does not trigger
the overflow.
Axioms:
A. (-(x)) === (~((x) - 1)) === (~(x) + 1) under two's complement.
B. ~(x & y) === ((~(x)) | (~(y))) under De Morgan's law.
C. ~(~x) === x under the law of excluded middle.
Proof:
0. (-(-(x) & -(align))) original
1. (~(-(x) & -(align)) + 1) by A
2. (((~(-(x))) | (~(-(align)))) + 1) by B
3. (((~(~((x) - 1))) | (~(~((align) - 1)))) + 1) by A
4. (((((x) - 1)) | (((align) - 1))) + 1) by C
Q.E.D.
Signed-off-by: Jason Zaman <jason@perfinion.com>
Reviewed-by: Chris Dunlop <chris@onthe.net.au>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#2505
Closes #488
Currently taskq_dispatch() will spawn a new thread only if the caller
is also a member of the taskq. However, under this condition it can still
deadlock when a task on tq1 is waiting for another thread that is trying to
dispatch a task on tq1. So this patch removes the check.
For example, when you do:
zfs send pp/fs0@001 | zfs recv pp/fs0_copy
This will easily deadlock before this patch.
Also, move the seq_task check from taskq_thread_spawn() to taskq_thread()
because it's not used by the caller from taskq_dispatch().
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #496
The Linux slab allocator automatically frees empty slabs when the number of
partial slabs is over min_partial, so we don't need to explicitly shrink it.
In fact, calling kmem_cache_shrink from the shrinker causes heavy contention
on kmem_cache_node->list_lock, to the point that it might cause __slab_free
to livelock (see zfsonlinux/zfs#3936).
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/zfs#3936
Closes #487
The TEST file is provided as a hint to the automated test
infrastructure. It controls which regression tests are run and how they
are run. This file along with any lines in the commit messages
which start with TEST_* are sourced by the test scripts and can
be used to override the default values. For complete details see:
https://github.com/zfsonlinux/zfs-buildbot/
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
As of gcc 5.1.1 20150618 (Red Hat 5.1.1-4) the -Werror=maybe-uninitialized
check detects that 'snapname' in recv_incremental_replication() may not be
initialized. Explicitly initialize the variable to resolve the warning.
libzfs_sendrecv.c: In function ‘recv_incremental_replication’:
libzfs_sendrecv.c:2019:2: error: ‘snapname’ may be used uninitialized in
(void) snprintf(buf, sizeof (buf), "%s@%s", fsname, snapname);
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When sa_bulk_lookup() fails, unlock_new_inode() will spit out a WARNING. It
will also recursively deadlock on ZFS_OBJ_HOLD_ENTER in zfs_zinactive().
Since we never call insert_inode_locked in the fail path, I_NEW is never set
and the inode is never hashed, so unlock_new_inode() can be safely removed.
We set z_sa_hdl to NULL in the fail path so that the iput path will stop at
zfs_inactive() without entering zfs_zinactive(). This way we can avoid the
deadlock and prevent a double sa_handle_destroy().
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3899
Currently, vdev_disk_physio_completion will try to wake up a waiter without
first checking that one exists. This creates a race window in which complete
is called after dr is freed.
We add dr_wait to dio_request to indicate the existence of a waiter. Also,
remove dr_rw since no one is using it, and reorder dr_ref to make the struct
more compact on 64-bit.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3917
Issue #3880
This change modifies the import service to use the default cache file
to perform a verbatim import of pools at boot. This fixes code that
searched all devices and imported all visible pools.
Using the cache file is in keeping with the way ZFS has always worked and
with how Solaris, Illumos, FreeBSD, and systemd perform imports, and it is
how the behavior is documented in the man page (zpool(1M,8)):
All pools in this cache are automatically imported when the
system boots.
Importantly, the cache contains information needed for importing
multipath devices, and helps control which pools get imported in more
dynamic environments like SANs, which may have thousands of visible
and constantly changing pools, which the ZFS_POOL_EXCEPTIONS variable
is not equipped to handle. Verbatim imports prevent rogue pools from
being automatically imported and mounted where they shouldn't be.
The change also stops the service from exporting pools at shutdown.
Exporting pools is only meant to be performed explicitly by the
administrator of the system.
The old behavior of searching and importing all visible pools is
preserved and can be switched on by heeding the warning and toggling
the ZPOOL_IMPORT_ALL_VISIBLE variable in /etc/default/zfs.
Signed-off-by: James Lee <jlee@thestaticvoid.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3777
Closes #3526
Avoid buffer overrun on all-zero bpobj subobjects by using signed
array index. Also fix the type cast on the printf() argument.
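The failure mode can be sketched with a backwards scan for the last nonzero
entry. This is a generic illustration with hypothetical names, not the
actual code touched by the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the index of the last nonzero entry, or -1 if every entry is
 * zero. A signed index lets the loop terminate at -1; an unsigned index
 * would wrap to a huge value and read far outside the array.
 */
static int64_t
last_nonzero(const uint64_t *objs, size_t count)
{
	int64_t i;

	for (i = (int64_t)count - 1; i >= 0; i--)
		if (objs[i] != 0)
			break;
	return (i);
}
```

An all-zero array is exactly the case that distinguishes the two index
types: the signed version stops and reports -1 instead of overrunning the
buffer.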
Signed-off-by: Tim Chase <tim@onlight.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3905
EDOM may occur if a user tries to set `recordsize` too large without
using "zfs set". This can be demonstrated with:
> zpool create testpool -O recordsize=32M /dev/...
Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3911
Allocate a kmem cache magazine for every possible CPU which might
be added to the system. This ensures that when one of these CPUs
is enabled it can be safely used immediately.
For many systems the number of online CPUs is identical to the
number of present CPUs so this does not imply an increased memory
footprint. In fact, dynamically allocating the array of magazine
pointers instead of using the worst case NR_CPUS can end up
decreasing our memory footprint.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #482
Strictly enforce keeping 'arc_c >= arc_c_min'. The ASSERTs are
left in place to catch this in a debug build, but logic has been
added to handle it gracefully in a production build.
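One way to express such a guard is to clamp the new target instead of only
asserting. The helper below is a hypothetical sketch of that idea with
assumed names and is not the logic from the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compute a new ARC target after trying to free `to_free` bytes, never
 * letting the result drop below arc_c_min. The comparison is arranged so
 * that the unsigned subtractions cannot wrap.
 */
static uint64_t
arc_shrink_target(uint64_t arc_c, uint64_t to_free, uint64_t arc_c_min)
{
	if (arc_c > arc_c_min && arc_c - arc_c_min > to_free)
		return (arc_c - to_free);
	return (arc_c_min);
}
```

A debug build can still assert that the result is at least arc_c_min; the
clamp only guarantees the invariant when assertions are compiled out.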
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3904
For consistency all systemd unit files and init scripts now share
the same names. This prevents an issue where the zed is started
twice on systems where both the systemd and sysv infrastructures are
installed concurrently.
For backward compatibility a 'zed' alias has been added. This
allows the user to interact with the service using either the
name 'zed' or 'zfs-zed'.
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3837
If CONFIG_RWSEM_SPIN_ON_OWNER is defined, rw_semaphore will have an owner
field, so we don't need our own.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #473
The spin lock around rw_owner is completely unnecessary. The reason is that it
is only modified in the down_write context. If you race against another thread
modifying it, that means that you aren't holding the rwlock, so taking the
spin lock doesn't eliminate the race.
Also, we check only rw_owner in RW_WRITE_HELD, because calling
spl_rwsem_is_locked is unnecessary and might need to take a spin lock.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #473
Modern versions of dkms clean up the build directory after installing.
This resulted in 'dkms uninstall' never running because the check
added by commit 866c162 which verifies the existence of the
zfs.release build product would never be true.
This patch resolves the issue by updating the conditional to check
in the explicitly installed zfs_config.h file for the version.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3862
Modern versions of dkms clean up the build directory after installing.
This resulted in 'dkms uninstall' never running because the check
added by commit 4cdcdbf which verifies the existence of the
spl.release build product would never be true.
This patch resolves the issue by updating the conditional to check
in the explicitly installed spl_config.h file for the version.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #478
We should never block when holding a spin lock, but zfs_inode_update can
block inside its own spin lock critical section:
zfs_inode_update -> dmu_object_size_from_db -> zrl_add -> mutex_enter
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3858
All users of zv_lock were removed by 37f9dac, but we forgot to remove
it. Let's remove it as cleanup.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3858
* Fix regression - "OVERLAY_MOUNTS" should have been "DO_OVERLAY_MOUNTS".
* Fix update-rc.d commands in postinst. Thanks to subzero79@GitHub.
* Fix: make sure a filesystem exists before trying to mount in mount_fs().
* Fix local variable usage.
* Fix to read_mtab():
  * Strip control characters (space - \040) from /proc/mounts GLOBALLY,
    not just first occurrence.
  * Don't replace unprintable characters ([/-. ]) for use in the variable
    name with underscore. No need, just remove them altogether.
* Add check_boolean() to check if a user config option is
  set ('yes', 'Yes', 'YES' or any combination thereof) OR '1'.
  Anything else is considered 'unset'.
* Add a ZFS_POOL_IMPORT to the default config.
  * This is a semicolon-separated list of pools to import ONLY.
  * This is intended for systems which have _a lot_ of pools (from
    a SAN for example) and it would be too many to put in the
    ZFS_POOL_EXCEPTIONS variable.
* Add a config option "ZPOOL_IMPORT_OPTS" for adding additional options
  to "zpool import".
* Add documentation and the chance of overriding the ZPOOL_CACHE
  variable in the config file.
* Remove "sort" from find_pools() and setup_snapshot_booting().
  Sometimes not available, and not really necessary.
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #3816
When doing uioskip to skip an iovec to the very end, the current loop
condition will incorrectly check past the end of the iovec array. We fix
this by checking uio_iovcnt first.
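The ordering issue can be shown with a reduced user-space model. This is a
sketch under assumptions: struct uio_model and uio_model_skip() are
hypothetical and keep only the fields the loop needs.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical, reduced model of a uio: only the fields the loop needs. */
struct uio_model {
	const struct iovec *iov;	/* current segment */
	int iovcnt;			/* segments left, including current */
	size_t skip;			/* bytes consumed in current segment */
};

/*
 * Advance by n bytes. iovcnt is tested before iov->iov_len is read, so
 * consuming the final segment never dereferences past the array.
 */
static void
uio_model_skip(struct uio_model *u, size_t n)
{
	u->skip += n;
	while (u->iovcnt > 0 && u->skip >= u->iov->iov_len) {
		u->skip -= u->iov->iov_len;
		u->iov++;
		u->iovcnt--;
	}
}
```

If the two conditions were swapped, skipping exactly to the end would read
iov_len from the element one past the last segment before noticing that no
segments remain.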
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3806
Closes #3850
Support grsecurity/PaX kernel configurations where
CONFIG_PAX_USERCOPY_SLABS is enabled. When this kernel option
is enabled slabs which are used to copy between user and kernel
space must be created with SLAB_USERCOPY.
Stock Linux kernels do not have a SLAB_USERCOPY definition so
this causes no change in behavior for non-PAX-enabled kernels.
Verified-by: Wuffleton <null@wuffleton.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2977
Issue #3796
Userspace can trigger an assertion by passing a zero-length segment
when assertions are enabled:
[27961.614792] VERIFY3(skip < iov->iov_len) failed (0 < 0)
[27961.614795] PANIC at zfs_uio.c:187:uio_prefaultpages()
[27961.614805] Call Trace:
[27961.614811] dump_stack+0x45/0x57
[27961.614830] spl_dumpstack+0x44/0x50 [spl]
[27961.614834] spl_panic+0xbb/0x100 [spl]
[27961.614908] uio_prefaultpages+0x134/0x140 [zcommon]
[27961.614930] zfs_write+0x1fd/0xe80 [zfs]
[27961.615014] zpl_write_common_iovec+0x7f/0x110 [zfs]
[27961.615035] zpl_iter_write+0xa0/0xd0 [zfs]
[27961.615037] do_iter_readv_writev+0x59/0x80
[27961.615063] do_readv_writev+0x11b/0x260
[27961.615098] vfs_writev+0x39/0x50
[27961.615100] SyS_writev+0x4a/0xe0
[27961.615103] system_call_fastpath+0x16/0x6e
The solution is to delete the assertion. This could potentially
occur in uiomove as well, which contains analogous assertions
that appear similarly unnecessary, so we remove those as well.
Reported-by: Jonathan Vasquez <jvasquez1011@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #3792