Authored by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
The refresh_config() calls into the kernel with ZFS_IOC_POOL_TRYIMPORT.
This ioctl returns the config of the pool in a buffer pre-allocated in
userland. The original estimate for the size is too conservative since
it doesn't account for the large size of vdev stats that are added to
the config before returning.
This fix simply increases the size of the buffer passed. This results in
a speed up of the zpool import process, and less spam in zfs_dbgmsg.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>
OpenZFS-issue: https://www.illumos.org/issues/7541
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a3c7690Closes#5704
Accidentally introduced by 4ea3f86. The BEGIN CSTYLE block cannot
appear half way through a continued #define.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#5643Closes#5644
When doing recv and rollback, dsl_dataset_clone_swap_sync_impl will be
called to swap out the ds_objset and do dmu_objset_evict on the old one.
However, currently zv->zv_objset will not be swapped out accordingly, so
if anyone currently holds a fd on the zvol, we risk hitting a use-after-free.
We fix this by introducing the suspend and resume mechanism of zsb to
zv. Before recv or rollback, we use zvol_suspend to block all access to
zv_objset and shut it down. After the recv or rollback, we use zvol_resume
to swap in zv_objset with the new ds_objset and unblock the access.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#4866Closes#5609
Add *_by_dnode() routines for accessing objects given their
dnode_t *, this is more efficient than accessing the object by
(objset_t *, uint64_t object). This change converts some but
not all of the existing consumers. As performance-sensitive
code paths are discovered they should be converted to use
these routines.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Closes#5534
Issue #4802
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Joe Stein <jas14@cs.brown.edu>
Ported-by: Don Brady <don.brady@intel.com>
When loading a pool that had been created before the existance of
per-vdev zaps, on a system that knows about per-vdev zaps, the
per-vdev zaps will not be allocated and initialized.
This appears to be because the logic that would have done so, in
spa_sync_config_object(), is not reached under normal operation. It is
only reached if spa_config_dirty_list is non-empty.
The fix is to add another `AVZ_ACTION_` enum that will allow this code
to be reached when we detect that we're loading an old pool, even when
there are no dirty configs.
OpenZFS-issue: https://www.illumos.org/issues/7743
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/e2d29d0Closes#5582
[bio] The req_op enum was changed to req_opf. Update the "Linux 4.8 API"
autotools checks to use an int to determine whether the various REQ_OP
values are defined. This should work properly on kernels >= 4.8.
[bio] bio_set_op_attrs() is now an inline function and can't be detected
with #ifdef. Add a configure check to determine whether bio_set_op_attrs()
is defined. Move the local definition of it from vdev_disk.c to
blkdev_compat.h for consistency with other related compability shims.
[bio] The read/write flags and their modifiers, including WRITE_FLUSH,
WRITE_FUA and WRITE_FLUSH_FUA have been removed from fs.h. Add the new
bio_set_flush() compatibility wrapper to replace VDEV_WRITE_FLUSH_FUA
and set the flags appropriately for each supported kernel version.
[vfs] The generic_readlink() function has been made static. If .readlink
in inode_operations is NULL, generic_readlink() is used.
[zol typo] Completely unrelated to 4.10 compat, fix a typo in the check
for REQ_OP_SECURE_ERASE so that the proper macro is defined:
s/HAVE_REQ_OP_SECURE_DISCARD/HAVE_REQ_OP_SECURE_ERASE/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#5499
The introduction of parallel zvol prefetch causes deadlock when using
vdev_file.
spa_async->(spa_namespace_lock)->txg_wait_synced->(wait for txg_sync)
txg_sync->zio_wait->(wait for vdev_file_io_fsync on system_taskq)
zvol_prefetch_minors_impl (on system_taskq)->spa_open_common->(wait for spa_namespace_lock)
We fix this by using dedicated taskq for vdev_file. This same change
was originally made in commit bc25c93 but reverted in commit aa9af22
when dynamic taskqs were added.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes#5506Closes#5495
The main complication from the RT patch set is that the RW semaphore
locks change such that read locks on an rwsem can be taken only by
a single thread. All other threads are locked out. This single
thread can take a read lock multiple times though. The underlying
implementation changes to a mutex with an additional read_depth
count.
The implementation can be best understood by inspecting the RT
patch. rwsem_rt.h and rt.c give the best insight into how RT
rwsem works. My implementation for rwsem_tryupgrade is basically
an inversion of rt_downgrade_write found in rt.c. Please see the
comments in the code.
Unfortunately, I have to drop SPLAT rwlock test4 completely as this
test tries to take multiple locks from different threads, which RT
rwsems do not support. Otherwise SPLAT, zconfig.sh, zpios-sanity.sh
and zfs-tests.sh pass on my Debian-testing VM with the kernel
linux-image-4.8.0-1-rt-amd64.
Tested-by: kernelOfTruth <kerneloftruth@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Clemens Fruhwirth <clemens@endorphin.org>
Closeszfsonlinux/zfs#5491Closes#589Closes#308
Enable picky cstyle checks and resolve the new warnings. The vast
majority of the changes needed were to handle minor issues with
whitespace formatting. This patch contains no functional changes.
Non-whitespace changes are as follows:
* 8 times ; to { } in for/while loop
* fix missing ; in cmd/zed/agents/zfs_diagnosis.c
* comment (confim -> confirm)
* change endline , to ; in cmd/zpool/zpool_main.c
* a number of /* BEGIN CSTYLED */ /* END CSTYLED */ blocks
* /* CSTYLED */ markers
* change == 0 to !
* ulong to unsigned long in module/zfs/dsl_scan.c
* rearrangement of module_param lines in module/zfs/metaslab.c
* add { } block around statement after for_each_online_node
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Håkan Johansson <f96hajo@chalmers.se>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#5465
Speed up import and export speed by:
* Add system delay taskq
* Parallel prefetch zvol dnodes during zvol_create_minors
* Parallel zvol_free during zvol_remove_minors
* Reduce list linear search using ida and hash
Reviewed-by: Boris Protopopov <boris.protopopov@actifio.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#5433
Add a dedicated system_delay_taskq for long delay like spa_deadman and
zpl_posix_acl_free. This will allow us to use system_taskq in the manner of
dispatch multiple tasks and call taskq_wait_outstanding.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#588
Save and reuse ddt dspace calculation when there have been no ddt changes.
This avoids unnecessary traversal of 168KiB of ddt histograms.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes#5425
It was observed that even when the txg history is disabled by
setting `zfs_txg_history=0` the txg_sync thread still fetches
the vdev stats unnecessarily.
This patch refactors the code such that vdev_get_stats() is no
longer called when `zfs_txg_history=0`. And it further reduces
the differences between upstream and the ZoL txg_sync_thread()
function.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#5412
It looks like this was functionality which was added in the
original SA implementation and then never needed. It can
be safely removed now and easily added back if we find a
use for it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes#5440
zio.h includes zio_impl.h but zio_impl.h also includes zio.h, so the
header files to contain each other. Get rid of the zio_impl.h include
in zio.h and update zio_inject.c to include zio.h instead of zio_impl.h.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes#5439
Use it for spa_deadman, zpl_posix_acl_free, snapentry_expire.
This free system_taskq from the above long delay tasks, and allow us to do
taskq_wait_outstanding on system_taskq without being blocked forever, making
system_taskq more generic and useful.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
* Convert ABD to use the Linux Kernel scatterlist implementation
instead of the hand rolled one from illumos.
* Scatter ABDs are preferentially populated with higher order
compound pages from a single zone. Allocation size is
progressively decreased until it can be satisfied without
performing reclaim or compaction.
* An alternate page allocator is provided for kernels older
than 3.6 and for CONFIG_HIGHMEM systems. This allocator
is designed as a fallback for maximum compatibility.
* Extended abdstats to provide visibility in the the allocator.
* Add cached value for PAGESIZE in userspace.
Contributions-by:
Chunwei Chen <david.chen@osnexus.com>
Gvozden Neskovic <neskovic@gmail.com>
Jinshan Xiong <jinshan.xiong@intel.com>
Isaac Huang <he.huang@intel.com>
David Quigley <david.quigley@intel.com>
Brian Behlendorf <behlendorf1@llnl.gov>
* userspace: aligned buffers. Minimum of 32B alignment is
needed for AVX2. Kernel buffers are aligned 512B or more.
* add abd_get_offset_size() interface
* abd_iter_map(): fix calculation of iter_mapsize
* add abd_raidz_gen_iterate() and abd_raidz_rec_iterate()
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
This patch adds a command (-c) option to zpool status and zpool iostat. The
-c option allows you to run an arbitrary command on each vdev and display
the first line of output in zpool status/iostat. The environment vars
VDEV_PATH and VDEV_UPATH are set to the vdev's path and "underlying path"
before running the command. For device mapper, multipath, or partitioned
vdevs, VDEV_UPATH is the actual underlying /dev/sd* disk. This can be useful
if the command you're running requires a /dev/sd* device.
The patch also uses /sys/block/<dev>/slaves/ to lookup the underlying device
instead of using libdevmapper. This not only removes the libdevmapper
requirement at build time, but also allows you to resolve device mapper
devices without being root. This means that UDEV_UPATH get set correctly
when running zpool status/iostat as an unprivileged user.
Example:
$ zpool status -c 'echo I am $VDEV_PATH, $VDEV_UPATH'
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
mpatha ONLINE 0 0 0 I am /dev/mapper/mpatha, /dev/sdc
sdb ONLINE 0 0 0 I am /dev/sdb1, /dev/sdb
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#5368
Allow `zfs unshare <protocol> -a` command to share or unshare all datasets
of a given protocol, nfs or smb.
Additionally, enable most of ZFS Test Suite zfs_share/zfs_unshare test cases.
To work around some Illumos-specific functionalities ($SHARE/$UNSHARE) some
function wrappers were added around them.
Finally, fix and issue in smb_is_share_active() that would leave SMB shares
exported when invoking 'zfs unshare -a'
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#3238Closes#5367
CID 147540: unsigned_compare
- Cast nsec to a int32_t to properly detect the expected overflow.
CID 147542: unsigned_compare
- intval can never be less than ZIO_FAILURE_MODE_WAIT which is
defined to be zero. Remove this useless check.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes#5379
It's used by Lustre to determine if the objset can be upgraded.
The inline version doesn't work because dmu_objset_is_snapshot()
is not exported.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
Closes#5385
Linux 3.14 introduces inode->set_acl(). Normally, acl modification will come
from setxattr, which will handle by the acl xattr_handler, and we already
handles that well. However, nfsd will directly calls inode->set_acl or
return error if it doesn't exists.
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Massimo Maggi <me@massimo-maggi.eu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#5371Closes#5375
Originally, these two function are inline, so their usability is tied to
posix_acl_release. However, since Linux 3.14, they became EXPORT_SYMBOL, so we
can always use them. In this patch, we create an independent test for these
two functions so we can use them when possible.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>