mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-01-25 10:12:13 +03:00

Author	SHA1	Message	Date
Tony Hutter	4a283c7f77	Force fault a vdev with 'zpool offline -f' This patch adds a '-f' option to 'zpool offline' to fault a vdev instead of bringing it offline. Unlike the OFFLINE state, the FAULTED state will trigger the FMA code, allowing for things like autoreplace and triggering the slot fault LED. The -f faults persist across imports, unless they were set with the temporary (-t) flag. Both persistent and temporary faults can be cleared with zpool clear. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #6094	2017-05-19 12:30:16 -07:00
Tom Caputi	a32df59e18	Fixed small memory leak in ereport handling One pre-check in zfs_ereport_start() was being called after the nvlists were being allocated. This simply corrects that issue. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #6140	2017-05-18 17:35:49 -07:00
Boris Protopopov	5559ba094f	Introduce zv_state_lock The lock is designed to protect internal state of zvol_state_t and to avoid taking spa_namespace_lock (e.g. in dmu_objset_own() code path) while holding zvol_state_lock. Refactor the code accordingly. Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3484 Closes #6065 Closes #6134	2017-05-16 19:44:06 -04:00
Boris Protopopov	07783588bc	Revert commit `1ee159f4` Fix lock order inversion with zvol_open() as it did not account for use of zvols as vdevs. The latter use cases resulted in the lock order inversion deadlocks that involved spa_namespace_lock and bdev->bd_mutex. Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #6065 Issue #6134	2017-05-16 19:42:47 -04:00
Isaac Huang	3d6da72d18	Skip spurious resilver IO on raidz vdev On a raidz vdev, a block that does not span all child vdevs, excluding its skip sectors if any, may not be affected by a child vdev outage or failure. In such cases, the block does not need to be resilvered. However, current resilver algorithm simply resilvers all blocks on a degraded raidz vdev. Such spurious IO is not only wasteful, but also adds the risk of overwriting good data. This patch eliminates such spurious IOs. Reviewed-by: Gvozden Neskovic <neskovic@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Isaac Huang <he.huang@intel.com> Closes #5316	2017-05-12 17:28:03 -07:00
Matthew Ahrens	4747a7d3d4	OpenZFS 8063 - verify that we do not attempt to access inactive txg Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> A standard practice in ZFS is to keep track of "per-txg" state. Any of the 3 active TXG's (open, quiescing, syncing) can have different values for this state. We should assert that we do not attempt to modify other (inactive) TXG's. Porting Notes: - ASSERTV added to txg_sync_waiting() for unused variable. OpenZFS-issue: https://www.illumos.org/issues/8063 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/01acb46 Closes #6109	2017-05-10 13:52:22 -04:00
Matthew Ahrens	335b251ac1	OpenZFS 8166 - zpool scrub thinks it repaired offline device Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Matthew Ahrens <mahrens@delphix.com> If we do a scrub while a leaf device is offline (via "zpool offline"), we will inadvertently clear the DTL (dirty time log) of the offline device, even though it is still damaged. When the device comes back online, we will incompletely resilver it, thinking that the scrub repaired blocks written before the scrub was started. The incomplete resilver can lead to data loss if there is a subsequent failure of a different leaf device. The fix is to never clear the DTL of offline devices. Note that if a device is onlined while a scrub is in progress, the scrub will be restarted. The problem can be worked around by running "zpool scrub" after "zpool online". OpenZFS-issue: https://www.illumos.org/issues/8166 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/372 Closes #5806 Closes #6103	2017-05-10 10:32:39 -07:00
Tom Caputi	f486f58440	Add missing arc_free_cksum() to arc_release() The arc layer tracks checksums of its data in the arc header so that it can ensure that buffers haven't changed when they're not supposed to. This checksum is only maintained while there is an uncompressed buffer still attached to the header. Unfortunately there is a missing call to arc_free_cksum() in arc_release() that can trigger ASSERTs. This has not been a common issue because the checksums are only maintained for debug builds and triggering the bug requires writing a block (and therefore calling arc_release()) while a compressed buffer is still being used on a debug build. This simply corrects the issue. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #6105	2017-05-10 10:25:27 -07:00
Brian Behlendorf	2946a1a15a	Linux 4.12 compat: CURRENT_TIME removed Linux 4.9 added current_time() as the preferred interface to get the filesystem time. CURRENT_TIME was retired in Linux 4.12. Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6114	2017-05-10 09:30:48 -07:00
LOLi	a3eeab2de6	Add property overriding (-o|-x) to 'zfs receive' This allows users to specify "-o property=value" to override and "-x property" to exclude properties when receiving a zfs send stream. Both native and user properties can be specified. This is useful when using zfs send/receive for periodic backup/replication because it lets users change properties such as canmount, mountpoint, or compression without modifying the source. References: https://www.illumos.org/issues/2745 https://www.illumos.org/issues/3753 Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #1350 Closes #5349	2017-05-09 16:21:09 -07:00
Chunwei Chen	e624cd1959	Linux 4.12 compat: PF_FSTRANS was removed zfsonlinux/spl@8f87971 added __spl_pf_fstrans_check for the xfs related check, so we use them accordingly. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #6113	2017-05-09 10:38:46 -07:00
Brian Behlendorf	1eab430af7	Fix unused variable warning Remove the lz4_ac local variable from dmu_write_policy() to resolve the following unused variable warning on non-debug builds. dmu.c: In function ‘dmu_write_policy’: dmu.c:1892:12: warning: unused variable ‘lz4_ac’ [-Wunused-variable] boolean_t lz4_ac = spa_feature_is_active(os->os_spa, Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2017-05-05 10:23:58 -07:00
Gvozden Neskovic	c17486b217	Add missing _destroy/_fini calls The proposed debugging enhancements in zfsonlinux/spl#587 identified the following missing _destroy/_fini calls. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Closes #5428	2017-05-04 19:26:28 -04:00
Brian Behlendorf	8fa5250f5d	Default to zvol_request_sync=0 Change the default ZVOL behavior so requests are handled asynchronously. This behavior is functionally the same as in the zfs-0.6.4 release. Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #5902	2017-05-04 18:01:50 -04:00
Richard Yao	bc17f1047a	Enable Linux read-ahead for a single page on ZVOLs Linux has read-ahead logic designed to accelerate sequential workloads. ZFS has its own read-ahead logic called zprefetch that operates on both ZVOLs and datasets. Having two prefetchers active at the same time can cause overprefetching, which unnecessarily reduces IOPS performance on CoW filesystems like ZFS. Testing shows that entirely disabling the Linux prefetch results in a significant performance penalty for reads while commensurate benefits are seen in random writes. It appears that read-ahead benefits are inversely proportional to random write benefits, and so a single page of Linux-layer read-ahead appears to offer the middle ground for both workloads. Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Issue #5902	2017-05-04 18:00:27 -04:00
RageLtMan	5731140eaf	Disable write merging on ZVOLs The current ZVOL implementation does not explicitly set merge options on ZVOL device queues, which results in the default merge behavior. Explicitly set QUEUE_FLAG_NOMERGES on ZVOL queues allowing the ZIO pipeline to do its work. Initial benchmarks (tiotest with no O_DIRECT) show random write performance going up almost 3X on 8K ZVOLs, even after significant rewrites of the logical space allocation. Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: RageLtMan <rageltman@sempervictus> Issue #5902	2017-05-04 17:59:52 -04:00
Olaf Faaland	9d3f7b8791	Write label 2,3 uberblocks when vdev expands When vdev_psize increases, the location of labels 2 and 3 changes because their location is relative to the end of the device. The configs for labels 2 and 3 are written during the next spa_sync() because the vdev is added to the dirty config list. However, the uberblock rings are not re-written in their new location, leaving the device vulnerable to the beginning of the device being overwritten or damaged. This patch copies the uberblock ring from label 0 to labels 2 and 3, in their new locations, at the next sync after vdev_psize increases. Also, add a test zpool_expand_004_pos.ksh to confirm the uberblocks are copied. Reviewed-by: BearBabyLiu <liu.huang@zte.com.cn> Reviewed-by: Andreas Dilger <andreas.dilger@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #5108	2017-05-02 13:55:24 -07:00
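The reason labels 2 and 3 move when a vdev grows can be sketched with a small offset calculation. This is an illustration only: the 256 KiB label size and the four-label layout (two at the start of the device, two at the end) are assumptions about the on-disk format that the commit message does not spell out, and `label_offset` is a hypothetical helper, not the function used in the ZFS source.

```python
# Assumed on-disk constants; not stated in the commit message above.
VDEV_LABEL_SIZE = 256 * 1024   # bytes per label (assumption)
VDEV_LABELS = 4                # labels 0,1 at the front, 2,3 at the back


def label_offset(psize: int, label: int) -> int:
    """Return the byte offset of a label on a device of psize bytes.

    Hypothetical helper: labels in the first half are placed from the
    start of the device, labels in the second half from the end.
    """
    if not 0 <= label < VDEV_LABELS:
        raise ValueError("label index out of range")
    if psize % VDEV_LABEL_SIZE != 0:
        raise ValueError("psize is assumed to be label-size aligned")
    base = label * VDEV_LABEL_SIZE
    if label < VDEV_LABELS // 2:
        return base
    # Trailing labels are positioned relative to the end of the device.
    return base + psize - VDEV_LABELS * VDEV_LABEL_SIZE


GIB = 1024 ** 3
before = [label_offset(1 * GIB, i) for i in range(VDEV_LABELS)]
after = [label_offset(2 * GIB, i) for i in range(VDEV_LABELS)]

# Labels 0 and 1 stay put; labels 2 and 3 land at new offsets after growth.
moved = [i for i in range(VDEV_LABELS) if before[i] != after[i]]
print(moved)
```

Under these assumptions only the trailing labels change offset after an expansion, which is why their uberblock rings have to be written again at the new location.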
Debabrata Banerjee	03b60eee78	Allow scaling of arc in proportion to pagecache When multiple filesystems are in use, memory pressure causes arc_cache to collapse to a minimum. Allow arc_cache to maintain proportional size even when hit rates are disproportionate. We do this only via evictable size from the kernel shrinker, thus it's only in effect under memory pressure. AKAMAI: zfs: CR 3695072 Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Closes #6035	2017-05-02 15:50:49 -04:00
Debabrata Banerjee	4149bf498a	Correct signed operation Could return the wrong pages value AKAMAI: zfs: CR 3695072 Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Issue #6035	2017-05-02 15:50:26 -04:00
Debabrata Banerjee	44813aefad	Don't run the reaper if we didn't shrink the cache Calling it when nothing is evictable will cause extra kswapd cpu. Also if we didn't shrink it's unlikely to have memory to reap because we likely just called it microseconds ago. The exception is if we are in direct reclaim. You can see how hard this is being hit in kswapd with a light test workload: 34.95% [zfs] [k] arc_kmem_reap_now 5.40% [spl] [k] spl_kmem_cache_reap_now 3.79% [kernel] [k] _raw_spin_lock 2.86% [spl] [k] __spl_kmem_cache_generic_shrinker.isra.7 2.70% [kernel] [k] shrink_slab.part.37 1.93% [kernel] [k] isolate_lru_pages.isra.43 1.55% [kernel] [k] __wake_up_bit 1.20% [kernel] [k] super_cache_count 1.20% [kernel] [k] __radix_tree_lookup With ZFS just mounted but only ext4/pagecache memory pressure arc_kmem_reap_now still consumes excessive CPU: 12.69% [kernel] [k] isolate_lru_pages.isra.43 10.76% [kernel] [k] free_pcppages_bulk 7.98% [kernel] [k] drop_buffers 7.31% [kernel] [k] shrink_page_list 6.44% [zfs] [k] arc_kmem_reap_now 4.19% [kernel] [k] free_hot_cold_page 4.00% [kernel] [k] __slab_free 3.95% [kernel] [k] __isolate_lru_page 3.09% [kernel] [k] __radix_tree_lookup Same pagecache only workload as above with this patch series: 11.58% [kernel] [k] isolate_lru_pages.isra.43 11.20% [kernel] [k] drop_buffers 9.67% [kernel] [k] free_pcppages_bulk 8.44% [kernel] [k] shrink_page_list 4.86% [kernel] [k] __isolate_lru_page 4.43% [kernel] [k] free_hot_cold_page 4.00% [kernel] [k] __slab_free 3.44% [kernel] [k] __radix_tree_lookup (arc_kmem_reap_now has 0 samples in perf) AKAMAI: zfs: CR 3695042 Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Issue #6035	2017-05-02 15:50:13 -04:00
Debabrata Banerjee	1a31dcf53c	Only wakeup waiters if we've actually done work AKAMAI: zfs: CR 3695072 Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Issue #6035	2017-05-02 15:50:02 -04:00
Debabrata Banerjee	2e91c2fb1a	Do not stop kernel shrinker on lock contention Lock contention, by itself, shouldn't indicate a stop condition to the kernel's slab shrinker. Doing so can cause stalls when the kernel is trying to free large parts of the cache such as is done by drop_caches. Also, perhaps arc_reclaim_lock should be a spinlock, and this code eliminated. AKAMAI: zfs: CR 3593801 Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Issue #6035	2017-05-02 15:49:48 -04:00
Debabrata Banerjee	b855550c33	Stop double reclaiming or not reclaiming at all Move arcstat_need_free increment from all direct calls to when arc_reclaim_lock is busy and we exit without doing anything. Data will be reclaimed in reclaim thread. The previous location meant that we both reclaim the memory in this thread, and also schedule the same amount of memory for reclaim in arc_reclaim, effectively doubling the requested reclaim. AKAMAI: zfs: CR 3695072 Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Issue #6035	2017-05-02 15:49:36 -04:00
Debabrata Banerjee	30fffb9021	Make arc_need_free updates atomic Ensures proper accounting of bytes we requested to free AKAMAI: zfs: CR 3695072 Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Issue #6035	2017-05-02 15:48:49 -04:00
Debabrata Banerjee	9b50146dc4	Don't report ghost buffers as evictable mem Ghost meta/data buffers are not actually allocated AKAMAI: zfs: CR 3695072 Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Issue #6035	2017-05-02 15:47:23 -04:00
jxiong	2b91b5119c	minor improvement to abd_free_pages() It doesn't need to have a loop to free page in a single scatterlist entry because it should be single or compound page. The pages can be freed in one invocation to __free_pages() for both cases. Reviewed-by: Gvozden Neskovic <neskovic@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Jinshan Xiong <jinshan.xiong@gmail.com> Closes #6057	2017-05-02 10:06:18 -07:00
jxiong	24fa20340d	Guarantee PAGESIZE alignment for large zio buffers In the current implementation, only zio buffers of 16KB and larger are guaranteed PAGESIZE alignment. This breaks Lustre since it assumes that 'arc_buf_t::b_data' must be page aligned when zio buffers are greater than or equal to PAGESIZE. This patch makes zio buffers PAGESIZE aligned when their size is not less than PAGESIZE. This change may waste a little memory, but that should be fine because, since ABD was introduced, zio buffers are used to hold data temporarily and live in memory for a short while. Reviewed-by: Don Brady <don.brady@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jinshan Xiong <jinshan.xiong@gmail.com> Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com> Closes #6084	2017-05-02 10:04:30 -07:00
Brian Behlendorf	7dae2c81e7	Linux 4.12 compat: super_setup_bdi_name() All filesystems were converted to dynamically allocated BDIs. The destruction of backing_dev_info structures is handled as part of super block destruction. Refactor the code to abstract away the details of creating and destroying a BDI. Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6089	2017-05-02 09:46:18 -07:00
Yuri Pankov	153b228554	OpenZFS 7786 - zfs`vdev_online() needs better notification about state changes Authored by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Albert Lee <trisk@forkgnu.org> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: bunder2015 <omfgbunder@gmail.com> OpenZFS-issue: https://www.illumos.org/issues/7786 OpenZFS-commit: http://github.com/openzfs/openzfs/commit/db8498f Closes #6074	2017-05-01 16:24:37 -04:00
Brian Behlendorf	e99932f7de	Limit zfs_dirty_data_max_max to 4G Reinstate default 4G zfs_dirty_data_max_max limit. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6072 Closes #6081	2017-05-01 13:01:39 -07:00
Chunwei Chen	692e55b8fe	Reinstate zvol_taskq to fix aio on zvol Commit `37f9dac` removed the zvol_taskq for processing zvol requests. This was removed as part of switching to make_request_fn and was motivated by a concern at the time over dispatch latency. However, this also made all bio requests synchronous, and caused serious performance issues as the bio submitter would wait for every bio it submitted, effectively making the IO depth 1. This patch reinstates zvol_taskq, and to make sure overlapped I/Os are ordered properly, we take range lock in zvol_request, and pass it along with bio to the I/O functions zvol_{write,discard,read}. In order to facilitate benchmarks a zvol_request_sync module option was added to switch between sync and async request handling. For the moment, the default behavior is synchronous but this is likely to change pending additional testing. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #5824	2017-04-26 13:54:40 -07:00
Dan Kimmel	a7004725d0	OpenZFS 7252 - compressed zfs send / receive OpenZFS 7252 - compressed zfs send / receive OpenZFS 7628 - create long versions of ZFS send / receive options Authored by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: David Quigley <dpquigl@davequigley.com> Reviewed by: Thomas Caputi <tcaputi@datto.com> Approved by: Dan McDonald <danmcd@omniti.com> Reviewed by: David Quigley <dpquigl@davequigley.com> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Ported-by: bunder2015 <omfgbunder@gmail.com> Ported-by: Don Brady <don.brady@intel.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Porting Notes: - Most of 7252 was already picked up during ABD work. This commit represents the gap from the final commit to openzfs. - Fixed split_large_blocks check in do_dump() - An alternate version of the write_compressible() function was implemented for Linux which does not depend on fio. The behavior of fio differs significantly based on the exact version. - mkholes was replaced with truncate for Linux. OpenZFS-issue: https://www.illumos.org/issues/7252 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5602294 Closes #6067	2017-04-26 12:31:43 -07:00
wli5	7a25f0891e	Change U16 to U32 due to atomic_inc_32_nv After running for a long time with QAT compression, the variable "inst_num" is overflowed by "atomic_inc_32_nv", which causes its neighbor variable to be overwritten. Change its definition from U16 to U32. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Weigang Li <weigang.li@intel.com> Closes #6051	2017-04-25 17:41:58 -07:00
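The failure mode described in this commit, a 32-bit increment applied to a 16-bit variable, can be reproduced in miniature with a byte buffer. This is a sketch under stated assumptions: the two-field layout, the little-endian byte order, and the helper name `inc_32_at` are all hypothetical stand-ins, and the operation here is not atomic; it only models the width mismatch.

```python
import struct


def inc_32_at(buf: bytearray, offset: int = 0) -> int:
    """Increment the 32-bit little-endian word at offset and return it.

    Hypothetical model of a 32-bit increment routine being pointed at a
    location that only holds a 16-bit counter.
    """
    (value,) = struct.unpack_from("<I", buf, offset)
    value = (value + 1) & 0xFFFFFFFF
    struct.pack_into("<I", buf, offset, value)
    return value


# Two adjacent 16-bit fields: a counter about to wrap, and its neighbor.
buf = bytearray(struct.pack("<HH", 0xFFFF, 0x1234))

inc_32_at(buf)  # treats both fields as a single 32-bit word

inst_num, neighbor = struct.unpack("<HH", buf)
print(hex(inst_num), hex(neighbor))
```

When the 16-bit counter wraps, the carry spills into the adjacent field, which is the kind of neighbor corruption the commit avoids by widening the variable to 32 bits.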
Matthew Ahrens	a004338372	OpenZFS 8025 - dbuf_read() creates unnecessary zio_root() for bonus buf Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> dbuf_read() creates a zio_root() to track and wait for all the zio's that may happen as part of this call. However, if the blkptr_t for this buffer is NULL or a hole, we will not create any more zio's, so this zio_root() is unnecessary. This is always the case when calling dbuf_read() on a bonus buffer, because it has no blkptr (it's part of the containing dnode). For workloads that read a lot of bonus buffers (e.g. file creation and removal), creating and destroying these unnecessary zio's can decrease performance by around 3%. The fix is to only create/destroy the zio_root() in dbuf_read() if the blkptr is not NULL and not a hole. Porting Notes: - The error handling for when dbuf_read_impl() fails which was originally added in commit `5f6d0b6f5` has been preserved. OpenZFS-issue: https://www.illumos.org/issues/8025 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8ec5c7c Closes #6048	2017-04-24 10:44:19 -07:00
dbavatar	6e03ec4fa2	Fix lseek result when dnode is dirty Fixup commit `66aca24`. We should have equivalent return values as generic_file_llseek() and advance to end of file. Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Tested-by: bunder2015 <omfgbunder@gmail.com> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Closes #6050 Closes #6053	2017-04-24 09:38:31 -07:00
Olaf Faaland	0091d66f4e	Correct lock ASSERTs in vdev_label_read/write The existing assertions in vdev_label_read() and vdev_label_write(), testing which config locks are held, are incorrect. The assertions test for locks which exceed what is required for safety. Both vdev_label_{read,write}() are changed to assert SCL_STATE is held as RW_READER or RW_WRITER. This is safe because: Changes to the vdev tree occur under SCL_ALL as RW_WRITER, via spa_vdev_enter() and spa_vdev_exit(). Changes to vdev state occur under SCL_STATE_ALL as RW_WRITER, via spa_vdev_state_enter() and spa_vdev_state_exit(). Therefore, the new assertions guarantee that the vdev cannot change out from under a zio, and I/O to a specified leaf vdev's label is safe. Furthermore, this is consistent with the SPA locking discussion in spa_misc.c, "For any zio operation that takes an explicit vdev_t argument ... zio_read_phys(), or zio_write_phys() ... SCL_STATE as reader suffices." Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #5983	2017-04-21 14:26:43 -07:00
DHE	06226b5936	Increase zfs_vdev_async_write_min_active to 2 Resilver operations frequently cause only a small amount of dirty data to be written to disk at a time, causing the IO scheduler to issue only 1 write at a time to the resilvering disk. When it is rotational media the drive will often travel past the next sector to be written before receiving a write command from ZFS, significantly delaying the write of the next sector. Raise zfs_vdev_async_write_min_active so that drives are kept fed during resilvering. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: DHE <git@dehacked.net> Issue #4825 Closes #5926	2017-04-14 14:03:44 -07:00
Matthew Ahrens	f6d4ce8e34	OpenZFS 8061 - sa_find_idx_tab can be declared more type-safely Authored by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Chris Williamson <chris.williamson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> sa_find_idx_tab() is declared as taking and returning "void *" parameters. These can be declared to be the specific types. OpenZFS-issue: https://www.illumos.org/issues/8061 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4e64aff Closes #6017	2017-04-14 11:11:28 -07:00
Andriy Gapon	87a275d97a	OpenZFS 6101 - attempt to lzc_create() a filesystem under a volume results in a panic Authored by: Andriy Gapon <avg@FreeBSD.org> Approved by: Dan McDonald <danmcd@omniti.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> When querying ZPL properties verify that the objset is of type DMU_OST_ZFS. OpenZFS-issue: https://www.illumos.org/issues/6101 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ce2243a Closes #6015	2017-04-14 11:11:28 -07:00
Andriy Gapon	31b6bc74b9	OpenZFS 8026 - retire zfs_throttle_delay and zfs_throttle_resolution Authored by: Andriy Gapon <avg@FreeBSD.org> Approved by: Richard Lowe <richlowe@richlowe.net> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/8026 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9b33e07 Closes #6014	2017-04-14 11:11:20 -07:00
Debabrata Banerjee	66aca24730	SEEK_HOLE should not block on txg_wait_synced() Force flushing of txg's can be painfully slow when competing for disk IO, since this is a process meant to execute asynchronously. Optimize this path via allowing data/hole seeking if the file is clean, but if dirty fall back to old logic. This is a compromise to disabling the feature entirely. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Closes #4306 Closes #5962	2017-04-13 10:51:20 -07:00
Brian Behlendorf	e550644f0c	OpenZFS 5120 - zfs should allow large block/gzip/raidz boot pool (loader project) Authored by: Toomas Soome <tsoome@me.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Don Brady <don.brady@intel.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Porting Notes: - grub-2.02-beta2-422-gcad5cc0 includes support for large blocks. - Commit `8aab121` allowed GZIP[1-9]. - Grub allows pools with multiple top-level vdevs. OpenZFS-issue: https://www.illumos.org/issues/5120 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c8811bd Closes #6007	2017-04-13 09:40:00 -07:00
Brian Behlendorf	00481e7dad	OpenZFS 7503 - zfs-test should tail ::zfs_dbgmsg on test failure Authored by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Approved by: Gordon Ross <gordon.w.ross@gmail.com> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Porting Notes: - Enable internal log for DEBUG builds and in zfs-tests.sh. - callbacks/zfs_dbgmsg.ksh - Dump internal log via kstat. - callbacks/zfs_dmesg.ksh - Dump dmesg log. - default.cfg - 'Test Suite Specific Commands' dropped. OpenZFS-issue: https://www.illumos.org/issues/7503 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/55a1300 Closes #6002	2017-04-12 13:36:48 -07:00
Giuseppe Di Natale	17b43f96f9	Skip rate limiting events in zfs_ereport_post In zfs_ereport_post, if an event is a rate limiting event, immediately return before any processing is done. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov> Closes #5998	2017-04-11 18:37:45 -07:00
LOLi	047187c1bd	Fix size inflation in spa_get_worst_case_asize() When we try assign a new transaction to a TXG we must know beforehand if there is sufficient free space on disk. This is to decide, in dmu_tx_assign(), if we should reject the TX with ENOSPC. We rely on spa_get_worst_case_asize() to inflate the size of our logical writes by a factor of spa_asize_inflation which is calculated as: (VDEV_RAIDZ_MAXPARITY + 1) * SPA_DVAS_PER_BP * 2 == 24 The problem with the current implementation is that we don't take into account what happens with very small writes on VDEVs with large physical block sizes. Consider the case of writes to a dataset with recordsize=512, copies=3 on a VDEV with ashift=13 (usually SSD with 8K block size): every logical IO will end up allocating 3 * 8K = 24K on disk, so 512 bytes multiplied by 48, which is double the size we account for. If we allow this kind of writes to be assigned a TX it is possible, when the pool is almost full, to trigger an allocation failure (ENOSPC) in the ZIO pipeline, which will in turn result in the whole pool being suspended. The bug is fixed by using, in spa_get_worst_case_asize(), the MAX() value chosen between the logical io size from zfs_write() and the maximum physical block size used among our VDEVs. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #5941	2017-04-10 15:28:21 -07:00
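The arithmetic in this commit message can be checked with a short calculation. This is a rough sketch, not the code from the patch: the constant values 3 and 3 are inferred from the stated product of 24, the helper names are hypothetical, and `actual_alloc` ignores RAID-Z parity and metadata so that it models only the "one physical block per copy" case from the example.

```python
# Values inferred from "(VDEV_RAIDZ_MAXPARITY + 1) * SPA_DVAS_PER_BP * 2 == 24".
VDEV_RAIDZ_MAXPARITY = 3
SPA_DVAS_PER_BP = 3
SPA_ASIZE_INFLATION = (VDEV_RAIDZ_MAXPARITY + 1) * SPA_DVAS_PER_BP * 2


def worst_case_old(lsize: int) -> int:
    """Estimate before the fix: inflate the logical size only."""
    return lsize * SPA_ASIZE_INFLATION


def worst_case_new(lsize: int, max_ashift: int) -> int:
    """Estimate after the fix: inflate MAX(logical size, largest physical block)."""
    return max(lsize, 1 << max_ashift) * SPA_ASIZE_INFLATION


def actual_alloc(lsize: int, copies: int, ashift: int) -> int:
    """Simplified on-disk allocation: each copy is rounded up to whole sectors."""
    sector = 1 << ashift
    per_copy = -(-lsize // sector) * sector  # ceiling to a sector multiple
    return copies * per_copy


lsize, copies, ashift = 512, 3, 13  # recordsize=512, copies=3, ashift=13

allocated = actual_alloc(lsize, copies, ashift)
print(allocated // lsize)                 # inflation actually incurred
print(worst_case_old(lsize) < allocated)  # old estimate undercounts
print(worst_case_new(lsize, ashift) >= allocated)
```

With these inputs the real allocation is 48 times the logical size, twice what the old estimate reserves, while the MAX()-based estimate covers it.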
Matthew Ahrens	8542ef852a	OpenZFS 8005 - poor performance of 1MB writes on certain RAID-Z configurations Authored by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Don Brady <don.brady@intel.com> Ported-by: Matt Ahrens <mahrens@delphix.com> RAID-Z requires that space be allocated in multiples of P+1 sectors, because this is the minimum size block that can have the required amount of parity. Thus blocks on RAIDZ1 must be allocated in a multiple of 2 sectors; on RAIDZ2 multiple of 3; and on RAIDZ3 multiple of 4. A sector is a unit of 2^ashift bytes, typically 512B or 4KB. To satisfy this constraint, the allocation size is rounded up to the proper multiple, resulting in up to 3 "pad sectors" at the end of some blocks. The contents of these pad sectors are not used, so we do not need to read or write these sectors. However, some storage hardware performs much worse (around 1/2 as fast) on mostly-contiguous writes when there are small gaps of non-overwritten data between the writes. Therefore, ZFS creates "optional" zio's when writing RAID-Z blocks that include pad sectors. If writing a pad sector will fill the gap between two (required) writes, we will issue the optional zio, thus doubling performance. The gap-filling performance improvement was introduced in July 2009. Writing the optional zio is done by the io aggregation code in vdev_queue.c. The problem is that it is also subject to the limit on the size of aggregate writes, zfs_vdev_aggregation_limit, which is by default 128KB. For a given block, if the amount of data plus padding written to a leaf device exceeds zfs_vdev_aggregation_limit, the optional zio will not be written, resulting in a ~2x performance degradation. The problem occurs only for certain values of ashift, compressed block size, and RAID-Z configuration (number of parity and data disks). 
It cannot occur with the default recordsize=128KB. If compression is enabled, all configurations with recordsize=1MB or larger will be impacted to some degree. The problem notably occurs with recordsize=1MB, compression=off, with 10 disks in a RAIDZ2 or RAIDZ3 group (with 512B or 4KB sectors). Therefore this problem has been known as "the 1MB 10-wide RAIDZ2 (or 3) problem". The problem also occurs with the following configurations: With recordsize=512KB or 256KB, compression=off, the problem occurs only in rarely-used configurations: * 4-wide RAIDZ1 with recordsize=512KB and ashift=12 (4KB sectors) * 4-wide RAIDZ2 (either recordsize, either ashift) * 5-wide RAIDZ2 with recordsize=512KB (either ashift) * 6-wide RAIDZ2 with recordsize=512KB (either ashift) With recordsize=1MB, compression=off, ashift=9 (512B sectors) * RAIDZ1 with 4 or 8 disks * RAIDZ2 with 4, 8, or 10 disks * RAIDZ3 with 6, 8, 9, or 10 disks With recordsize=1MB, compression=off, ashift=12 (4KB sectors) * RAIDZ1 with 7 or 8 disks * RAIDZ2 with 4, 5, or 10 disks * RAIDZ3 with 6, 9, or 10 disks With recordsize=2MB and larger (which can only be selected by changing kernel tunables), many configurations are affected, including with higher numbers of disks (up to 18 disks with recordsize=2MB). Increase zfs_vdev_aggregation_limit to allow the optional zio to be aggregated, thus eliminating the problem. Setting it to 256KB fixes all commonly-used configurations. The solution is to aggregate optional zio's regardless of the aggregation size limit. Analysis sponsored by Intel Corp. OpenZFS-issue: https://www.illumos.org/issues/8005 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/321 Closes #5931	2017-04-10 15:21:45 -07:00
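The pad-sector rule described in 8542ef852a (allocations rounded up to a multiple of P+1 sectors) can be sketched as follows. This is a simplified model under stated assumptions, not the vdev_raidz.c code: the function names are hypothetical, compression is off so the physical size equals the recordsize, and parity is counted as P sectors per row of data sectors.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)


def raidz_alloc_sectors(psize, ashift, ndisks, nparity):
    """Return (allocated_sectors, pad_sectors) for one block.

    psize   -- physical block size in bytes
    ashift  -- log2 of the sector size
    ndisks  -- total disks in the RAID-Z group
    nparity -- parity level P (1, 2 or 3)
    """
    data = ceil_div(psize, 1 << ashift)           # data sectors
    rows = ceil_div(data, ndisks - nparity)       # stripes needed for the data
    unpadded = data + nparity * rows              # data + parity sectors
    padded = ceil_div(unpadded, nparity + 1) * (nparity + 1)
    return padded, padded - unpadded


if __name__ == "__main__":
    # The "1MB 10-wide RAIDZ2" case with 4KB sectors.
    print(raidz_alloc_sectors(1 << 20, 12, 10, 2))
```

In this model the 1MB, 10-wide RAIDZ2 block with 4KB sectors ends with one pad sector. That pad sector is what the optional zio would write.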
Giuseppe Di Natale	42db43e982	OpenZFS 2932 - support crash dumps to raidz, etc. pools Authored by: Bill Pijewski <wdp@joyent.com> Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@nexenta.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/2932 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/810e43b Closes #5984 Closes #5216	2017-04-10 10:24:17 -07:00
George Wilson	3b7f360c96	OpenZFS 8023 - Panic destroying a metaslab deferred range tree Authored by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> We don't want to dirty any data when we're in the final txgs of the pool export logic. This change introduces checks to make sure that no data is dirtied after a certain point. It also addresses the culprit of this specific bug – the space map cannot be upgraded when we're in final stages of pool export. If we encounter a space map that wants to be upgraded in this phase, then we simply ignore the request as it will get retried the next time we set the fragmentation metric on that metaslab. OpenZFS-issue: https://www.illumos.org/issues/8023 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/2ef00f5 Closes #5991	2017-04-09 16:12:35 -07:00
Andriy Gapon	c0c8cc7b43	OpenZFS 8027 - tighten up dsl_pool_dirty_delta Authored by: Andriy Gapon <avg@FreeBSD.org> Approved by: Dan McDonald <danmcd@omniti.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/8027 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/642668d Closes #5988	2017-04-09 16:04:35 -07:00
Don Brady	177c91d06e	Fix regression in zfs_ereport_start() On 32-bit platforms spa_state is 32 bits without cast, and thus caused a NULL pointer dereference when treated as 64bit in var arg. Accidentally introduced by `bcdb96a`. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Nathaniel Clark <nathaniel.l.clark@intel.com> Signed-off-by: Don Brady <don.brady@intel.com> Closes #5966 Closes #5965	2017-04-05 14:24:26 -07:00
Steven Hartland	2e215fecbe	OpenZFS 7885 - zpool list can report 16.0e for expandsz Authored by: Steven Hartland <steven.hartland@multiplay.co.uk> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Gordon Ross <gordon.w.ross@gmail.com> When a member of a RAIDZ has been replaced with a device smaller than the original, then the top level vdev can report its expand size as 16.0E. The reduced child asize causes the RAIDZ to have a vdev_asize lower than its vdev_max_asize which then results in an underflow during the calculation of the parents expand size. Fix this by updating the vdev_asize if it shrinks, which is already protected by a check against vdev_min_asize so should always be safe. Also for RAIDZ vdevs, ensure that the sum of their child vdev_min_asize is always greater than the parents vdev_min_size. Reviewed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7885 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/bb0dbaa Closes #5963	2017-04-05 09:33:20 -07:00
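The 16.0E figure in 2e215fecbe is what an unsigned 64-bit subtraction produces when it wraps below zero. The sketch below only illustrates that wrap-around; the operands are made-up sizes, the helper names are hypothetical, and it does not model which vdev fields are actually subtracted.

```python
U64_MASK = (1 << 64) - 1


def u64_sub(a, b):
    """Subtract the way an unsigned 64-bit integer would, wrapping on underflow."""
    return (a - b) & U64_MASK


def to_eib(nbytes):
    """Convert a byte count to exbibytes (2**60 bytes)."""
    return nbytes / float(1 << 60)


if __name__ == "__main__":
    smaller = 1 << 40               # 1 TiB, made-up value
    larger = (1 << 40) + (1 << 30)  # 1 TiB + 1 GiB, made-up value
    wrapped = u64_sub(smaller, larger)
    print(round(to_eib(wrapped), 1), "EiB")
```

Any small negative difference wraps to just under 2^64 bytes, which a size formatter rounds to 16.0E.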
N Clark	bcdb96a3e1	Additional Information for Zedlets * Add ZPOOL pool state to zfs_post_common to allow differentiation between export and destroy by zedlets. * Add pool name as standard export This ensures pool name is exported to zedlets. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Don Brady <don.brady@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Nathaniel Clark <nathaniel.l.clark@intel.com> Closes #5942	2017-04-03 14:23:02 -07:00
Gvozden Neskovic	84c07adadb	Remove dependency on linear ABD Wherever possible it's best to avoid depending on a linear ABD. Update the code accordingly in the following areas. - vdev_raidz - zio, zio_checksum - zfs_fm - change abd_alloc_for_io() to use abd_alloc() Reviewed-by: David Quigley <david.quigley@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Closes #5668	2017-03-29 12:24:51 -07:00
LOLi	ff61d1a495	Check ashift validity in 'zpool add' `df83110` added the ability to specify a custom "ashift" value from the command line in 'zpool add' and 'zpool attach'. This commit adds additional checks to the provided ashift to prevent invalid values from being used, which could result in disastrous consequences for the whole pool. Additionally provide ASHIFT_MAX and ASHIFT_MIN definitions in spa.h. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #5878	2017-03-28 17:21:11 -07:00
Chunwei Chen	12aec7dcd9	Fix wrong offset args in vdev_cache_write The offset arguments are wrong after changing to abd_copy_off in `a6255b7` Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #5932 Closes #5936	2017-03-28 11:06:22 -07:00
Brian Behlendorf	8d70398740	Retry zfs_znode_alloc() in zfs_mknode() For historical reasons zfs_mknode() was written such that it could never fail. This poses a problem for Linux since zfs_znode_alloc() could potentially fail due to low memory. Handle this gracefully by retrying zfs_znode_alloc() until it succeeds; direct reclaim will eventually be able to allocate memory. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5535 Closes #5908	2017-03-23 18:26:50 -07:00
George Wilson	55922e73b4	OpenZFS 3821 - Race in rollback, zil close, and zil flush Authored by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Approved by: Richard Lowe <richlowe@richlowe.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/3821 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/43297f9 Closes #5905	2017-03-23 18:20:58 -07:00
wli5	6a9d635998	GZIP compression offloading with QAT accelerator This patch implements the hardware accelerator method for GZIP compression in ZFS. When GZIP compression is enabled on a ZFS pool, the compression API will be automatically transferred to the hardware accelerator to free up CPU resources and speed up compression. * To enable Intel QAT hardware acceleration in ZOL you need to have QAT hardware and the driver installed: * QAT hardware DH8950: http://ark.intel.com/products/79483/Intel-QuickAssist-Adapter-8950 * QAT driver: https://01.org/intel-quickassist-technology * Start QAT driver in your system: service qat_service start * Enable QAT in ZFS, e.g.: ./configure --with-qat=<qat-driver-path>/QAT1.6 make * Set GZIP compression in ZFS dataset: zfs set compression=gzip <dataset> * Get QAT hardware statistics by: cat /proc/spl/kstat/zfs/qat * To disable QAT in ZFS: insmod zfs.ko zfs_qat_disable=1 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com> Signed-off-by: Weigang Li <weigang.li@intel.com> Closes #5846	2017-03-22 17:58:47 -07:00
Matthew Ahrens	64fc776208	OpenZFS 7968 - multi-threaded spa_sync() Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Matthew Ahrens <mahrens@delphix.com> spa_sync() iterates over all the dirty dnodes and processes each of them by calling dnode_sync(). If there are many dirty dnodes (e.g. because we created or removed a lot of files), the single thread of spa_sync() calling dnode_sync() can become a bottleneck. Additionally, if many dnodes are dirtied concurrently in open context (e.g. due to concurrent file creation), the os_lock will experience lock contention via dnode_setdirty(). The solution is to track dirty dnodes on a multilist_t, and for spa_sync() to use separate threads to process each of the sublists in the multilist. OpenZFS-issue: https://www.illumos.org/issues/7968 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4a2a54c Closes #5752	2017-03-20 18:36:00 -07:00
Olaf Faaland	a3478c0747	Linux 4.11 compat: iops.getattr and friends In torvalds/linux@a528d35, there are changes to the getattr family of functions, struct kstat, and the interface of inode_operations .getattr. The inode_operations .getattr and simple_getattr() interface changed to: int (*getattr) (const struct path *, struct dentry *, struct kstat *, u32 request_mask, unsigned int query_flags) The request_mask argument indicates which field(s) the caller intends to use. Fields the caller has not specified via request_mask may be set in the returned struct anyway, but their values may be approximate. The query_flags argument indicates whether the filesystem must update the attributes from the backing store. Currently both fields are ignored. It is possible that getattr-related functions within zfs could be optimized based on the request_mask. struct kstat includes new fields: u32 result_mask; /* What fields the user got */ u64 attributes; /* See STATX_ATTR_* flags */ struct timespec btime; /* File creation time */ Fields attribute and btime are cleared; the result_mask reflects this. These appear to be optional based on simple_getattr() and vfs_getattr() within the kernel, which take the same approach. Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #5875	2017-03-20 17:51:16 -07:00
Matthew Ahrens	9522bd2429	OpenZFS 7801 - add more by-dnode routines (lint) Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/7801 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f25efb3 Closes #5894	2017-03-20 12:33:17 -07:00
Matthew Ahrens	8614ddf9b4	OpenZFS 6874 - rollback and receive need to reset ZPL state to what's on disk Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> When we do a clone swap (caused by "zfs rollback" or "zfs receive"), the ZPL doesn't completely reload the state from the DMU; some values remain cached in the zfsvfs_t. OpenZFS-issue: https://www.illumos.org/issues/6874 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1fdcbd0 Closes #5888	2017-03-13 17:42:42 -07:00
Brian Behlendorf	1c2555ef92	Restructure mount option handling Restructure the handling of mount options to be consistent with upstream OpenZFS. This required making the following changes. - The zfs_mntopts_t was renamed vfs_t and adjusted to provide the minimal needed functionality. This includes a pointer back to the associated zfsvfs_t. Plus it made it possible to revert zfs_register_callbacks() and zfsvfs_create() back to their original prototypes. - A zfs_mnt_t structure was added for the sole purpose of providing a structure to pass the osname and raw mount pointer to zfs_domount() without having to copy them. - Mount option parsing was moved down from the zpl_* wrapper functions in to the zfs_* functions. This allowed for the code to be simplified and it's where similar functionality appears on other platforms. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2017-03-10 09:51:41 -08:00
Brian Behlendorf	f298b24ddf	Rename zfs_* functions Several functions were renamed when ZFS was originally ported to Linux. Revert the code to the original names to minimize the delta with upstream OpenZFS. zfs_sb_teardown -> zfsvfs_teardown zfs_sb_create -> zfsvfs_create zfs_sb_setup -> zfsvfs_setup zfs_sb_free -> zfsvfs_free get_zfs_sb -> getzfsvfs zfs_sb_hold -> zfsvfs_hold zfs_sb_rele -> zfsvfs_rele zfs_sb_prune_aliases -> zfs_prune_aliases (Linux-only) zfs_sb_prune -> zfs_prune (Linux only) Align the zfs_vnops.h and zfs_vfsops.h with upstream as much as possible. Several prototypes were removed and those that remain were reordered. Move the EXPORT_SYMBOL lines to the end of the source files for consistency with the other source files. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2017-03-10 09:51:35 -08:00
Brian Behlendorf	0037b49e83	Rename zfs_sb_t -> zfsvfs_t The use of zfs_sb_t instead of zfsvfs_t results in unnecessary conflicts with the upstream source. Change all instances of zfs_sb_t to zfsvfs_t including updating the variable names. Whenever possible the code was updated to be consistent with how it appears in the upstream OpenZFS source. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2017-03-10 09:51:33 -08:00
Brian Behlendorf	ef1bdf363c	Fix ZVOL BLKFLSBUF ioctl The BLKFLSBUF ioctl is expected to do two things: - flush dirty pages to stable storage, and - invalidate clean pages Unfortunately, the existing implementation of BLKFLSBUF in zvol_ioctl() only flushes pages which are part of the current TXG to disk. There may be additional dirty pages in the page cache which haven't yet been submitted to the DMU and therefore aren't part of any TXG. Furthermore because zvol_ioctl() returns 0 the generic blkdev_flushbuf() does not invalidate the page cache. Resolve the issue by moving bdev_flush() in to zvol_ioctl() and explicitly waiting for a full TXG sync. Then invalidate the page cache. The associated ARC buffers need not be evicted since they cannot be bypassed using O_DIRECT. Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5871 Closes #5879	2017-03-09 17:43:36 -08:00
Giuseppe Di Natale	589bb918ef	Suppress cppcheck nullPointer error in zfs_write Newer versions of cppcheck find the potential NULL pointer bug in zfs_write(). The function is difficult to refactor without extensive work, so suppress the potential NULL pointer error which cannot occur for now. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov> Closes #5882	2017-03-09 17:40:21 -08:00
Chunwei Chen	9b77d1c958	Fix nfs snapdir automount The current implementation for allowing nfs to access snapdir is very buggy. It uses a special fh for snapdirs, such that the next time nfsd does fh_to_dentry, it actually returns the root inode inside the snapshot. So nfsd never knows it crosses a mountpoint. The problem is that nfsd will not hold a reference on the vfsmount of the snapshot. This causes the auto unmounter to unmount the snapshot even though nfs is still holding dentries in it. To fix this, we return the inode for the snapdirs themselves. However, we also trigger automount upon fh_to_dentry, and return ESTALE so nfsd will revalidate and see the mountpoint and do crossmnt. Because nfsd will now be aware that these are different filesystems, users must add crossmnt to their export options to access snapshot directories. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #3794 Closes #4716 Closes #5810 Closes #5833	2017-03-08 09:26:33 -08:00
Andriy Gapon	423e7b6261	OpenZFS 7867 - ARC space accounting leak Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Tim Chase <tim@chase2k.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/7867 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/aa1f740d Closes #5874	2017-03-07 18:14:32 -08:00
Brian Behlendorf	3ec3bc2167	OpenZFS 7793 - ztest fails assertion in dmu_tx_willuse_space Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Background information: This assertion about tx_space_* verifies that we are not dirtying more stuff than we thought we would. We “need” to know how much we will dirty so that we can check if we should fail this transaction with ENOSPC/EDQUOT, in dmu_tx_assign(). While the transaction is open (i.e. between dmu_tx_assign() and dmu_tx_commit(), typically less than a millisecond), we call dbuf_dirty() on the exact blocks that will be modified. Once this happens, the temporary accounting in tx_space_* is unnecessary, because we know exactly what blocks are newly dirtied; we call dnode_willuse_space() to track this more exact accounting. The fundamental problem causing this bug is that dmu_tx_hold_*() relies on the current state in the DMU (e.g. dn_nlevels) to predict how much will be dirtied by this transaction, but this state can change before we actually perform the transaction (i.e. call dbuf_dirty()). This bug will be fixed by removing the assertion that the tx_space_* accounting is perfectly accurate (i.e. we never dirty more than was predicted by dmu_tx_hold_*()). By removing the requirement that this accounting be perfectly accurate, we can also vastly simplify it, e.g. removing most of the logic in dmu_tx_count_*(). The new tx space accounting will be very approximate, and may be more or less than what is actually dirtied. It will still be used to determine if this transaction will put us over quota. Transactions that are marked by dmu_tx_mark_netfree() will be excepted from this check. 
We won’t make an attempt to determine how much space will be freed by the transaction — this was rarely accurate enough to determine if a transaction should be permitted when we are over quota, which is why dmu_tx_mark_netfree() was introduced in 2014. We also won’t attempt to give “credit” when overwriting existing blocks, if those blocks may be freed. This allows us to remove the do_free_accounting logic in dbuf_dirty(), and associated routines. This logic attempted to predict what will be on disk when this txg syncs, to know if the overwritten block will be freed (i.e. exists, and has no snapshots). OpenZFS-issue: https://www.illumos.org/issues/7793 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3704e0a Upstream bugs: DLPX-32883a Closes #5804 Porting notes: - DNODE_SIZE replaced with DNODE_MIN_SIZE in dmu_tx_count_dnode(), Using the default dnode size would be slightly better. - DEBUG_DMU_TX wrappers and configure option removed. - Resolved _by_dnode() conflicts these changes have not yet been applied to OpenZFS.	2017-03-07 09:51:59 -08:00
Brian Behlendorf	e2fcb56275	OpenZFS 7843 - get_clones_stat() is suboptimal for lots of clones Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/7843 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4d519e7 Closes #5868	2017-03-07 09:47:40 -08:00
Chunwei Chen	7a789346af	Fix loop device becomes read-only Commit `933ec99` removes read and write from f_op because the vfs layer will select iter_write or aio_write automatically. However, for Linux <= 4.0, loop_set_fd will actually check f_op->write and set the device read-only if it does not exist. This patch adds them back and uses the generic do_sync_{read,write} for aio_{read,write} and new_sync_{read,write} for {read,write}_iter. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #5776 Closes #5855	2017-03-06 09:20:20 -08:00
Olaf Faaland	4859fe796c	Linux 4.11 compat: avoid refcount_t name conflict Linux 4.11 introduces a new type, refcount_t, which conflicts with the type of the same name defined within ZFS. Rename the ZFS type zfs_refcount_t. Within the ZFS code, use a macro to cause references to refcount_t to be changed to zfs_refcount_t at compile time. This reduces conflicts when later landing OpenZFS patches. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #5823 Closes #5842	2017-02-28 16:10:18 -08:00
Matthew Ahrens	66eead53c9	Clean up by-dnode code in dmu_tx.c `0eef1bde31` introduced some changes which we slightly improved the style of when porting to illumos. There is also one minor error-handling fix, in zap_add() the "zap" may become NULL in case of an error re-opening the ZAP. Originally suggested at: https://github.com/openzfs/openzfs/pull/276 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #5805	2017-02-24 13:34:26 -08:00
Isaac Huang	f7e76821c5	ABD style cleanups The commit `a6255b7fce` removed a few assertions which help catch errors and improve code readability. It also duplicated two conditionals, which was unnecessary and made the code confusing to read. This patch cleans it up. Reviewed-by: David Quigley <david.quigley@intel.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Isaac Huang <he.huang@intel.com> Closes #5802	2017-02-24 12:05:42 -08:00
Daniel Hoffman	9e2c3bb4b9	OpenZFS 7812 - Remove gender specific language Authored by: Daniel Hoffman <dj.hoffman@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: Chris Williamson <chris.williamson@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Igor Kozhukhov <igor@dilos.org> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> This change removes all gendered language that did not refer specifically to an individual person or pet. The convention taken was to use variations on "they" when referring to users and/or human beings, while using "it" when referring to code, functions, and/or libraries. Additionally, we took the liberty to fix up any whitespace issues that were found in any files that were already being modified. OpenZFS-issue: https://www.illumos.org/issues/7812 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ad626db Closes #5822	2017-02-24 11:07:04 -08:00
Andriy Gapon	0efd97912a	OpenZFS 7199 - dsl_dataset_rollback_sync may try to free already free blocks 7200 no blocks must be born in a txg after a snapshot is created Authored by: Andriy Gapon <andriy.gapon@clusterhq.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Approved by: Gordon Ross <gordon.w.ross@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7199 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/bfaed0b Closes #5817	2017-02-24 11:05:33 -08:00
Isaac Huang	6d82f98c3d	Fix incorrect spare vdev state after replacing After a hot spare replaces an OFFLINE vdev, the new parent spare vdev state is set incorrectly to OFFLINE. The correct state should be DEGRADED. The incorrect OFFLINE state will prevent top-level vdev from reading the spare vdev, thus causing unnecessary reconstruction. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Don Brady <don.brady@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Isaac Huang <he.huang@intel.com> Closes #5766 Closes #5770	2017-02-23 10:32:15 -08:00
Matthew Ahrens	c30e58c462	zfs_arc_num_sublists_per_state should be common to all multilists The global tunable zfs_arc_num_sublists_per_state is used by the ARC and the dbuf cache, and other users are planned. We should change this tunable to be common to all multilists. This tuning may be overridden on a per-multilist basis. Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #5764	2017-02-15 15:49:33 -08:00
Tim Chase	544596c59e	Fix zfs_compressed_arc_enabled parameter description A likely cut/paste error caused the description to be applied to zfs_arc_average_blocksize. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #5788	2017-02-13 10:59:05 -08:00
Chunwei Chen	d6df043c53	Fix off by one in zpl_lookup Doing the following command would return success with zfs creating an orphan object. touch $(for i in $(seq 256); do printf "n"; done) The funny thing is that this will only work once for each directory, because after upgraded to fzap, zfs_lookup would fail properly since it has additional length check. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5768	2017-02-11 12:42:17 -08:00
Matthew Ahrens	d7958b4cda	OpenZFS 7104 - increase indirect block size Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7104 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4b5c8e9 Closes #5679	2017-02-09 10:27:02 -08:00
Matthew Ahrens	df7eeccc75	panic in bpobj_space(): null pointer dereference This is a race condition in the deadlist code. A thread executing an administrative command that uses dsl_deadlist_space_range() holds the lock of the whole deadlist_t to protect the access of all its entries that the deadlist contains in an avl tree. Sync threads trying to insert a new entry in the deadlist (through dsl_deadlist_insert() -> dle_enqueue()) do not hold the deadlist lock at that moment. If the dle_bpobj is the empty bpobj (our sentinel value), we close and reopen it. Between these two operations, it is possible for the dsl_deadlist_space_range() thread to dereference that bpobj which is NULL during that window. Threads should hold a deadlist's dl_lock when they manipulate its internal data so scenarios like the one above are avoided. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #5762	2017-02-09 10:19:12 -08:00
Brian Behlendorf	ea7e86d8db	Fix iput() calls within a tx As explicitly stated in section 2 of the 'Programming rules' comments at the top of zfs_vnops.c. If you must call iput() within a tx then use zfs_iput_async(). Move iput() calls after dmu_tx_commit() / dmu_tx_abort when possible. When not possible convert the iput() calls to zfs_iput_async(). Reviewed-by: Don Brady <don.brady@intel.com> Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5758	2017-02-08 17:28:22 -08:00
Matthew Ahrens	4a5d7f8267	Allow c99 code to compile Add the appropriate compiler flags to accept c99 code. This will help to minimize differences with upstream, and aid porting changes. One change was necessary in zvol.c because the DEFINE_IDA() macro does not work with the new compiler flags. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #5756	2017-02-08 09:27:48 -08:00
David Quigley	bef78122e6	Add missing module_param for zfs_per_txg_dirty_frees_percent When the code was added this tunable was not exposed via module params. Also it was not documented. This patch changes the type from a uint32 to a ulong as done with other percentage tunables and also documents it in the zfs-module-parameters man page. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: David Quigley <david.quigley@intel.com> Closes #5750	2017-02-07 09:44:03 -08:00
George Melikov	298ec40b6d	OpenZFS 7448 - ZFS doesn't notice when disk vdevs have no write cache Authored by: Hans Rosenfeld <hans.rosenfeld@nexenta.com> Reviewed by: Dan Fields <dan.fields@nexenta.com> Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Reviewed-by: Don Brady <don.brady@intel.com> Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7448 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/295438b Closes #5737	2017-02-04 09:23:50 -08:00
George Melikov	0a252daed3	OpenZFS 7504 - kmem_reap hangs spa_sync and administrative tasks Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tim Chase <tim@chase2k.com> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7504 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/405a5a0 Closes #5736	2017-02-04 09:21:25 -08:00
George Melikov	2e0e443ac4	OpenZFS 7247 - zfs receive of deduplicated stream fails Authored by: Chris Williamson <chris.williamson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7247 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/2ad25b4 Closes #5689 Porting notes: - tests/zfs-tests/tests/functional/cli_root/zfs_receive/zfs_receive_013_pos.ksh renamed as zfs_receive_015_pos.ksh, zfs_receive_013_pos.ksh is now used for OpenZFS test. - libzfs_sendrecv.c: SMALLEST_POSSIBLE_MAX_DDT_MB is always used for all 32-bit builds.	2017-02-04 09:10:24 -08:00
George Melikov	9b7b9cd370	OpenZFS 1300 - filename normalization doesn't work for removes Authored by: Kevin Crowe <kevin.crowe@nexenta.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/1300 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8f1750d Closes #5725 Porting notes: - zap_micro.c: all `MT_EXACT` are replaced by `0`	2017-02-02 14:13:41 -08:00
Chunwei Chen	c7af63d62a	Fix write(2) returns zero bug from `933ec99` For generic_write_checks with 2 args, we can exit when it returns zero because it means count is zero. However this is not the case for generic_write_checks with 4 args, where zero means no error. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Haakan T Johansson <f96hajo@chalmers.se> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #5720 Closes #5726	2017-02-02 09:43:42 -08:00
Giuseppe Di Natale	fc386db191	Remove lint ifdef checks in zdb and dbuf This is effectively dead code for the Linux implementation which can be removed to improve readability. We want the linter to check the real production/debug build as much as possible. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov> Closes #5722	2017-02-01 16:47:04 -08:00
George Melikov	0f676dc228	OpenZFS 7072 - zfs fails to expand if lun added when os is in shutdown state Authored by: George Wilson <george.wilson@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7072 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c39a2aa Closes #5694 Porting notes: - vdev.c: 'vdev_get_stats' changes are moved to 'vdev_get_stats_ex'. - vdev_disk.c: ignored, Linux specific code is different.	2017-02-01 13:14:02 -08:00
David Quigley	2fe36b0bfb	Use fletcher_4 routines natively with `abd_iterate_func()` This patch adds the necessary infrastructure for ABD to make use of the vectorized fletcher 4 routines. - export ABD compatible interface from fletcher_4 - add ABD fletcher_4 tests for data and metadata ABD types. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Original-patch-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: David Quigley <david.quigley@intel.com> Closes #5589	2017-02-01 09:34:22 -08:00
George Melikov	539d33c791	OpenZFS 6569 - large file delete can starve out write ops Authored by: Alek Pinchuk <alek@nexenta.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> Tested-by: kernelOfTruth <kerneloftruth@gmail.com> OpenZFS-issue: https://www.illumos.org/issues/6569 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1bf4b6f2 Closes #5706	2017-01-31 14:44:03 -08:00
Giuseppe Di Natale	d69a321e56	OpenZFS 7545 - zdb should disable reference tracking Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> Porting Notes: Moved reference_tracking_enable and reference_history outside of ZFS_DEBUG. OpenZFS-issue: https://www.illumos.org/issues/7545 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4dd77f9 Closes #5701	2017-01-31 14:36:35 -08:00
Tim Chase	b81a3ddc32	Update deadman operation to better align with upstream OpenZFS The deadman in ZoL didn't behave quite as it did in upstream OpenZFS. In addition to the 2 purposes for which OpenZFS used the zfs_deadman_synctime_ms parameter, ZoL also used it to determine how frequently the deadman would fire once it has been triggered. This patch adds the zfs_deadman_checktime_ms parameter to control how frequently the subsequent checks are performed. The deadman is now disabled for suspended pools. As had been the case, unlike upstream OpenZFS, ZoL will not panic when a hung IO is detected. The module parameter documentation has been updated to include the new parameter and to better describe the operation of the deadman. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #5695	2017-01-31 14:19:08 -08:00
George Melikov	005e27e3b3	OpenZFS 7019 - zfsdev_ioctl skips secpolicy when FKIOCTL is set Authored by: Alex Wilson <alex.wilson@joyent.com> Reviewed by: Robert Mustacchi <rm@joyent.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7019 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45b1747 Closes #5709	2017-01-31 10:24:23 -08:00
George Melikov	6325e48f95	OpenZFS 7136 - ESC_VDEV_REMOVE_AUX ought to always include vdev information Authored by: Alan Somers <asomers@gmail.com> 7115 6922 generates ESC_ZFS_VDEV_REMOVE_AUX a bit too often Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7136 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b72b6bb Closes #5691 Porting notes: - Functionally this patch behaves the same as the OpenZFS version but it was adapted because ZoL doesn't have the same illumos sysevent_t infrastructure and functionality.	2017-01-31 10:19:36 -08:00
George Melikov	41425f79da	OpenZFS 7490 - real checksum errors are silenced when zinject is on Authored by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7490 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/6cedfc3 Closes #5693	2017-01-30 17:12:58 -08:00
George Melikov	e2da829cc1	OpenZFS 6922 - Emit ESC_ZFS_VDEV_REMOVE_AUX after removing an aux device Authored by: Alan Somers <asomers@gmail.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/6922 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/63364b0 Closes #5690	2017-01-30 15:33:46 -08:00
George Melikov	fa603f8233	OpenZFS 7277 - zdb should be able to print zfs_dbgmsg's Porting notes: - 'zfs_dbgmsg_print()' reintroduced to userspace. Authored by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7277 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/29bdd2f Closes #5684	2017-01-28 12:16:43 -08:00
Brian Behlendorf	a32494d22a	Fix suspend Godfather I/Os io_reexecute bits After resuming a pool the godfather zio could have both the ZIO_REEXECUTE_NOW and ZIO_REEXECUTE_SUSPEND bits set. This can occur if some child zios set ZIO_REEXECUTE_NOW while others set ZIO_REEXECUTE_SUSPEND. The godfather zio can inherit both flags in zio_notify_parent(). The child zios which assigned the ZIO_REEXECUTE_SUSPEND flag will be removed from the godfather's child list and added to the spa->spa_suspend_zio_root child list, while child zios with the ZIO_REEXECUTE_NOW bit set remain monitored by the godfather zio. When the godfather zio executes zio_done() the presence of the ZIO_REEXECUTE_SUSPEND bit results in all io_reexecute bits being cleared. These child zios will then not be re-executed and instead will be destroyed and lost. The most straightforward way to address this situation is to only clear the ZIO_REEXECUTE_SUSPEND bit and leave the ZIO_REEXECUTE_NOW bit set. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: yuxiang <guo.yong33@zte.com.cn>	2017-01-28 12:13:34 -08:00
George Melikov	721ed0ee86	OpenZFS 7580 - ztest failure in dbuf_read_impl Authored by: George Wilson <george.wilson@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7580 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3105d95 Closes #5678	2017-01-28 12:11:09 -08:00
George Melikov	a08abc1bb3	OpenZFS 7301 - zpool export -f should be able to interrupt file freeing Authored by: Alek Pinchuk <alek@nexenta.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Approved by: Gordon Ross <gordon.ross@nexenta.com> Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7301 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/eb72182 Closes #5680	2017-01-27 11:46:39 -08:00
George Melikov	cc9bb3e58e	OpenZFS 7254 - ztest failed assertion in ztest_dataset_dirobj_verify: dirobjs + 1 == usedobjs Authored by: Paul Dagnelie <pcd@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7254 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c166b69 Closes #5670	2017-01-27 11:43:42 -08:00
Chunwei Chen	933ec99951	Retire .write/.read file operations The .write/.read file operations callbacks can be retired since support for .read_iter/.write_iter and .aio_read/.aio_write has been added. The vfs_write()/vfs_read() entry functions will select the correct interface for the kernel. This is desirable because all VFS write/read operations now rely on common code. This change also adds the generic write checks to make sure that ulimits are enforced correctly on write. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #5587 Closes #5673	2017-01-27 10:43:39 -08:00
Brian Behlendorf	986dd8aacc	OpenZFS 5561 - support root pools on EFI/GPT partitioned disks Reviewed by: Jean McCormack <jean.mccormack@nexenta.com> Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Approved by: Dan McDonald <danmcd@omniti.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/5561 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1a902ef Closes #5672	2017-01-27 10:40:02 -08:00
Tim Chase	258553d3d7	OpenZFS 7613 - ms_freetree[4] is only used in syncing context metaslab_t:ms_freetree[TXG_SIZE] is only used in syncing context. We should replace it with two trees: the freeing tree (ranges that we are freeing this syncing txg) and the freed tree (ranges which have been freed this txg). Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Tim Chase <tim@chase2k.com> OpenZFS-issue: https://www.illumos.org/issues/7613 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a8698da2 Closes #5598	2017-01-26 15:27:19 -08:00
George Melikov	9c9531cb6f	OpenZFS 7500 - Simplify dbuf_free_range by removing dn_unlisted_l0_blkid Authored by: Stephen Blinick <stephen.blinick@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Gordon Ross <gordon.w.ross@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7500 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/653af1b Closes #5639	2017-01-26 15:15:48 -08:00
George Melikov	39efbde7c5	OpenZFS 6676 - Race between unique_insert() and unique_remove() causes ZFS fsid change Authored by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com> Reviewed by: Dan Vatca <dan.vatca@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/6676 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/40510e8 Closes #5667	2017-01-26 14:43:28 -08:00
George Melikov	1149ba6478	OpenZFS 7606 - dmu_objset_find_dp() takes a long time while importing pool When importing a pool with a large number of filesystems within the same parent filesystem, we see that dmu_objset_find_dp() takes a long time. It is called from 3 places: spa_check_logs(), spa_ld_claim_log_blocks(), and spa_load_verify(). There are several ways to improve performance here: 1. We don't really need to do spa_check_logs() or spa_ld_claim_log_blocks() if the pool was closed cleanly. 2. spa_load_verify() uses dmu_objset_find_dp() to check that no datasets have too long of names. 3. dmu_objset_find_dp() is slow because it's doing zap_value_search() (which is O(N sibling datasets)) to determine the name of each dsl_dir when it's opened. In this case we actually know the name when we are opening it, so we can provide it and avoid the lookup. This change implements fix #3 from the above list; i.e. make dmu_objset_find_dp() provide the name of the dataset so that we don't have to search for it. Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prashanth Sreenivasa <prashksp@gmail.com> Reviewed-by: David Quigley <david.quigley@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7606 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/cac6bab Closes #5662	2017-01-26 12:46:02 -08:00
George Melikov	ec923db25c	OpenZFS 7180 - potential race between zfs_suspend_fs+zfs_resume_fs and zfs_ioc_rename Authored by: Andriy Gapon <andriy.gapon@clusterhq.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7180 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/690041b Closes #5627	2017-01-23 10:53:46 -08:00
George Melikov	cffd6e1167	OpenZFS 3746 - ZRLs are racy Authored by: Will Andrews <will@freebsd.org> Reviewed by: Boris Protopopov <bprotopopov@hotmail.com> Reviewed by: Pavel Zakharov <pavel.zakha@gmail.com> Reviewed by: Yuri Pankov <yuri.pankov@gmail.com> Reviewed by: Justin T. Gibbs <gibbs@scsiguy.com> Approved by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/3746 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/260af64 Closes #5625	2017-01-23 10:35:58 -08:00
George Melikov	f85c06bedf	OpenZFS 7054 - dmu_tx_hold_t should use refcount_t to track space Authored by: Igor Kozhukhov <ikozhukhov@gmail.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7054 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0c779ad Closes #5600	2017-01-23 09:36:24 -08:00
George Melikov	4ea3f86426	codebase style improvements for OpenZFS 6459 port	2017-01-22 13:25:40 -08:00
Chunwei Chen	040dab9939	Suspend/resume zvol for recv and rollback When doing recv and rollback, dsl_dataset_clone_swap_sync_impl will be called to swap out the ds_objset and do dmu_objset_evict on the old one. However, currently zv->zv_objset will not be swapped out accordingly, so if anyone currently holds a fd on the zvol, we risk hitting a use-after-free. We fix this by introducing the suspend and resume mechanism of zsb to zv. Before recv or rollback, we use zvol_suspend to block all access to zv_objset and shut it down. After the recv or rollback, we use zvol_resume to swap in zv_objset with the new ds_objset and unblock the access. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #4866 Closes #5609	2017-01-19 13:56:36 -08:00
George Melikov	76fe529b39	OpenZFS 6529 - Properly handle updates of variably-sized SA entries Porting notes: - This issue was first fixed in ZoL by commit d862cb0d. That fix was then modified and an equivalent version of the patch landed in the upstream code base. For additional details see the discussion in https://github.com/openzfs/openzfs/pull/24 . This commit aligns ZoL with OpenZFS codebase. Authored by: Andriy Gapon <avg@icyb.net.ua> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Ned Bass <bass6@llnl.gov> Reviewed by: Tim Chase <tim@chase2k.com> Approved by: Gordon Ross <gwr@nexenta.com> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/6529 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/e7e978b Closes #5606	2017-01-19 13:50:22 -08:00
George Melikov	34a6b42844	OpenZFS 7659 - Missing thread_exit() in dmu_send.c Two threads send_traverse_thread() and receive_writer_thread() should end with thread_exit(); Mostly a cosmetic issue under IllumOS. Authored by: Jorgen Lundman <lundman@lundman.net> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7659 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a569268 Closes #5603	2017-01-18 15:10:35 -08:00
George Melikov	7330fc57b7	OpenZFS 7235 - remove unused func dsl_dataset_set_blkptr Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7235 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/bd56f80 Closes #5604	2017-01-17 15:22:56 -08:00
George Melikov	61ca48ff38	OpenZFS 7256 - low probability race in zfs_get_data Authored by: Andriy Gapon <andriy.gapon@clusterhq.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7256 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/6ed18a8 Closes #5601	2017-01-17 15:18:59 -08:00
George Melikov	e88551d52f	OpenZFS 7071 - lzc_snapshot does not fill in errlist on ENOENT Authored by: Igor Kozhukhov <ikozhukhov@gmail.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7071 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/25f7d99 Closes #5597	2017-01-17 14:52:17 -08:00
George Melikov	cf7d1484bf	OpenZFS 7082 - bptree_iterate() passes wrong args to zfs_dbgmsg() Authored by: Igor Kozhukhov <ikozhukhov@gmail.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/7082 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/10e67aa Closes #5596	2017-01-17 14:49:24 -08:00
Brian Behlendorf	832805d951	OpenZFS 6586 - Whitespace inconsistencies in the spa feature dependency arrays in zfeature_common.c Porting Notes: - Preserved 'static const spa_feature_t hole_birth_deps[]'. Authored by: ilovezfs <ilovezfs@icloud.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Richard Laager <rlaager@wiktel.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6586 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/22b6687 Closes #5592	2017-01-17 14:46:28 -08:00
LOLi	08f0510d87	Fix unallocated object detection for large_dnode datasets Fix dmu_object_next() to correctly handle unallocated objects on large_dnode datasets. We implement this by scanning the dnode block until we find the correct offset to be used in dnode_next_offset(). This is necessary because we can't assume *objectp is a hole even if dmu_object_info() returns ENOENT. This fixes a couple of issues with zfs receive on large_dnode datasets. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #5027 Closes #5532	2017-01-13 15:47:34 -08:00
Brian Behlendorf	5043684ae5	OpenZFS 7603 - xuio_stat_wbuf_* should be declared (void) Porting Notes: - include/sys/dmu.h prototypes were already updated in `0bc8fd7` Authored by: Prashanth Sreenivasa <pks@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Approved by: Richard Lowe <richlowe@richlowe.net> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/7603 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/99aa8b5 Closes #5586	2017-01-13 15:33:14 -08:00
Brian Behlendorf	9775e98844	OpenZFS 7181 - race between zfs_mount and zfs_ioc_rollback Authored by: Andriy Gapon <andriy.gapon@clusterhq.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Gordon Ross <gordon.w.ross@gmail.com> Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/7181 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/90f2c09 Closes #5585	2017-01-13 15:29:32 -08:00
bzzz77	0eef1bde31	Add _by-dnode routines Add _by_dnode() routines for accessing objects given their dnode_t *, this is more efficient than accessing the object by (objset_t *, uint64_t object). This change converts some but not all of the existing consumers. As performance-sensitive code paths are discovered they should be converted to use these routines. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com> Closes #5534 Issue #4802	2017-01-13 14:58:41 -08:00
Don Brady	38640550f2	OpenZFS 7743 - per-vdev-zaps init path for upgrade Authored by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Don Brady <don.brady@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Joe Stein <jas14@cs.brown.edu> Ported-by: Don Brady <don.brady@intel.com> When loading a pool that had been created before the existence of per-vdev zaps, on a system that knows about per-vdev zaps, the per-vdev zaps will not be allocated and initialized. This appears to be because the logic that would have done so, in spa_sync_config_object(), is not reached under normal operation. It is only reached if spa_config_dirty_list is non-empty. The fix is to add another `AVZ_ACTION_` enum that will allow this code to be reached when we detect that we're loading an old pool, even when there are no dirty configs. OpenZFS-issue: https://www.illumos.org/issues/7743 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/e2d29d0 Closes #5582	2017-01-13 13:50:22 -08:00
George Melikov	0bc63d83f6	OpenZFS 6603 - zfeature_register() should verify ZFEATURE_FLAG_PER_DATASET implies SPA_FEATURE_EXTENSIBLE_DATASET Authored by: ilovezfs <ilovezfs@icloud.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Richard Laager <rlaager@wiktel.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/6603 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0803e91 Closes #5573	2017-01-12 11:58:04 -08:00
Don Brady	4e21fd060a	OpenZFS 7303 - dynamic metaslab selection This change introduces a new weighting algorithm to improve metaslab selection. The new weighting algorithm relies on the SPACEMAP_HISTOGRAM feature. As a result, the metaslab weight now encodes the type of weighting algorithm used (size-based vs segment-based). Porting Notes: The metaslab allocation tracing code is conditionally removed on linux (dependent on mdb debugger). Authored by: George Wilson <george.wilson@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Chris Siden <christopher.siden@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Don Brady <don.brady@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Don Brady <don.brady@intel.com> OpenZFS-issue: https://www.illumos.org/issues/7303 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d5190931bd Closes #5404	2017-01-12 11:52:56 -08:00
George Melikov	e9aa730c49	OpenZFS 6328 - Fix cstyle errors in zfs codebase Authored by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed by: Jorgen Lundman <lundman@lundman.net> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: George Melikov <mail@gmelikov.ru> OpenZFS-issue: https://www.illumos.org/issues/6328 OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/9a686fb Closes #5579	2017-01-12 09:42:11 -08:00
ka7	4e33ba4c38	Fix spelling Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Haakan T Johansson <f96hajo@chalmers.se> Closes #5547 Closes #5543	2017-01-03 11:31:18 -06:00
Tim Chase	a5e046eaac	4.10 compat - BIO flag changes and others [bio] The req_op enum was changed to req_opf. Update the "Linux 4.8 API" autotools checks to use an int to determine whether the various REQ_OP values are defined. This should work properly on kernels >= 4.8. [bio] bio_set_op_attrs() is now an inline function and can't be detected with #ifdef. Add a configure check to determine whether bio_set_op_attrs() is defined. Move the local definition of it from vdev_disk.c to blkdev_compat.h for consistency with other related compatibility shims. [bio] The read/write flags and their modifiers, including WRITE_FLUSH, WRITE_FUA and WRITE_FLUSH_FUA have been removed from fs.h. Add the new bio_set_flush() compatibility wrapper to replace VDEV_WRITE_FLUSH_FUA and set the flags appropriately for each supported kernel version. [vfs] The generic_readlink() function has been made static. If .readlink in inode_operations is NULL, generic_readlink() is used. [zol typo] Completely unrelated to 4.10 compat, fix a typo in the check for REQ_OP_SECURE_ERASE so that the proper macro is defined: s/HAVE_REQ_OP_SECURE_DISCARD/HAVE_REQ_OP_SECURE_ERASE/ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #5499	2016-12-30 16:03:59 -06:00
LOLi	3500a14595	Don't persist temporary pool name on devices Fix a regression accidentally introduced by `e0ab3ab`. Additionally, add a new script zpool_import_014_pos.ksh to the ZFS test suite to exercise 'zpool import -t' functionality. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #5466 Closes #5515	2016-12-22 10:39:00 -08:00
Chunwei Chen	da8f51e16a	Use a dedicated taskq for vdev_file The introduction of parallel zvol prefetch causes deadlock when using vdev_file. spa_async->(spa_namespace_lock)->txg_wait_synced->(wait for txg_sync) txg_sync->zio_wait->(wait for vdev_file_io_fsync on system_taskq) zvol_prefetch_minors_impl (on system_taskq)->spa_open_common->(wait for spa_namespace_lock) We fix this by using dedicated taskq for vdev_file. This same change was originally made in commit `bc25c93` but reverted in commit `aa9af22` when dynamic taskqs were added. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Closes #5506 Closes #5495	2016-12-21 10:47:15 -08:00
LOLi	5f1346c299	Fix dsl_props_set_sync_impl to work with nested nvlist When iterating over the input nvlist in dsl_props_set_sync_impl() we don't preserve the nvpair name before looking up ZPROP_VALUE, so when we later go to process it nvpair_name() is always "value" and not the actual property name. This fixes a couple of bugs in zfs_ioc_recv(): * Received properties were not restored correctly when failing to receive an incremental send stream * Received properties were not completely replaced by the new ones when successfully receiving an incremental send stream Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #5497	2016-12-20 18:46:59 -08:00
Brian Behlendorf	a3823f428d	Fix file attributes This branch contains the following fixes/improvements. * Fix setting i_flags * Fix wrong operator in xvattr.h * Fix fchange macro in zpl_ioctl_setflags() * Added configure check to use inode_set_flags() * Added a test case for chattr for better test coverage Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #5486 Closes #5470 Closes #5469	2016-12-19 13:01:10 -08:00
Chunwei Chen	6c01a4af2b	Fix zmo leak when zfs_sb_create fails zfs_sb_create would normally take ownership of zmo, and it will be freed in zfs_sb_free. However, when zfs_sb_create fails we need to explicitly free it. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #5490 Closes #5496	2016-12-19 09:46:29 -08:00
Chunwei Chen	a5248129b8	Use inode_set_flags when available Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2016-12-16 13:54:51 -08:00
Chunwei Chen	c360af5411	Fix fchange in zpl_ioctl_setflags The fchange in zpl_ioctl_setflags was for detecting flag change. However it was incorrect and would always fail to detect a flag change from set to unset, causing users without CAP_LINUX_IMMUTABLE to be able to unset flags. Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2016-12-16 12:46:46 -08:00
Gvozden Neskovic	0101796290	ABD: Adapt avx512bw raidz assembly Adapt avx512bw implementation for use with abd buffers. Mul2 implementation is rewritten to take advantage of the BW instruction set. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Romain Dolbeau <romain.dolbeau@atos.net> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Closes #5477	2016-12-15 17:31:33 -08:00
Chunwei Chen	7bb1325f95	Fix i_flags issue caused by `64c688d` Fix zfs_xvattr_set to set S_IMMUTABLE and S_APPEND flags correctly. Reinstate zfs_set_inode_flags and use it in zfs_xvattr_set and also when setting up the inode in zfs_znode_alloc and zfs_rezget. Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2016-12-14 14:48:09 -08:00
Chunwei Chen	f2d8bdc62e	Add ida_destroy in zvol_fini to fix memleak User of ida needs to call ida_destroy after using it. Otherwise ida->free_bitmap and/or other stuff may leak. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #5484	2016-12-14 09:41:39 -08:00
bunder2015	f974e25dc1	Fix typos in dbuf.c This removes two large whitespaces in "modinfo zfs" as well as correcting a couple typos. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: bunder2015 <omfgbunder@gmail.com> Closes #5475	2016-12-13 14:21:02 -08:00
Brian Behlendorf	02730c333c	Use cstyle -cpP in `make cstyle` check Enable picky cstyle checks and resolve the new warnings. The vast majority of the changes needed were to handle minor issues with whitespace formatting. This patch contains no functional changes. Non-whitespace changes are as follows: * 8 times ; to { } in for/while loop * fix missing ; in cmd/zed/agents/zfs_diagnosis.c * comment (confim -> confirm) * change endline , to ; in cmd/zpool/zpool_main.c * a number of /* BEGIN CSTYLED */ /* END CSTYLED */ blocks * /* CSTYLED */ markers * change == 0 to ! * ulong to unsigned long in module/zfs/dsl_scan.c * rearrangement of module_param lines in module/zfs/metaslab.c * add { } block around statement after for_each_online_node Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Håkan Johansson <f96hajo@chalmers.se> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5465	2016-12-12 10:46:26 -08:00
Chunwei Chen	a806cb6a89	Don't count '@' for dataset namelen if not a snapshot Don't count '@' for dataset namelen if not a snapshot. This fixes making a pool unimportable when the dataset namelen is 255. Add test file for zfs create name length 255. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #5432 Closes #5456	2016-12-09 11:52:08 -07:00
luozhengzheng	c077090a9b	Fix coverity defects: CID 154617 CID 154617: Memory - illegal accesses (UNINIT) The value here just needs to be initialized to make Coverity happy. When dsize == 0, then value of daiter.iter_mapaddr is irrelevant. That address won't be accessed, it's only used for some arithmetic. dsize can be zero either if dabd is null, or if code column is longer than the current data column. Reviewed-by: Gvozden Neskovic <neskovic@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Closes #5437	2016-12-08 14:48:09 -07:00
Brian Behlendorf	f95e647891	Speed up zvol import and export speed Speed up import and export speed by: * Add system delay taskq * Parallel prefetch zvol dnodes during zvol_create_minors * Parallel zvol_free during zvol_remove_minors * Reduce list linear search using ida and hash Reviewed-by: Boris Protopopov <boris.protopopov@actifio.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #5433	2016-12-08 14:05:02 -07:00
Brian Behlendorf	27f2b90d3e	Revert "Disable zio_dva_throttle_enabled by default" Enable zio_dva_throttle_enabled=1 by default. Subsequent testing has been unable to reproduce the suspected regression. Tested-by: kernelOfTruth <kerneloftruth@gmail.com> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Reverts #5335 Closes #5289 Closes #5457	2016-12-08 13:57:42 -07:00
Gvozden Neskovic	e8a2014436	Cache ddt_get_dedup_dspace() value if there was no ddt changes Save and reuse ddt dspace calculation when there have been no ddt changes. This avoids unnecessary traversal of 168KiB of ddt histograms. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Closes #5425	2016-12-02 16:59:35 -07:00
Brian Behlendorf	baf67d15a5	Refactor txg history kstat It was observed that even when the txg history is disabled by setting `zfs_txg_history=0` the txg_sync thread still fetches the vdev stats unnecessarily. This patch refactors the code such that vdev_get_stats() is no longer called when `zfs_txg_history=0`. And it further reduces the differences between upstream and the ZoL txg_sync_thread() function. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5412	2016-12-02 16:57:49 -07:00
Chunwei Chen	899662e344	zvol_remove_minors do parallel zvol_free On some kernel versions, blk_cleanup_queue and put_disk will wait for more than 10ms. So a pool with a lot of zvols will easily wait for more than 1 min if we do zvol_free sequentially. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Requires-spl: refs/pull/588/head	2016-12-02 11:13:35 -08:00
Chunwei Chen	7ac557cef1	zpool_create_minors parallel prefetch Do parallel prefetch all zvol dnodes before actually creating each individual. This will greatly reduce the import time when having a lot of zvols and disk is slow. Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2016-12-02 11:13:35 -08:00
Brian Behlendorf	b0319c1faa	OpenZFS 7143 - dbuf_read() creates unnecessary zio_root() for bonus buf dbuf_read() creates a zio_root() to track and wait for all the zio's that may happen as part of this call. However, if the blkptr_t for this buffer is NULL or a hole, we will not create any more zio's, so this zio_root() is unnecessary. This is always the case when calling dbuf_read() on a bonus buffer, because it has no blkptr (it's part of the containing dnode). For workloads that read a lot of bonus buffers (e.g. file creation and removal), creating and destroying these unnecessary zio's can decrease performance by around 3%. The fix is to only create/destroy the zio_root() in dbuf_read() if the blkptr is not NULL and not a hole. Changes sponsored by Intel Corp. Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Alex Zhuravlev <alexey.zhuravlev@intel.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue openzfs/openzfs#137 Closes #4803 Closes #5382	2016-12-01 16:50:11 -07:00
luozhengzheng	ba712624d6	Fix incorrect operator in abd_alloc_sametype() This should be & and not | so is_metadata is set correctly. Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Closes #5438	2016-12-01 16:45:16 -07:00
cao	e2c7d3785a	Remove unused sa_update_from_cb() It looks like this was functionality which was added in the original SA implementation and then never needed. It can be safely removed now and easily added back if we find a use for it. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5440	2016-12-01 16:39:06 -07:00
cao	6a8fd57fa7	Compile zio.h and zio_impl.h mutual include zio.h includes zio_impl.h but zio_impl.h also includes zio.h, so the header files include each other. Get rid of the zio_impl.h include in zio.h and update zio_inject.c to include zio.h instead of zio_impl.h. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5439	2016-12-01 16:36:25 -07:00
Chunwei Chen	d45e010dce	zvol: reduce linear list search Use kernel ida to generate minor number, and use hash table to find zvol with name. Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2016-12-01 14:53:11 -08:00
Chunwei Chen	57ddcda164	Use system_delay_taskq for long delay tasks Use it for spa_deadman, zpl_posix_acl_free, snapentry_expire. This frees system_taskq from the above long delay tasks, and allows us to do taskq_wait_outstanding on system_taskq without being blocked forever, making system_taskq more generic and useful. Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2016-12-01 14:52:48 -08:00
Brian Behlendorf	a3fd9d9e15	Convert zio_buf_alloc() consumers In multiple cases zio_buf_alloc() was used instead of kmem_alloc() or vmem_alloc(). This was often done because the allocations could be large and it was easy to use zio_buf_alloc() for them. But this isn't ideal for allocations which are small or short lived. In these cases it is better to use kmem_alloc() or vmem_alloc(). If possible we want to avoid the case where we have slabs allocated for kmem caches which are rarely used. Note for small allocations vmem_alloc() will be internally converted to kmem_alloc(). Therefore as long as large allocations are infrequent and short lived the penalty for using vmem_alloc() is small. Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5409	2016-11-30 16:18:20 -07:00
Chunwei Chen	9829574834	ABD optimized page allocation code * Convert ABD to use the Linux Kernel scatterlist implementation instead of the hand rolled one from illumos. * Scatter ABDs are preferentially populated with higher order compound pages from a single zone. Allocation size is progressively decreased until it can be satisfied without performing reclaim or compaction. * An alternate page allocator is provided for kernels older than 3.6 and for CONFIG_HIGHMEM systems. This allocator is designed as a fallback for maximum compatibility. * Extended abdstats to provide visibility into the allocator. * Add cached value for PAGESIZE in userspace. Contributions-by: Chunwei Chen <david.chen@osnexus.com> Gvozden Neskovic <neskovic@gmail.com> Jinshan Xiong <jinshan.xiong@intel.com> Isaac Huang <he.huang@intel.com> David Quigley <david.quigley@intel.com> Brian Behlendorf <behlendorf1@llnl.gov>	2016-11-29 14:34:33 -08:00
Chunwei Chen	4f60152910	ABD kmap to kmap_atomic Convert usage of kmap to kmap_atomic while correctly saving off irq state.	2016-11-29 14:34:33 -08:00
Romain Dolbeau	88cc2352ea	ABD raidz NEON support Port NEON implementation of RAID-Z functions to ABD. Signed-off-by: Romain Dolbeau <romain.dolbeau@atos.net>	2016-11-29 14:34:33 -08:00
Gvozden Neskovic	65d71d4212	ABD raidz avx512f support Implement shift based multiplication for 512f. Higher IPC over lookup based methods yields up to 40% better performance on the current hardware. Results on Xeon Phi(TM) CPU 7210: implementation gen_p gen_pq gen_pqr rec_p rec_q rec_r rec_pq rec_pr rec_qr rec_pqr original 142232671 24411492 12948205 283053705 22348167 4215911 9171609 2265548 2378370 1648495 scalar 295711162 49851491 33253815 293198109 88179448 61866752 27941684 25764416 17384442 12138153 sse2 410055998 199642658 117973654 406240463 152688682 121092250 84968180 79291076 47473657 20779719 ssse3 411641595 199669571 117937647 406211024 137638508 117050346 81263322 76120405 46281559 32696722 avx2 616485806 311515332 188595628 605455115 260602390 230554476 148198817 138800254 92273356 62937819 avx512f 832191523 408509425 253599522 810094481 404325734 317590971 218235687 197204920 133101937 94001219 fastest avx512f avx512f avx512f avx512f avx512f avx512f avx512f avx512f avx512f avx512f Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>	2016-11-29 14:34:33 -08:00
Gvozden Neskovic	cbf484f8ad	ABD Vectorized raidz Enable vectorized raidz code on ABD buffers. The avx512f, avx512bw, neon and aarch64_neonx2 are disabled in this commit. With the exception of avx512bw these implementations are updated for ABD in the subsequent commits. Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>	2016-11-29 14:34:33 -08:00
Gvozden Neskovic	a206522c4f	ABD changes for vectorized RAIDZ * userspace: aligned buffers. Minimum of 32B alignment is needed for AVX2. Kernel buffers are aligned 512B or more. * add abd_get_offset_size() interface * abd_iter_map(): fix calculation of iter_mapsize * add abd_raidz_gen_iterate() and abd_raidz_rec_iterate() Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>	2016-11-29 14:34:33 -08:00
Isaac Huang	b0be93e81a	ABD page support to vdev_disk.c Signed-off-by: Isaac Huang <he.huang@intel.com>	2016-11-29 14:34:32 -08:00
David Quigley	a6255b7fce	DLPX-44812 integrate EP-220 large memory scalability	2016-11-29 14:34:27 -08:00
DeHackEd	7ca25051b6	Kernel 4.9 compat: file_operations->aio_fsync removal Linux kernel commit 723c038475b78 removed this field. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: DHE <git@dehacked.net> Closes #5393	2016-11-15 09:20:46 -08:00
luozhengzheng	32dec7bd1a	Fix coverity defects: CID 147503 CID 147503: Dereference after null check (FORWARD_NULL) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Closes #5326	2016-11-10 08:50:32 -08:00
cao	3bfd95d589	Fix coverity defects: CID 147540, 147542 CID 147540: unsigned_compare - Cast nsec to a int32_t to properly detect the expected overflow. CID 147542: unsigned_compare - intval can never be less than ZIO_FAILURE_MODE_WAIT which is defined to be zero. Remove this useless check. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5379	2016-11-09 17:35:26 -08:00
jxiong	126ae9f4e9	Export symbol dmu_objset_userobjspace_upgradable It's used by Lustre to determine if the objset can be upgraded. The inline version doesn't work because dmu_objset_is_snapshot() is not exported. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com> Closes #5385	2016-11-09 13:51:12 -08:00
tuxoko	0420c126ce	Linux 3.14 compat: assign inode->set_acl Linux 3.14 introduces inode->set_acl(). Normally, acl modification will come from setxattr, which will be handled by the acl xattr_handler, and we already handle that well. However, nfsd will directly call inode->set_acl or return an error if it doesn't exist. Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Massimo Maggi <me@massimo-maggi.eu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #5371 Closes #5375	2016-11-09 10:37:17 -08:00
cao	a36cc8d242	Fix coverity defects: CID 147626, 147628 CID 147626: Type:Dereference before null check CID 147628: Type:Dereference before null check Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5304	2016-11-08 14:28:17 -08:00
Don Brady	976246fadd	Add illumos FMD ZFS logic to ZED -- phase 2 The phase 2 work primarily entails the Diagnosis Engine and the Retire Agent modules. It also includes infrastructure to support a crude FMD environment to host these modules. The Diagnosis Engine consumes I/O and checksum ereports and feeds them into a SERD engine which will generate a corresponding fault diagnosis when the SERD engine fires. All the diagnosis state data is collected into cases, one case per vdev being tracked. The Retire Agent responds to diagnosed faults by isolating the faulty VDEV. It will notify the ZFS kernel module of the new VDEV state (degraded or faulted). This agent is also responsible for managing hot spares across pools. When it encounters a device fault or a device removal it replaces the device with an appropriate spare if available. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@intel.com> Closes #5343	2016-11-07 15:01:38 -08:00
cao	f4bae2ed63	Fix coverity defects: CID 147575, 147577, 147578, 147579 CID 147575, Type:Unintentional integer overflow CID 147577, Type:Unintentional integer overflow CID 147578, Type:Unintentional integer overflow CID 147579, Type:Unintentional integer overflow Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5365	2016-11-07 14:54:32 -08:00
Chunwei Chen	8e71ab99dc	Batch free zpl_posix_acl_release Currently every call to zpl_posix_acl_release will schedule a delayed task, and each delayed task will add a timer. This used to be fine except for a possibly bad performance impact. However, in Linux 4.8, a new timer wheel implementation[1] is introduced. In this new implementation, the larger the delay, the less accurate the timer is. So when we have a flood of timers from zpl_posix_acl_release, they will expire at the same time. Coupled with the fact that task_expire will do a linear search with the lock held, this causes an extreme amount of contention inside the interrupt and would actually lock up the system. We fix this by doing batch free to prevent a flood of delayed tasks. Every call to zpl_posix_acl_release will put the posix_acl to be freed on a lockless list. Every batch window, 1 sec, zpl_posix_acl_free will fire up and free every posix_acl on the list that has passed the grace period. This way, we only have one delayed task every second. [1] https://lwn.net/Articles/646950/ Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2016-11-07 11:04:44 -08:00
Brian Behlendorf	34328f3cf8	Allow 16M zio buffers in user space Only restrict the maximum zio alloc size to 32-bit kernel space. The same virtual address space limitations don't apply to user space. This resolves a memory allocation failure in raidz_test where it expects to be able to exercise all valid zio sizes. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-11-07 10:26:17 -08:00
Chunwei Chen	ace1eae84c	Add support for O_TMPFILE Linux 3.11 adds O_TMPFILE to open(2), which allows creating an unlinked file on supported filesystems. It's basically doing open(2) and unlink(2) atomically. The filesystem support is added through i_op->tmpfile. We basically copy the create operation except we get rid of the link and name related stuff and add the new node to the unlinked set. We also add support for linkat(2) to link a tmpfile. However, since all previous file operations will skip ZIL, we force a txg_wait_synced to make sure we are sync safe. Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2016-11-04 10:46:40 -07:00
Chunwei Chen	987014903f	Fix unlinked file cannot do xattr operations Currently, doing things like fsetxattr(2) on an unlinked file will result in ENODATA. There are two places that cause this: zfs_dirent_lock and zfs_zget. The fix in zfs_dirent_lock is pretty straightforward. In zfs_zget though, we need it to not return an error when the zp is unlinked. This is a pretty big change in behavior, but skimming through all the callers, I don't think this change would cause any problem. Also there's nothing preventing z_unlinked from being set after the z_lock mutex is dropped but before zfs_zget returns anyway. The rest of the stuff is to make sure we don't log xattr stuff when the owner is unlinked. Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2016-11-04 10:46:40 -07:00
Romain Dolbeau	7f547f85fe	Add parity generation/rebuild using AVX-512 for x86-64 avx512f should work on all AVX512 hardware, since it only uses Foundation instructions. avx512bw should be faster on hardware supporting the AVX512BW extension. We can use full-width pshufb (instead of relying on the 256 bits AVX2 pshufb). As a side-effect, the code is also unrolled more. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Gvozden Neskovic <neskovic@gmail.com> Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Romain Dolbeau <romain.github@dolbeau.name> Closes #5219	2016-11-02 12:40:23 -07:00
BearBabyLiu	6d4210052b	Fix dsl_prop_get_all_ds() memory leak On error dsl_prop_get_all_ds() does not free the nvlist it allocates. This behavior may have been intentional when originally written but is atypical and often confusing. Since no callers rely on this behavior the function has been updated to always free the nvlist on error. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: BearBabyLiu <liu.huang@zte.com.cn> Closes #5320	2016-11-02 12:34:10 -07:00
Brian Behlendorf	9edb36954a	Use vmem_size() for 32-bit systems On 32-bit Linux systems use vmem_size() to correctly size the ARC and better determine when IO should be throttled due to low memory. On 64-bit systems this change has no effect since the virtual address space available far exceeds the physical memory available. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #5347	2016-11-02 12:14:45 -07:00
Brian Behlendorf	82ec9d41d8	Fix 32-bit maximum volume size A limit of 1TB exists for zvols on 32-bit systems. Update the code to correctly reflect this limitation in a similar manner as the OpenZFS implementation. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #5347	2016-11-02 12:14:45 -07:00
Brian Behlendorf	4990e576c6	Enable .zfs/snapshot for 32-bit systems Originally the .zfs/snapshot directory was disabled for 32-bit systems because 64-bit inode numbers were not supported. This is no longer the case and this functionality can be enabled by default. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #5347 Closes #2002	2016-11-02 12:14:45 -07:00
Brian Behlendorf	48d3eb40c7	Add TASKQID_INVALID Add the TASKQID_INVALID macros and update callers to use the macro instead of testing against 0. There is no functional change even though the functions in zfs_ctldir.c incorrectly used -1 instead of 0. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #5347	2016-11-02 12:14:45 -07:00
cao	51c9163f98	Fix sa_legacy_attr_count to use ARRAY_SIZE Replace magic value 16 with ARRAY_SIZE() to correctly handle when the sa_legacy_attrs array size changes. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5354	2016-11-02 10:26:12 -07:00
cao	981b21260e	Fix coverity defects: CID 147553 CID 147553: Type:Dereference null return value Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5305	2016-11-01 10:20:24 -07:00
cao	b182ac00aa	Fix coverity defects: CID 152975 CID 152975: Type:Dereference null return value Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5322	2016-10-31 16:23:56 -07:00
GeLiXin	4aafab91c5	Fix coverity defects: CID 147509 CID 147509: Explicit null dereferenced - l2arc_sublist_lock is fragile as it relies on the caller too much. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: GeLiXin <ge.lixin@zte.com.cn> Closes #5319	2016-10-31 16:04:01 -07:00
Hajo Möller	e02aaf17f1	Fix lookup_bdev() on Ubuntu Ubuntu added support for checking inode permissions to lookup_bdev() in kernel commit 193fb6a2c94fab8eb8ce70a5da4d21c7d4023bee (merged in 4.4.0-6.21). Upstream bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1636517 This patch adds a test for Ubuntu's variant of lookup_bdev() to configure and calls the function in the correct way. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Hajo Möller <dasjoe@gmail.com> Closes #5336	2016-10-26 10:30:43 -07:00
Brian Behlendorf	76a87a902e	Disable zio_dva_throttle_enabled by default Until it can be determined definitively that a performance regression wasn't introduced accidentally by `3dfb57a` this functionality is being disabled by default. It can be re-enabled by setting zio_dva_throttle_enabled=1. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5335 Issue #5289	2016-10-26 09:13:43 -07:00
Tony Hutter	6568379eea	Fix statechange-led.sh & unnecessary libdevmapper warning - Fix autoreplace behaviour on statechange-led.sh script. ZED sends the following events on an auto-replace: 1. statechange: Disk goes UNAVAIL->ONLINE 2. statechange: Disk goes ONLINE->UNAVAIL 3. vdev_attach: Disk goes ONLINE Events 1-2 happen when ZED first attempts to do an auto-online. When that fails, ZED then tries an auto-replace, generating the vdev_attach event in #3. In the previous code, statechange-led was only looking at the UNAVAIL->ONLINE transition to turn off the LED. It ignored the #2 ONLINE->UNAVAIL transition, assuming it was just the "old" VDEV going offline. This is problematic, as a drive can go from ONLINE->UNAVAIL when it's malfunctioning, and we don't want to ignore that. This new patch correctly turns on the fault LED every time a drive becomes UNAVAIL. It also monitors vdev_attach events to trigger turning off the LED when an auto-replaced disk comes online. - Remove unnecessary libdevmapper warning with --with-config=kernel This fixes an unnecessary libdevmapper warning when building --with-config=kernel. Kernel code does not use libdevmapper, so the warning is not needed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #2375 Closes #5312 Closes #5331	2016-10-25 11:05:30 -07:00
tuxoko	9fa4db44b7	Fix cred leak in zpl_fallocate_common This is caught by kmemleak when running compress_004_pos Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #5244 Closes #5330	2016-10-24 16:41:56 -07:00
Brian Behlendorf	13d9a004fe	Fix taskq creation failure in vdev_open_children() When creating and destroying pools in a tight loop it's possible to exhaust the number of allowed threads on a system. This results in taskq_create() failing and a NULL dereference. Resolve the issue by falling back to opening the vdevs all synchronously. Reviewed-by: Denys Rtveliashvili <denys@rtveliashvili.name> Reviewed-by: Håkan Johansson <f96hajo@chalmers.se> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes zfsonlinux/spl#521 Closes #4637	2016-10-24 13:28:58 -07:00
Tony Hutter	1bbd877049	Turn on/off enclosure slot fault LED even when disk isn't present Previously when a drive faulted, the statechange-led.sh script would look up the drive's LED sysfs entry in /sys/block/sd/device/enclosure_device, and turn it on. During testing we noticed that if you pulled out a drive, or if the drive was so badly broken that it no longer appeared to Linux, the /sys/block/sd path would be removed, and the script could not look up the LED entry. To fix this, this patch looks up the disk's more persistent "/sys/class/enclosure/X:X:X:X/Slot N" LED sysfs path at pool import. It then passes that path to the statechange-led script to use, rather than having the script look it up on the fly. This allows the script to turn on/off the slot LEDs even when the drive is missing. Closes #5309 Closes #2375	2016-10-24 10:45:59 -07:00
Brian Behlendorf	e4ffa98dca	Fix userquota_compare() function The AVL tree compare function requires that either -1, 0, or 1 be returned. However the strcmp() function only guarantees that a negative, zero, or positive value is returned. Therefore, the return value of strcmp() needs to be sanitized with AVL_ISIGN. This was initially overlooked because the x86_64 implementation of strcmp() happens to only return the allowed values. This was observed on an aarch64 platform which behaves correctly but differently as described above. Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5311 Closes #5313	2016-10-21 08:23:27 -07:00
luozhengzheng	9523b15ac1	Fix coverity defects: CID 153459 CID 153459: Null pointer dereferences (FORWARD_NULL) Accidentally introduced by #5159. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Closes #5310	2016-10-20 11:54:02 -07:00
cao	9d01680430	Fix coverity defects: CID 147551, 147552 CID 147551: Type:dereference null return value CID 147552: Type:dereference null return value Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5279	2016-10-20 11:49:50 -07:00
cao	5a6765cf8c	Fix coverity defects: CID 147472 CID 147472: Type: 'Constant' variable guards dead code Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5288	2016-10-20 11:24:01 -07:00
Brian Behlendorf	3b0ba3ba99	Linux 4.9 compat: inode_change_ok() renamed setattr_prepare() In torvalds/linux@31051c8 the inode_change_ok() function was renamed setattr_prepare() and updated to take a dentry rather than an inode. Update the code to call setattr_prepare() and add a wrapper function which calls inode_change_ok() for older kernels. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Requires-spl: refs/pull/581/head	2016-10-20 09:39:09 -07:00
Chunwei Chen	0fedeedd30	Linux 4.9 compat: remove iops->{set,get,remove}xattr In Linux 4.9, torvalds/linux@fd50eca, iops->{set,get,remove}xattr and generic_{set,get,remove}xattr are removed. xattr operations will directly go through sb->s_xattr. Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2016-10-20 09:39:09 -07:00
Chunwei Chen	b8d9e26440	Linux 4.9 compat: iops->rename() wants flags In Linux 4.9, torvalds/linux@2773bf0, iops->rename() and iops->rename2() are merged together into iops->rename(), it now wants flags. Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2016-10-20 09:39:09 -07:00
Chunwei Chen	8ba3f2bf6a	Remove dir inode operations from zpl_inode_operations These operations are dir specific, there's no point putting them in zpl_inode_operations which is for regular files. Signed-off-by: Chunwei Chen <david.chen@osnexus.com>	2016-10-20 09:39:09 -07:00
Tony Hutter	6078881aa1	Multipath autoreplace, control enclosure LEDs, event rate limiting 1. Enable multipath autoreplace support for FMA. This extends FMA autoreplace to work with multipath disks. This requires libdevmapper to be installed at build time. 2. Turn on/off fault LEDs when VDEVs become degraded/faulted/online Set ZED_USE_ENCLOSURE_LEDS=1 in zed.rc to have ZED turn on/off the enclosure LED for a drive when a drive becomes FAULTED/DEGRADED. Your enclosure must be supported by the Linux SES driver for this to work. The enclosure LED scripts work for multipath devices as well. The scripts will clear the LED when the fault is cleared. 3. Rate limit ZIO delay and checksum events so as not to flood ZED ZIO delay and checksum events are rate limited to 5/sec in the zfs module. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed by: Don Brady <don.brady@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #2449 Closes #3017 Closes #5159	2016-10-19 12:55:59 -07:00
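The event rate limiting in item 3 can be illustrated with a fixed-window counter. This is a minimal sketch under stated assumptions, not the zfs module's code: the `RateLimiter` class, its parameters, and the one-second fixed window are hypothetical choices made for illustration, and the module's actual mechanism may differ. The caller passes the current time so the behavior is deterministic.

```python
class RateLimiter:
    """Allow at most max_per_window events per window (hypothetical sketch)."""

    def __init__(self, max_per_window: int, window_seconds: float = 1.0):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self.window_start = None  # start time of the current window
        self.count = 0            # events allowed in the current window

    def allow(self, now: float) -> bool:
        """Return True if an event at time `now` may be posted."""
        # Open a new window on first use or once the old one has expired.
        if (self.window_start is None
                or now - self.window_start >= self.window_seconds):
            self.window_start = now
            self.count = 0
        if self.count < self.max_per_window:
            self.count += 1
            return True
        return False  # over the limit: drop the event
```

With `RateLimiter(5)`, the first five events inside one second are allowed and later ones in the same window are dropped; events are allowed again once a new window begins.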
luozhengzheng	7c502b0b1d	Fix coverity defects: CID 150926 CID 150926: Unchecked return value (CHECKED_RETURN) - This case cannot occur given the existing taskq implementation and flags passed to task_dispatch(). Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Closes #5272	2016-10-18 11:32:59 -07:00
Brian Behlendorf	6d00b5e136	Fix unused variable Accidentally introduced by `3dfb57a`, when building with debugging disabled several variables are unused. Resolve this by wrapping them in ASSERTV to remove them for non-debug builds. Reviewed by: Don Brady <don.brady@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5284	2016-10-18 10:44:44 -07:00
cao	1b81ab46d0	Fix coverity defects: CID 49339, 153393 CID 49339: Type:Buffer not null terminated CID 153393: Type:Buffer not null terminated Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5296	2016-10-18 10:31:57 -07:00
luozhengzheng	b60eac3d1a	Fix coverity defects: CID 150924 CID 150924: Unchecked return value (CHECKED_RETURN) - On taskq_dispatch failure the reference must be dropped and this entry can be safely skipped. This case should be impossible in the existing implementation but should be handled regardless. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Closes #5278	2016-10-17 12:03:52 -07:00
cao	b6ca6193f7	Fix coverity defects: CID 147488, 147490 CID 147488, Type:explicit null dereferenced CID 147490, Type:dereference null return value Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5237	2016-10-14 11:00:47 -07:00
Don Brady	3dfb57a35e	OpenZFS 7090 - zfs should throttle allocations OpenZFS 7090 - zfs should throttle allocations Authored by: George Wilson <george.wilson@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Approved by: Matthew Ahrens <mahrens@delphix.com> Ported-by: Don Brady <don.brady@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> When write I/Os are issued, they are issued in block order but the ZIO pipeline will drive them asynchronously through the allocation stage which can result in blocks being allocated out-of-order. It would be nice to preserve as much of the logical order as possible. In addition, the allocations are equally scattered across all top-level VDEVs but not all top-level VDEVs are created equally. The pipeline should be able to detect devices that are more capable of handling allocations and should allocate more blocks to those devices. This allows for dynamic allocation distribution when devices are imbalanced as fuller devices will tend to be slower than empty devices. The change includes a new pool-wide allocation queue which would throttle and order allocations in the ZIO pipeline. The queue would be ordered by issued time and offset and would provide an initial amount of allocation of work to each top-level vdev. The allocation logic utilizes a reservation system to reserve allocations that will be performed by the allocator. Once an allocation is successfully completed it's scheduled on a given top-level vdev. Each top-level vdev maintains a maximum number of allocations that it can handle (mg_alloc_queue_depth). 
The pool-wide reserved allocations (top-levels * mg_alloc_queue_depth) are distributed across the top-level vdevs metaslab groups and round robin across all eligible metaslab groups to distribute the work. As top-levels complete their work, they receive additional work from the pool-wide allocation queue until the allocation queue is emptied. OpenZFS-issue: https://www.illumos.org/issues/7090 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4756c3d7 Closes #5258 Porting Notes: - Maintained minimal stack in zio_done - Preserve linux-specific io sizes in zio_write_compress - Added module params and documentation - Updated to use optimize AVL cmp macros	2016-10-13 17:59:18 -07:00
cao	3f93077b02	Fix coverity defects: CID 150943, 150938 CID:150943, Type:Unintentional integer overflow CID:150938, Type:Explicit null dereferenced Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5255	2016-10-13 14:30:50 -07:00
luozhengzheng	05852b3467	Fix coverity defects: CID 147571, 147574 CID 147571: Unintentional integer overflow (OVERFLOW_BEFORE_WIDEN) CID 147574: Unintentional integer overflow (OVERFLOW_BEFORE_WIDEN) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Closes #5268	2016-10-13 14:25:05 -07:00
luozhengzheng	1f51b525ff	Fix coverity defects: CID 153394 coverity scan CID 153394, Type:String overflow Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Closes #5263	2016-10-12 13:24:03 -07:00
Brian Behlendorf	1697d2dcf1	Fix zfsctl_snapshot_{,un}mount() issues Fix use after free in zfsctl_snapshot_unmount(). Use /usr/bin/env instead of /bin/sh to fix a shell code injection flaw and allow use with grsecurity. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Stian Ellingsen <stian@plaimi.net> Closes #5250 Closes #4377	2016-10-11 09:56:28 -07:00
Tim Chase	d33931a83a	Write issue taskq shouldn't be dynamic This is as much an upstream compatibility as it's a bit of a performance gain. The illumos taskq implementation doesn't allow a TASKQ_THREADS_CPU_PCT type to be dynamic and in fact enforces as much with an ASSERT. As to performance, if this taskq is dynamic, it can cause excessive contention on tq_lock as the threads are created and destroyed because it can see bursts of many thousands of tasks in a short time, particularly in heavy high-concurrency zvol write workloads. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #5236	2016-10-10 15:19:14 -07:00
Brian Behlendorf	7515f8f63d	Fix file permissions The following new test cases need to have execute permissions set: userquota/groupspace_003_pos.ksh userquota/userquota_013_pos.ksh userquota/userspace_003_pos.ksh upgrade/upgrade_userobj_001_pos.ksh upgrade/setup.ksh upgrade/cleanup.ksh The following source files accidentally were marked executable: lib/libzpool/kernel.c lib/libshare/nfs.c lib/libzfs/libzfs_dataset.c lib/libzfs/libzfs_util.c tests/zfs-tests/cmd/rm_lnkcnt_zero_file/rm_lnkcnt_zero_file.c tests/zfs-tests/cmd/dir_rd_update/dir_rd_update.c cmd/zed/zed_exec.c module/icp/core/kcf_sched.c module/zfs/dsl_pool.c module/zfs/arc.c module/nvpair/nvpair.c man/man5/zfs-module-parameters.5 Reviewed-by: GeLiXin <ge.lixin@zte.com.cn> Reviewed-by: Andreas Dilger <andreas.dilger@intel.com> Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5241	2016-10-08 14:57:56 -07:00
Stian Ellingsen	5dc1ff29ec	Use env, not sh in zfsctl_snapshot_{,un}mount() Call mount and umount via /usr/bin/env instead of /bin/sh in zfsctl_snapshot_mount() and zfsctl_snapshot_unmount(). This change fixes a shell code injection flaw. The call to /bin/sh passed the mountpoint unescaped, only surrounded by single quotes. A mountpoint containing one or more single quotes would cause the command to fail or potentially execute arbitrary shell code. This change also provides compatibility with grsecurity patches. Grsecurity only allows call_usermodehelper() to use helper binaries in certain paths. /usr/bin/* is allowed, /bin/* is not.	2016-10-08 17:43:29 +02:00
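The quoting flaw described above can be demonstrated without running anything. This is a hedged Python sketch, not the kernel code: the command template, the helper names, and the example mountpoints are hypothetical, and `shlex.split` only approximates how a POSIX shell splits a string into words.

```python
import shlex


def build_shell_command(snapshot: str, mountpoint: str) -> str:
    """Vulnerable pattern: values are wrapped in single quotes but not escaped."""
    return f"mount -t zfs -n '{snapshot}' '{mountpoint}'"


def build_argv(snapshot: str, mountpoint: str) -> list:
    """Safer pattern: no shell parses the values; each is one argv element."""
    return ["/usr/bin/env", "mount", "-t", "zfs", "-n", snapshot, mountpoint]


def shell_words(command: str) -> list:
    """Approximate the word splitting a POSIX shell would perform."""
    return shlex.split(command)
```

A mountpoint such as `/mnt/x'; touch /tmp/pwned; '` closes the quote early, so the string form no longer carries the mountpoint as a single word and `touch` appears as a word of its own. A mountpoint with one stray quote, such as `/mnt/it's`, leaves the quoting unbalanced and the command fails to parse. In the argv form the mountpoint stays a single, unmodified argument in both cases.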
Stian Ellingsen	00b65db711	Fix use after free in zfsctl_snapshot_unmount()	2016-10-08 17:42:52 +02:00
Brian Behlendorf	690fe6479e	Rename hole_birth tunable to match OpenZFS OpenZFS decided that ignore_hole_birth was too imprecise and incorrect a name (and went with send_holes_without_birth_time). Rename it in ZoL too, while keeping the name "ignore_hole_birth" pointing to the same variable for existing consumers. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #5239	2016-10-07 21:02:24 -07:00
Håkan Johansson	4770aa0643	Fix vdev_open_child() race on updating vdev_parent->vdev_nonrot Updating vd->vdev_parent->vdev_nonrot in vdev_open_child() is a race when vdev_open_child is called for many children from a task queue. vdev_open_child() is only called by vdev_open_children(), let the latter update the parent vdev_nonrot member. The update was already there, so done twice previously. Thus using the same logic at the end in vdev_open_children() to update vdev_nonrot, either we are vdev_uses_zvols() or not. Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Haakan T Johansson <f96hajo@chalmers.se> Closes #5162	2016-10-07 13:25:35 -07:00
cao	ccc92611b1	Fix coverity defects: CID 147565-147567 coverity scan CID:147567, Type:dereference null return value coverity scan CID:147566, Type:dereference null return value coverity scan CID:147565, Type:dereference null return value Reviewed by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5166	2016-10-07 13:19:43 -07:00
Jinshan Xiong	9b7a83cbb6	OpenZFS 6988 spa_sync() spends half its time in dmu_objset_do_userquota_updates Using a benchmark which creates 2 million files in one TXG, I observe that the thread running spa_sync() is on CPU almost the entire time we are syncing, and therefore can be a performance bottleneck. About 50% of the time in spa_sync() is in dmu_objset_do_userquota_updates(). The problem is that dmu_objset_do_userquota_updates() calls zap_increment_int(DMU_USERUSED_OBJECT) once for every file that was modified (or created). In this benchmark, all the files are owned by the same user/group, so all 2 million calls to zap_increment_int() are modifying the same entry in the zap. The same issue exists for the DMU_GROUPUSED_OBJECT. We should keep an in-memory map from user to space delta while we are syncing, and when we finish, iterate over the in-memory map and modify the ZAP once per entry. This reduces the number of calls to zap_increment_int() from "number of objects modified" to "number of owners/groups of modified files". This reduced the time spent in spa_sync() in the file create benchmark by ~33%, from 11 seconds to 7 seconds. Upstream bugs: DLPX-44799 Ported by: Ned Bass <bass6@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6988 ZFSonLinux-issue: https://github.com/zfsonlinux/zfs/issues/4642 OpenZFS-commit: unmerged Porting notes: - Added curly braces around declaration of userquota_cache_t cache to quiet compiler warning; - Handled the userobj accounting the same way it proposed in this path. Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>	2016-10-07 09:45:13 -07:00
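The in-memory delta map described above can be sketched as follows. This is an illustrative Python sketch, not the ported C code: `CountingZap`, `update_per_object`, and `update_with_cache` are hypothetical names, and the "ZAP" here is only a dictionary that counts how often it is incremented.

```python
from collections import defaultdict


class CountingZap:
    """Toy stand-in for a ZAP object that counts increment calls."""

    def __init__(self):
        self.entries = defaultdict(int)
        self.increment_calls = 0

    def increment(self, key, delta):
        self.increment_calls += 1
        self.entries[key] += delta


def update_per_object(zap, modified):
    """Old behavior: one ZAP increment per modified object."""
    for owner, delta in modified:
        zap.increment(owner, delta)


def update_with_cache(zap, modified):
    """New behavior: accumulate deltas in memory, then one increment per owner."""
    cache = defaultdict(int)
    for owner, delta in modified:
        cache[owner] += delta
    for owner, total in cache.items():
        zap.increment(owner, total)
```

For 1010 modified objects owned by two users, the first function performs 1010 increments and the second performs 2, while both leave the same totals in the map.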
Jinshan Xiong	1de321e626	Add support for user/group dnode accounting & quota This patch tracks dnode usage for each user/group in the DMU_USER/GROUPUSED_OBJECT ZAPs. ZAP entries dedicated to dnode accounting have the key prefixed with "obj-" followed by the UID/GID in string format (as done for the block accounting). A new SPA feature has been added for dnode accounting as well as a new ZPL version. The SPA feature must be enabled in the pool before upgrading the zfs filesystem. During the zfs version upgrade, a "quotacheck" will be executed by marking all dnode as dirty. ZoL-bug-id: https://github.com/zfsonlinux/zfs/issues/3500 Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com> Signed-off-by: Johann Lombardi <johann.lombardi@intel.com>	2016-10-07 09:45:13 -07:00
lorddoskias	64c688d716	Refactor updating of immutable/appendonly flags Move the synchronization of inode/znode i_flags/pflags into the respective internal zfs function. This is mostly mechanical work and shouldn't introduce any functional changes. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Issue #227 Closes #5223	2016-10-05 14:47:29 -07:00
ilovezfs	125a406e24	OpenZFS 6585 - sha512, skein, and edonr have an unenforced dependency on extensible dataset Authored by: ilovezfs <ilovezfs@icloud.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Richard Laager <rlaager@wiktel.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported by: Tony Hutter <hutter2@llnl.gov> In any pool without the extensible dataset feature flag already enabled, creating a dataset with dedup set to use one of the new checksums would result in the following panic as soon as any data was added: panic[cpu0]/thread=ffffff0006761c40: feature_get_refcount(spa, feature, &refcount) != 48 (0x30 != 0x30), file: ../../common/fs/zfs/zfeature.c line 390 Inspection showed that feature->fi_feature was 7, which is the value of SPA_FEATURE_EXTENSIBLE_DATASET in the spa_feature enum. This commit adds extensible dataset as a dependency for the sha512, edonr, and skein feature flags, which prevents the panic. OpenZFS-issue: https://www.illumos.org/issues/6585 OpenZFS-commit: `892586e8a1` Porting Notes: This code was originally from Illumos, but I actually ported it from: openzfsonosx/zfs@b62a652	2016-10-03 14:51:21 -07:00
ilovezfs	4a2e9a17d5	OpenZFS 6541 - Pool feature-flag check defeated if "verify" is included in the dedup property value Authored by: ilovezfs <ilovezfs@icloud.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Richard Laager <rlaager@wiktel.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Tony Hutter <hutter2@llnl.gov> zio_checksum_to_feature() expects a zio_checksum enum not a raw property intval, so the new checksums weren't being detected when the ZIO_CHECKSUM_VERIFY flag got in the way. Given a pool without feature@sha512, zfs create -o dedup=sha512 naughty/fivetwelve_noverify_ds would fail as expected since the raw intval would indeed be equal to SPA_FEATURE_SHA512. However, zfs create -o dedup=sha512,verify naughty/fivetwelve_verify_ds would incorrectly succeed because ZIO_CHECKSUM_VERIFY would be in the way, the raw intval would not be a member of the enum, and zio_checksum_to_feature() would return SPA_FEATURE_NONE, with the result that spa_feature_is_enabled() would never be called. This was first detected with edonr, since in that case verify is required. This commit clears the ZIO_CHECKSUM_VERIFY flag before calling zio_checksum_to_feature() using the ZIO_CHECKSUM_MASK and verifies in zio_checksum_to_feature() that ZIO_CHECKSUM_MASK has been applied by the caller to attempt to prevent the same bug from occurring again in the future. OpenZFS-issue: https://www.illumos.org/issues/6541 OpenZFS-commit: `971640e6aa` Porting notes: This code was originally from Illumos, but I actually ported it from: openzfsonosx/zfs@bef06e1	2016-10-03 14:51:21 -07:00
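The flag-masking fix described above can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the ported C code: the numeric values of the flag, mask, and checksum constants are illustrative only, and the function names are hypothetical.

```python
# Illustrative values only; the real constants live in the ZFS headers.
CHECKSUM_VERIFY = 1 << 8   # flag bit stored alongside the checksum value
CHECKSUM_MASK = 0xFF       # selects the checksum enum, dropping flag bits
CHECKSUM_SHA256 = 8
CHECKSUM_SHA512 = 11

# Checksums that need a pool feature; anything else maps to None.
FEATURES = {CHECKSUM_SHA512: "sha512"}


def buggy_required_feature(intval):
    """Pre-fix behavior: look up the raw property value, flag bits included."""
    return FEATURES.get(intval)


def checksum_to_feature(checksum):
    """Map a masked checksum enum to a feature, rejecting unmasked input."""
    if (checksum & ~CHECKSUM_MASK) != 0:
        raise ValueError("caller must apply CHECKSUM_MASK first")
    return FEATURES.get(checksum)


def required_feature(intval):
    """Post-fix behavior: clear the verify flag before the lookup."""
    return checksum_to_feature(intval & CHECKSUM_MASK)
```

With the verify flag set, `buggy_required_feature` returns `None`, so the feature check would be skipped, while `required_feature` still reports `"sha512"`. The guard in `checksum_to_feature` mirrors the commit's intent of catching callers that forget to apply the mask.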
Tony Hutter	3c67d83a8a	OpenZFS 4185 - add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Approved by: Garrett D'Amore <garrett@damore.org> Ported by: Tony Hutter <hutter2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/4185 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45818ee Porting Notes: This code is ported on top of the Illumos Crypto Framework code: `b5e030c8db` The list of porting changes includes: - Copied module/icp/include/sha2/sha2.h directly from illumos - Removed from module/icp/algs/sha2/sha2.c: #pragma inline(SHA256Init, SHA384Init, SHA512Init) - Added 'ctx' to lib/libzfs/libzfs_sendrecv.c:zio_checksum_SHA256() since it now takes in an extra parameter. - Added CTASSERT() to assert.h for module/zfs/edonr_zfs.c - Added skein & edonr to libicp/Makefile.am - Added sha512.S. It was generated from sha512-x86_64.pl in Illumos. - Updated ztest.c with new fletcher_4_() args; used NULL for new CTX argument. - In icp/algs/edonr/edonr_byteorder.h, Removed the #if defined(__linux) section to not #include the non-existent endian.h. - In skein_test.c, rename NULL to 0 in "no test vector" array entries to get around a compiler warning. - Fixup test files: - Rename <sys/varargs.h> -> <varargs.h>, <strings.h> -> <string.h>, - Remove <note.h> and define NOTE() as NOP. - Define u_longlong_t - Rename "#!/usr/bin/ksh" -> "#!/bin/ksh -p" - Rename NULL to 0 in "no test vector" array entries to get around a compiler warning. - Remove "for isa in $($ISAINFO); do" stuff - Add/update Makefiles - Add some userspace headers like stdio.h/stdlib.h in places of sys/types.h. - EXPORT_SYMBOL _Init/_Update/_Final... routines in ICP modules. 
- Update scripts/zfs2zol-patch.sed - include <sys/sha2.h> in sha2_impl.h - Add sha2.h to include/sys/Makefile.am - Add skein and edonr dirs to icp Makefile - Add new checksums to zpool_get.cfg - Move checksum switch block from zfs_secpolicy_setprop() to zfs_check_settable() - Fix -Wuninitialized error in edonr_byteorder.h on PPC - Fix stack frame size errors on ARM32 - Don't unroll loops in Skein on 32-bit to save stack space - Add memory barriers in sha2.c on 32-bit to save stack space - Add filetest_001_pos.ksh checksum sanity test - Add option to write pseudorandom data in file_write utility	2016-10-03 14:51:15 -07:00
Romain Dolbeau	62a65a654e	Add parity generation/rebuild using 128-bits NEON for Aarch64 This reuses the framework established for SSE2, SSSE3 and AVX2. However, GCC is using FP registers on Aarch64, so unlike SSE/AVX2 we can't rely on the registers being left alone between ASM statements. So instead, the NEON code uses C variables and GCC extended ASM syntax. Note that since the kernel explicitly disables vector registers, they have to be locally re-enabled explicitly. As we use the variable's number to define the symbolic name, and GCC won't allow duplicate symbolic names, numbers have to be unique, even when the code is not going to be used (e.g. the case for 4 registers when using the macro with only 2). Only the actually used variables should be declared, otherwise the build will fail in debug mode. This requires the replacement of the XOR(X,X) syntax by a new ZERO(X) macro, which does the same thing but without repeating the argument. And perhaps someday there will be a machine where there is a more efficient way to zero a register than XOR with itself. This affects scalar, SSE2, SSSE3 and AVX2 as they need the new macro. It's possible to write faster implementations (different scheduling, different unrolling, interleaving NEON and scalar, ...) for various cores, but this one has the advantage of fitting in the current state of the code, and thus is likely easier to review/check/merge. The only difference between aarch64-neon and aarch64-neonx2 is that aarch64-neonx2 unrolls some functions further. Reviewed-by: Gvozden Neskovic <neskovic@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Romain Dolbeau <romain.dolbeau@atos.net> Closes #4801	2016-10-03 09:44:00 -07:00
luozhengzheng	aecdc70604	Fix coverity defects: CID 147448, 147449, 147450, 147453, 147454 coverity scan CID:147448,type: unchecked return value coverity scan CID:147449,type: unchecked return value coverity scan CID:147450,type: unchecked return value coverity scan CID:147453,type: unchecked return value coverity scan CID:147454,type: unchecked return value Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Closes #5206	2016-10-02 11:24:54 -07:00
cao	0a8f18f932	Fix coverity defects: CID 147563, 147560 coverity scan CID:147563, Type:dereference null return value coverity scan CID:147560, Type:dereference null return value Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5168	2016-09-30 15:56:17 -07:00
Brian Behlendorf	2db28197fe	Fix cppcheck warning in buf_init() Cppcheck 1.63 erroneously complains about an uninitialized value in buf_init(). Newer versions of cppcheck (1.72) handle this correctly but we'll initialize the value anyway to silence the warning. Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5203	2016-09-30 15:04:21 -07:00
Gvozden Neskovic	6ca636a152	Avoid undefined shift overflow in fzap_cursor_retrieve() Avoid calculating (1<<64) if lh_prefix_len == 0. Semantics of the method remain the same. Assert (lh_prefix_len > 0) in zap_expand_leaf() to detect possibly the same problem. Issue #4883 Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>	2016-09-29 15:55:41 -07:00
Gvozden Neskovic	4ca9c1de12	Explicit integer promotion for bit shift operations Explicitly promote variables to correct type. Undefined behavior is reported because length of int is not well defined by C standard. Issue #4883 Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>	2016-09-29 15:55:41 -07:00
Gvozden Neskovic	031d7c2fe6	fix: Shift exponent too large Undefined operation is reported by running ztest (or zloop) compiled with GCC UndefinedBehaviorSanitizer. Error only happens on top level of dnode indirection with large enough offset values. Logically, left shift operation would work, but bit shift semantics in C, and limitation of uint64_t, do not produce desired result. Issue #5059, #4883 Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>	2016-09-29 15:55:41 -07:00
Isaac Huang	e8ac4557af	Explicit block device plugging when submitting multiple BIOs Without plugging, the default 'noop' scheduler will not merge the BIOs which are part of a large ZIO. Reviewed-by: Andreas Dilger <andreas.dilger@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Isaac Huang <he.huang@intel.com> Closes #5181	2016-09-29 13:13:31 -07:00
cao	c9d61adbf8	Fix coverity defects: 147658, 147652, 147651 coverity scan CID:147658, Type:copy into fixed size buffer. coverity scan CID:147652, Type:copy into fixed size buffer. coverity scan CID:147651, Type:copy into fixed size buffer. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5160	2016-09-29 12:06:14 -07:00
lorddoskias	12fa7f3436	Refactor inode->i_mode management Refactor the code in such a way so that inode->i_mode is being set at the same time zp->z_mode is being changed. This has the effect of keeping both in sync without relying on zfs_inode_update. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Closes #5158	2016-09-27 14:08:52 -07:00
cao	680eada9b0	Fix coverity defects: CID 147650, 147649, 147647, 147646 coverity scan CID:147650, Type:copy into fixed size buffer. coverity scan CID:147649, Type:copy into fixed size buffer. coverity scan CID:147647, Type:copy into fixed size buffer. coverity scan CID:147646, Type:copy into fixed size buffer. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5161	2016-09-25 15:08:28 -07:00
Brian Behlendorf	7571033285	Fix multilist_create() memory leak In arc_state_fini() the `arc_l2c_only->arcs_list[*]` multilists must be destroyed. This accidentally regressed in `d3c2ae1c`. Reviewed by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #5151 Closes #5152	2016-09-23 10:55:10 -07:00
tuxoko	d5b897a6a1	Linux 4.7 compat: Fix deadlock during lookup on case-insensitive We must not use d_add_ci if the dentry already has the real name. Otherwise, d_add_ci()->d_alloc_parallel() will find itself on the lookup hash and wait on itself causing deadlock. Tested-by: satmandu Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #5124 Closes #5141 Closes #5147 Closes #5148	2016-09-22 19:09:16 -07:00
kernelOfTruth aka. kOT, Gentoo user	51907a31bc	OpenZFS 7230 - add assertions to dmu_send_impl() to verify that stream includes BEGIN and END records Authored by: Matt Krantz <matt.krantz@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: kernelOfTruth <kerneloftruth@gmail.com> OpenZFS-issue: https://www.illumos.org/issues/7230 OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/12b90ee2 Closes #5112	2016-09-22 16:01:19 -07:00
luozhengzheng	160987b576	Fix coverity defects coverity scan CID:147633,type: sizeof not portable coverity scan CID:147637,type: sizeof not portable coverity scan CID:147638,type: sizeof not portable coverity scan CID:147640,type: sizeof not portable In these particular cases sizeof (XX **) happens to be equal to sizeof (X *), but this is not a portable assumption. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Closes #5144	2016-09-21 18:09:00 -07:00
Isaac Huang	da8d57488b	Reduce noise in tracing logs dbuf_read_impl() returns (SET_ERROR(err)) when err can be 0, which adds lots of noise in tracing logs. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Isaac Huang <he.huang@intel.com> Closes #4430 Closes #5146	2016-09-21 13:37:20 -07:00
BearBabyLiu	609603a5d3	Fix coverity defects coverity scan CID:147504 Type: Explicit null dereferenced Reason: passing null pointer dl to zfs_dirent_unlock Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: BearBabyLiu <liu.huang@zte.com.cn> Closes #5131	2016-09-20 19:09:22 -07:00
Tim Chase	25e2ab16be	Fix arc_adjust_meta_balanced() The type of "adjustmnt" was erroneously changed to unsigned when the compressed ARC code was ported in `d3c2ae1c08`. As a result of it being unsigned, the balanced metadata eviction logic would evict all of the non-metadata. Reviewed-by: Chris Severance <github.severach@spamgourmet.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: David Quigley <david.quigley@intel.com> Signed-off-by: Tim Chase <tim@onlight.com> Closes #5128 Closes #5129	2016-09-19 09:28:35 -07:00
luozhengzheng	30f3f2e13c	Fix Coverity defects CID 147659, 150952 and 147645 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Closes #5103	2016-09-17 15:08:54 -07:00
Brian Behlendorf	9ea9e0b9a1	Enable ignore_hole_birth module option by default Enable ignore_hole_birth by default until all known hole birth bugs have been resolved and relevant test cases added. Reviewed-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4809 Closes #5099	2016-09-16 14:05:30 -07:00
Nikolay Borisov	87f9371aef	Simplify time handling logic in zfs_setattr Simplify time handling in zfs_setattr by mimicking the logic in setattr_copy from the linux kernel. In order to achieve this in the case when ZFS' log is being replayed it is necessary to unconditionally set the ctime in zfs_replay_setattr. Also use the timespec_trunc function when assigning values to the generic inode struct. This is currently a noop since zfs sets s_time_gran to 1, however in the future rules about precision might change. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Closes #4916	2016-09-13 12:00:18 -07:00
Nikolay Borisov	9f5f0019ab	Refactor generic inode time updating ZFS doesn't provide a custom update_time method meaning it delegates this job to the generic VFS layer. The only time when it needs to set the various *time values is when the inode is being marshalled to/from the disk. Do this by moving the relevant code from zfs_inode_update_impl to zfs_node_alloc and zfs_rezget. As a result from this change it is no longer necessary to have multiple versions of the zfs_inode_update function - so just nuke them and leave only one. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Issue #227 Closes #4916	2016-09-13 11:57:37 -07:00
Dan Kimmel	524b4217b8	DLPX-44733 combine arc_buf_alloc_impl() with arc_buf_clone() Authored by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Ported by: David Quigley <david.quigley@intel.com> Issue #5078	2016-09-13 09:59:13 -07:00
Tom Caputi	c17bcf83da	Enable raw writes to perform dedup with verification Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: David Quigley <david.quigley@intel.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Issue #5078	2016-09-13 09:59:04 -07:00
Dan Kimmel	2aa34383b9	DLPX-40252 integrate EP-476 compressed zfs send/receive Authored by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Ported by: David Quigley <david.quigley@intel.com> Issue #5078	2016-09-13 09:58:58 -07:00
George Wilson	d3c2ae1c08	OpenZFS 6950 - ARC should cache compressed data Authored by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Ported by: David Quigley <david.quigley@intel.com> This review covers the reading and writing of compressed arc headers, sharing data between the arc_hdr_t and the arc_buf_t, and the implementation of a new dbuf cache to keep frequently accessed data uncompressed. I've added a new member to l1 arc hdr called b_pdata. The b_pdata always hangs off the arc_buf_hdr_t (if an L1 hdr is in use) and points to the physical block for that DVA. The physical block may or may not be compressed. If compressed arc is enabled and the block on-disk is compressed, then the b_pdata will match the block on-disk and remain compressed in memory. If the block on disk is not compressed, then neither will the b_pdata. Lastly, if compressed arc is disabled, then b_pdata will always be an uncompressed version of the on-disk block. Typically the arc will cache only the arc_buf_hdr_t and will aggressively evict any arc_buf_t's that are no longer referenced. This means that the arc will primarily have compressed blocks as the arc_buf_t's are considered overhead and are always uncompressed. When a consumer reads a block we first look to see if the arc_buf_hdr_t is cached. If the hdr is cached then we allocate a new arc_buf_t and decompress the b_pdata contents into the arc_buf_t's b_data. If the hdr already has an arc_buf_t, then we will allocate an additional arc_buf_t and bcopy the uncompressed contents from the first arc_buf_t to the new one. Writing to the compressed arc requires that we first discard the b_pdata since the physical block is about to be rewritten.
The new data contents will be passed in via an arc_buf_t (uncompressed) and during the I/O pipeline stages we will copy the physical block contents to a newly allocated b_pdata. When an l2arc is in use it will also take advantage of the b_pdata. Now the l2arc will always write the contents of b_pdata to the l2arc. This means that when compressed arc is enabled, the l2arc blocks are identical to those stored in the main data pool. This provides a significant advantage since we can leverage the bp's checksum when reading from the l2arc to determine if the contents are valid. If the compressed arc is disabled, then we must first transform the read block to look like the physical block in the main data pool before comparing the checksum and determining it's valid. OpenZFS-issue: https://www.illumos.org/issues/6950 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7fc10f0 Issue #5078	2016-09-13 09:58:33 -07:00
Tim Chase	43924bfeaa	Remove redundant assignments to arc_c Several assignments to arc_c had no effect because it is ultimately initialized to arc_c_max. This aligns ZoL better with the upstream code which removed these assignments some time ago. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@onlight.com> Closes #5081	2016-09-12 12:40:30 -07:00
Nikolay Borisov	67d6082494	Refactor spa_load_l2cache to make build happy In case sav->sav_config was NULL the body of the function would skip the iteration of the l2 cache devices and will just cleanup the old devices. However, this wasn't very obvious since the null check was performed after the loop body and after the old devices were cleaned. Refactor the code so that it's now obvious when the iteration of the l2cache devices is skipped. This fixes the following cppcheck warning: [module/zfs/spa.c:1552]: (error) Possible null pointer dereference: newvdevs Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Closes #5087	2016-09-12 12:40:03 -07:00
Tim Chase	20aa7a4e31	Free property names with spa_strfree() rather than strfree() Since they're allocated with spa_strdup(), they should be freed with spa_strfree() so the proper length buffer is freed. Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #5082 Closes #5086	2016-09-12 09:45:26 -07:00
Don Brady	d02ca37979	Bring over illumos ZFS FMA logic -- phase 1 This first phase brings over the ZFS SLM module, zfs_mod.c, to handle auto operations in response to disk events. Disk event monitoring is provided from libudev and generates the expected payload schema for zfs_mod. This work leverages the recently added devid and phys_path strings in the vdev label. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@intel.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #4673	2016-09-01 11:39:45 -07:00
luozhengzheng	0b284702b7	Delete unreferenced function zfs_ereport_send_interim_checksum Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5055	2016-09-01 11:39:45 -07:00
luozhengzheng	ca8587a517	kmem_zalloc with KM_SLEEP will never return NULL These allocations can never fail. Leaving the error handling code here gives the impression they can so it has been removed. Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5048	2016-09-01 11:39:45 -07:00
Gvozden Neskovic	ee36c709c3	Performance optimization of AVL tree comparator functions perf: 2.75x faster ddt_entry_compare() First 256bits of ddt_key_t is a block checksum, which are expected to be close to random data. Hence, on average, comparison only needs to look at first few bytes of the keys. To reduce number of conditional jump instructions, the result is computed as: sign(memcmp(k1, k2)). Sign of an integer 'a' can be obtained as: `(0 < a) - (a < 0)` := {-1, 0, 1} , which is computed efficiently. Synthetic performance evaluation of original and new algorithm over 1G random keys on 2.6GHz Intel(R) Xeon(R) CPU E5-2660 v3: old 6.85789 s new 2.49089 s perf: 2.8x faster vdev_queue_offset_compare() and vdev_queue_timestamp_compare() Compute the result directly instead of using conditionals perf: zfs_range_compare() Speedup between 1.1x - 2.5x, depending on compiler version and optimization level. perf: spa_error_entry_compare() `bcmp()` is not suitable for comparator use. Use `memcmp()` instead. perf: 2.8x faster metaslab_compare() and metaslab_rangesize_compare() perf: 2.8x faster zil_bp_compare() perf: 2.8x faster mze_compare() perf: faster dbuf_compare() perf: faster compares in spa_misc perf: 2.8x faster layout_hash_compare() perf: 2.8x faster space_reftree_compare() perf: libzfs: faster avl tree comparators perf: guid_compare() perf: dsl_deadlist_compare() perf: perm_set_compare() perf: 2x faster range_tree_seg_compare() perf: faster unique_compare() perf: faster vdev_cache _compare() perf: faster vdev_uberblock_compare() perf: faster fuid _compare() perf: faster zfs_znode_hold_compare() Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Richard Elling <richard.elling@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5033	2016-08-31 14:35:34 -07:00
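The branch-free comparator idea in the commit above can be sketched in plain C. This is only an illustration under stated assumptions: `sign_of`, `demo_key_t`, `demo_key_compare`, and `demo_offset_compare` are hypothetical names that do not appear in the ZFS source, and the 32-byte key merely stands in for a checksum-like key.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Branch-free sign of a signed integer: yields -1, 0, or 1. */
static inline int
sign_of(int a)
{
	return ((0 < a) - (a < 0));
}

/* Hypothetical 256-bit key standing in for a checksum-like key. */
typedef struct demo_key {
	uint8_t dk_bytes[32];
} demo_key_t;

/*
 * memcmp() may return any negative or positive value, so its result
 * is normalized to {-1, 0, 1} before being handed to a tree that
 * expects exactly those three values.
 */
static int
demo_key_compare(const void *x1, const void *x2)
{
	assert(x1 != NULL && x2 != NULL);
	return (sign_of(memcmp(x1, x2, sizeof (demo_key_t))));
}

/* The same trick applied directly to two unsigned offsets. */
static int
demo_offset_compare(uint64_t a, uint64_t b)
{
	return ((a > b) - (a < b));
}
```

Both helpers compute the ordering from two comparison results rather than from an if/else chain, which is the property the commit message credits for the reduced number of conditional jumps.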
Hajo Möller	82ab6848cc	Fix "zpool get guid,freeing,leaked" source `zpool get guid,freeing,leaked` shows SOURCE as `default`, it should be `-` as those props are not editable. Changed code to not overwrite `src` for `ZPOOL_PROP_VERSION`, so it stays `ZPROP_SRC_NONE`. Make src const to avoid future mistakes Signed-off-by: Hajo Möller <dasjoe@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4170	2016-08-30 15:57:15 -07:00
cao	8f50bafb04	Delete unused zfsctl_snapdir_inactive declaration zfsctl_snapdir_inactive is defined in zfs-0.6.3. In zfs-0.6.5.7 this declaration remains even though the implementation was removed in commit `278bee93`. Removed fastreboot_disable_highpil which is also unused. Signed-off-by: caoxuewen cao.xuewen@zte.com.cn Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5042	2016-08-30 14:33:40 -07:00
Simon Klinkert	db707ad094	OpenZFS 6940 - Cannot unlink directories when over quota From user perspective, I would expect that ZFS is always able to remove files and directories even when the quota is exceeded. Authored by: Simon Klinkert <simon.klinkert@gmail.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6940 OpenZFS-issue: https://www.illumos.org/issues/6334 OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/9918916 Closes #5044	2016-08-30 14:33:04 -07:00
Alexander Motin	755065f3dc	OpenZFS 6322 - ZFS indirect block predictive prefetch For quite some time I was thinking about the possibility of prefetching ZFS indirection tables while doing sequential reads or writes. Recent changes in the predictive prefetcher made that much easier to do. My tests on a zvol with 16KB block size on a 5x striped and 2x mirrored pool of 10 disks show almost double throughput on sequential read, and almost triple on sequential rewrite. While for reads a similar effect can be obtained by increasing the maximal prefetch distance (though at higher memory cost), for rewrites there is no other solution so far. Authored by: Alexander Motin <mav@freebsd.org> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6322 OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/cb92f413 Closes #5040 Porting notes: - Change from upstream in module/zfs/dbuf.c in 'int dbuf_read' due to commit `5f6d0b6` 'Handle block pointers with a corrupt logical size' - Difference from upstream in module/zfs/dmu_zfetch.c, uint32_t zfetch_max_idistance -> unsigned int zfetch_max_idistance - Variables have been initialized at the beginning of the function (void dmu_zfetch) to resemble the order of occurrence and account for C99, C11 mode errors.	2016-08-30 14:26:55 -07:00
Matthew Ahrens	98ace739bd	OpenZFS 7086 - ztest attempts dva_get_dsize_sync on an embedded blockpointer In dbuf_dirty(), we need to grab the dn_struct_rwlock before looking at the db_blkptr, to prevent it from being changed by syncing context. Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/7086 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/98fa317 Closes #5039	2016-08-30 14:25:50 -07:00
GeLiXin	9907cc1cc8	Add zfs_arc_meta_limit_percent tunable ARC will evict meta buffers that exceed the arc_meta_limit. Pending further investigation into whether meta buffers need special protection, this tunable makes arc_meta_limit adjustable for different workloads. Users can set zfs_arc_meta_limit_percent to any value when loading zfs.ko with insmod, so a range check is added to guarantee a suitable arc_meta_limit. As suggested by Tim Chase, zfs_arc_dnode_limit is changed to a percent-style tunable as well. Signed-off-by: GeLiXin <ge.lixin@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4957	2016-08-23 13:03:01 -07:00
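A percent-style tunable with a range check, as described in the commit above, might look roughly like the following. This is a minimal sketch, not the code from the commit: the function name `demo_meta_limit` is hypothetical, and capping the percentage at 100 is an assumption about what a "suitable" range means here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Derive a metadata limit from the ARC maximum and a user-supplied
 * percentage. Values above 100 are capped (an assumed policy) so the
 * limit can never exceed the ARC maximum itself.
 */
static uint64_t
demo_meta_limit(uint64_t arc_c_max, uint64_t percent)
{
	if (percent > 100)
		percent = 100;

	/*
	 * Divide first so the multiplication cannot overflow for very
	 * large arc_c_max values; this rounds down by at most
	 * (percent - 1) units for maxima not divisible by 100.
	 */
	return ((arc_c_max / 100) * percent);
}
```

With this shape, an out-of-range module parameter degrades to the widest allowed limit instead of producing a limit larger than the cache.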
Tim Chase	3e635ac15c	Prevent reclaim in send_traverse_thread() As is the case with traverse_prefetch_thread(), the deep stacks caused by traversal require disabling reclaim in the send traverse thread. Also, do the same for receive_writer_thread() in which similar problems have been observed. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4912 Closes #4998	2016-08-22 16:12:05 -07:00
Gvozden Neskovic	9cc1844a1d	Linux compat: Grsecurity kernel API Change: Module parameter set/get methods take const parameter in Grsecurity kernel v4.7.1 Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Jason Zaman <jason@perfinion.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4997 Closes #5001	2016-08-22 10:05:45 -07:00
Matthew Ahrens	2bce8049c3	OpenZFS 7004 - dmu_tx_hold_zap() does dnode_hold() 7x on same object Using a benchmark which has 32 threads creating 2 million files in the same directory, on a machine with 16 CPU cores, I observed poor performance. I noticed that dmu_tx_hold_zap() was using about 30% of all CPU, and doing dnode_hold() 7 times on the same object (the ZAP object that is being held). dmu_tx_hold_zap() keeps a hold on the dnode_t the entire time it is running, in dmu_tx_hold_t:txh_dnode, so it would be nice to use the dnode_t that we already have in hand, rather than repeatedly calling dnode_hold(). To do this, we need to pass the dnode_t down through all the intermediate calls that dmu_tx_hold_zap() makes, making these routines take the dnode_t* rather than an objset_t* and a uint64_t object number. In particular, the following routines will need to have analogous *_by_dnode() variants created: dmu_buf_hold_noread() dmu_buf_hold() zap_lookup() zap_lookup_norm() zap_count_write() zap_lockdir() zap_count_write() This can improve performance on the benchmark described above by 100%, from 30,000 file creations per second to 60,000. (This improvement is on top of that provided by working around the object allocation issue. Peak performance of ~90,000 creations per second was observed with 8 CPUs; adding CPUs past that decreased performance due to lock contention.) The CPU used by dmu_tx_hold_zap() was reduced by 88%, from 340 CPU-seconds to 40 CPU-seconds. Sponsored by: Intel Corp. Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/7004 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/109 Closes #4641 Closes #4972	2016-08-19 12:48:03 -07:00
Matthew Ahrens	8bea981504	OpenZFS 7003 - zap_lockdir() should tag hold zap_lockdir() / zap_unlockdir() should take a "void *tag" argument which tags the hold on the zap. This will help diagnose programming errors which misuse the hold on the ZAP. Sponsored by: Intel Corp. Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Pavel Zakharov <pavel.zakha@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/7003 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/108 Closes #4972	2016-08-19 12:35:23 -07:00
heary-cao	ee6370a7a4	Fix spa config generate memory leak in spa_load_best function When spa retry load succeeds and spa recovery is requested it may leak in spa_load_best function. Always free the generated config when it is not assigned to the spa. Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4940	2016-08-19 11:17:12 -07:00
Paul Dagnelie	32d41fb73a	OpenZFS 7176 - Yet another hole birth issue This is another bug in the long line of hole-birth related issues. In this particular case, it was discovered that a previous hole-birth fix (illumos bug 6513, commit `bc77ba73`) did not cover as many cases as we thought it did. While the issue worked in the case of hole-punching (writing zeroes to a large part of a file), it did not deal with truncation, and then writing beyond the new end of the file. The problem is that dbuf_findbp will return ENOENT if the block it's trying to find is beyond the end of the file. If that happens, we assume there is no birth time, and so we lose that information when we write out new blkptrs. We should teach dbuf_findbp to look for things that are beyond the current end, but not beyond the absolute end of the file. Authored by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matthew Ahrens mahrens@delphix.com Reviewed by: George Wilson george.wilson@delphix.com Ported-by: kernelOfTruth <kerneloftruth@gmail.com> Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/7176 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/173/commits/8b9f3ad Upstream-bugs: DLPX-46009 Porting notes: - Fix ISO C90 mixed declaration error in dbuf.c ( int nlevels, epbs; ) ; keep previous position of the initialization	2016-08-18 09:26:44 -07:00
Matthew Ahrens	d9eea113f8	It is not necessary to zero struct dbuf_hold_impl_data Under a workload which makes heavy use of `dbuf_hold()`, I noticed that a considerable amount of time was spent in `dbuf_hold_impl()`, due to its call to `kmem_zalloc(sizeof (struct dbuf_hold_impl_data) * DBUF_HOLD_IMPL_MAX_DEPTH)`, which is around 2KiB. This structure is used as a stack, to limit the size of the C stack as dbuf_hold() calls itself recursively. We make a recursive call to hold the parent's dbuf when the requested dbuf is not found. The vast majority of the time, the parent or grandparent indirect dbuf is cached, so the number of recursive calls is very low. However, we initialize this entire array for every call to dbuf_hold(). To improve performance, this commit changes `dbuf_hold()` to use `kmem_alloc()` instead of `kmem_zalloc()`. __dbuf_hold_impl_init is changed to initialize all members of the struct before they are used. I observed ~5% performance improvement on a workload which creates many files. Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4974	2016-08-16 15:27:17 -07:00
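The allocation change described in the commit above can be illustrated in userspace C, with `malloc()` standing in for `kmem_alloc()`. Everything here is a hypothetical sketch: `demo_hold_data_t`, its members, `DEMO_MAX_DEPTH`, and both functions are invented for illustration and are not the real `dbuf_hold_impl_data` layout.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define	DEMO_MAX_DEPTH	20	/* hypothetical recursion bound */

/* Hypothetical per-level state kept on a heap-allocated "stack". */
typedef struct demo_hold_data {
	uint64_t dh_blkid;
	int dh_level;
	int dh_depth;
	void *dh_parent;
} demo_hold_data_t;

/*
 * Allocate the whole array without zeroing it. Most calls only touch
 * the first one or two entries, so clearing every entry up front is
 * wasted work.
 */
static demo_hold_data_t *
demo_hold_stack_alloc(void)
{
	return (malloc(sizeof (demo_hold_data_t) * DEMO_MAX_DEPTH));
}

/*
 * Initialize exactly one level before it is used. Every member is
 * assigned explicitly, because nothing can be assumed about the
 * contents of non-zeroed memory.
 */
static void
demo_hold_init(demo_hold_data_t *stack, int depth, int level, uint64_t blkid)
{
	demo_hold_data_t *dh;

	assert(stack != NULL && depth >= 0 && depth < DEMO_MAX_DEPTH);
	dh = &stack[depth];
	dh->dh_blkid = blkid;
	dh->dh_level = level;
	dh->dh_depth = depth;
	dh->dh_parent = NULL;
}
```

The trade-off is that correctness now depends on the init routine covering every member, which is why the commit message stresses that all members are initialized before use.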
Gvozden Neskovic	fc897b24b2	Rework of fletcher_4 module - Benchmark memory block is increased to 128kiB to reflect real block sizes more accurately. Measurements include all three stages needed for checksum generation, i.e. `init()/compute()/fini()`. The inner loop is repeated multiple times to offset overhead of time function. - Fastest implementation selects native and byteswap methods independently in benchmark. To support this new function pointers `init_byteswap()/fini_byteswap()` are introduced. - Implementation mutex lock is replaced by atomic variable. - To save time, benchmark is not executed in userspace. Instead, highest supported implementation is used for fastest. Default userspace selector is still 'cycle'. - `fletcher_4_native/byteswap()` methods use incremental methods to finish calculation if data size is not multiple of vector stride (currently 64B). - Added `fletcher_4_native_varsize()` special purpose method for use when buffer size is not known in advance. The method does not enforce 4B alignment on buffer size, and will ignore last (size % 4) bytes of the data buffer. - Benchmark `kstat` is changed to match the one of vdev_raidz. It now shows throughput for all supported implementations (in B/s), native and byteswap, as well as the code [fastest] is running. Example of `fletcher_4_bench` running on `Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz`: implementation native byteswap scalar 4768120823 3426105750 sse2 7947841777 4318964249 ssse3 7951922722 6112191941 avx2 13269714358 11043200912 fastest avx2 avx2 Example of `fletcher_4_bench` running on `Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz`: implementation native byteswap scalar 1291115967 1031555336 sse2 2539571138 1280970926 ssse3 2537778746 1080016762 avx2 4950749767 1078493449 avx512f 9581379998 4010029046 fastest avx512f avx512f Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4952	2016-08-16 14:11:55 -07:00
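For readers unfamiliar with the checksum being reworked above, a scalar Fletcher-4 with an incremental update step can be sketched as follows. The running sums follow the standard Fletcher-4 recurrence over 32-bit words; the names `demo_cksum_t`, `demo_fletcher_4_init`, and `demo_fletcher_4_update` are hypothetical, and ignoring the trailing `size % 4` bytes mirrors the behaviour the commit message describes for the varsize method rather than reproducing its code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Four 64-bit running sums, in the style of a Fletcher-4 checksum. */
typedef struct demo_cksum {
	uint64_t word[4];
} demo_cksum_t;

static void
demo_fletcher_4_init(demo_cksum_t *zc)
{
	assert(zc != NULL);
	memset(zc, 0, sizeof (*zc));
}

/*
 * Fold `size` bytes into the running sums, reading native-endian
 * 32-bit words. Trailing (size % 4) bytes are ignored. Calling this
 * repeatedly on consecutive word-aligned chunks gives the same result
 * as one call over the whole buffer.
 */
static void
demo_fletcher_4_update(demo_cksum_t *zc, const void *buf, size_t size)
{
	const uint8_t *p = buf;
	size_t nwords = size / sizeof (uint32_t);
	uint64_t a = zc->word[0], b = zc->word[1];
	uint64_t c = zc->word[2], d = zc->word[3];

	for (size_t i = 0; i < nwords; i++) {
		uint32_t w;

		/* memcpy avoids unaligned or aliasing-unsafe loads. */
		memcpy(&w, p + i * sizeof (w), sizeof (w));
		a += w;
		b += a;
		c += b;
		d += c;
	}

	zc->word[0] = a;
	zc->word[1] = b;
	zc->word[2] = c;
	zc->word[3] = d;
}
```

Because the state is just the four sums, a vectorized implementation can process the bulk of a buffer and leave any remainder shorter than its stride to a scalar routine of this shape.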
Rich Ercolani	6d836e6f8b	Add tunable to ignore hole_birth Adds a module option which disables the hole_birth optimization which has been responsible for several recent bugs, including issue #4050. Original-patch: https://gist.github.com/pcd1193182/2c0cd47211f3aee623958b4698836c48 Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4833	2016-08-15 09:52:56 -07:00
GeLiXin	e35c5a8265	Fix incorrect pool state after import When importing a raidz pool which has a vdev with a bad label, zpool status shows the right state of the vdev, but the wrong state of the pool. The pool state should be DEGRADED, not ONLINE. We examine the label in vdev_validate while in spa_load_impl; the bad label can be detected but its state is not propagated to the parent. There are other chances to propagate state in the following vdev_load if we failed to load the DTL, but our pool is raidz1 which can tolerate a faulted disk. So we lost the last chance to correct the pool state. Propagate the leaf vdev's state to the parent if its label was corrupted, as is done elsewhere in vdev_validate. Signed-off-by: GeLiXin <ge.lixin@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@intel.com> Closes #4948	2016-08-12 13:46:51 -07:00
Hans Rosenfeld	fb390aafc8	OpenZFS 5997 - FRU field not set during pool creation and never updated Authored by: Hans Rosenfeld <hans.rosenfeld@nexenta.com> Reviewed by: Dan Fields <dan.fields@nexenta.com> Reviewed by: Josef Sipek <josef.sipek@nexenta.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Signed-off-by: Don Brady <don.brady@intel.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/5997 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1437283 Porting Notes: In addition to the OpenZFS changes this patch realigns the events with those found in OpenZFS. Events which would be logged as sysevents on illumos have been mapped to the 'sysevent' class for Linux. In addition, several subclass names have been changed to match what is used in OpenZFS. In all cases this means a '.' was changed to an '_' in the subclass. The scripts provided by ZoL have been updated, however users who provide scripts for any of the following events will need to rename them based on the new subclass names (old -> new): ereport.fs.zfs.config.sync -> sysevent.fs.zfs.config_sync; ereport.fs.zfs.zpool.destroy -> sysevent.fs.zfs.pool_destroy; ereport.fs.zfs.zpool.reguid -> sysevent.fs.zfs.pool_reguid; ereport.fs.zfs.vdev.remove -> sysevent.fs.zfs.vdev_remove; ereport.fs.zfs.vdev.clear -> sysevent.fs.zfs.vdev_clear; ereport.fs.zfs.vdev.check -> sysevent.fs.zfs.vdev_check; ereport.fs.zfs.vdev.spare -> sysevent.fs.zfs.vdev_spare; ereport.fs.zfs.vdev.autoexpand -> sysevent.fs.zfs.vdev_autoexpand; ereport.fs.zfs.resilver.start -> sysevent.fs.zfs.resilver_start; ereport.fs.zfs.resilver.finish -> sysevent.fs.zfs.resilver_finish; ereport.fs.zfs.scrub.start -> sysevent.fs.zfs.scrub_start; ereport.fs.zfs.scrub.finish -> sysevent.fs.zfs.scrub_finish; ereport.fs.zfs.bootfs.vdev.attach -> sysevent.fs.zfs.bootfs_vdev_attach	2016-08-12 13:06:48 -07:00
Chen Haiquan	d9c97ec08b	Use file_dentry and file_inode wrappers Fix bugs due to kernel change in torvalds/linux@4bacc9c923 ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay"). This problem crashes the system when ZFS is used as a layer of overlayfs. Signed-off-by: Chen Haiquan <oc@yunify.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4914 Closes #4935	2016-08-11 12:06:37 -07:00
GeLiXin	d5884c3453	Fix indefinite article The indefinite article before nvlist should be "an", not "a". We have 27 "an nvlist" and 7 "a nvlist" in our comment, they should stay the same as we are such a strict filesystem. Signed-off-by: GeLiXin <ge.lixin@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4941	2016-08-11 11:23:49 -07:00
Brian Behlendorf	e5fe9ddeec	Remove custom root pool import code Non-Linux OpenZFS implementations require additional support to be used as a root pool. This code should simply be removed to avoid confusion and improve readability. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #4951	2016-08-11 11:19:34 -07:00
Brian Behlendorf	cf41432c70	Linux 4.8 compat: Fix removal of bio->bi_rw member All users of bio->bi_rw have been replaced with compatibility wrappers. This allows the kernel specific logic to be abstracted away, and for each of the supported cases to be documented with the wrapper. The updated interfaces are as follows: void blk_queue_set_write_cache(struct request_queue *, bool, bool); boolean_t bio_is_flush(struct bio *); boolean_t bio_is_fua(struct bio *); boolean_t bio_is_discard(struct bio *); boolean_t bio_is_secure_erase(struct bio *); VDEV_WRITE_FLUSH_FUA Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #4951	2016-08-11 11:19:34 -07:00
Gvozden Neskovic	689f093ebc	Build user-space with different gcc optimization levels This fix resolves warnings reported during compiling of user-space libraries with different gcc optimization levels. Tested with gcc versions: 4.9.2 (Debian), and 6.1.1 (Fedora). The patch enables use of following opt levels: O0, O1, O2, O3, Og, Os, Ofast. List of warnings: [GCC 4.9.2 -Os] libzfs_sendrecv.c:3726:26: error: 'clp' may be used uninitialized in this function [-Werror=maybe-uninitialized] [GCC 4.9.2 -Og] fs_fletcher.c:323:26: error: 'idx' may be used uninitialized in this function [-Werror=maybe-uninitialized] dsl_dataset.c:1290:12: error: 'atp' may be used uninitialized in this function [-Werror=maybe-uninitialized] [GCC 4.9.2 -Ofast] u8_textprep.c:1310:9: error: 'tc[3ul]' may be used uninitialized in this function [-Werror=maybe-uninitialized] u8_textprep.c:177:23: error: 'u8t[0ul]' may be used uninitialized in this function [-Werror=maybe-uninitialized] dsl_dataset.c:2089:37: error: ‘hds’ may be used uninitialized in this function [-Werror=maybe-uninitialized] dsl_dataset.c:3216:2: error: ‘ds’ may be used uninitialized in this function [-Werror=maybe-uninitialized] dsl_dataset.c:1591:2: error: ‘ds’ may be used uninitialized in this function [-Werror=maybe-uninitialized] dsl_dataset.c:3341:2: error: ‘ds’ may be used uninitialized in this function [-Werror=maybe-uninitialized] vdev_raidz.c:1153:8: error: 'dcount[2]' may be used uninitialized in this function [-Werror=maybe-uninitialized] vdev_raidz.c:1167:17: error: 'dst[2]' may be used uninitialized in this function [-Werror=maybe-uninitialized] kernel.c:1005:2: error: ‘resid’ may be used uninitialized in this function [-Werror=maybe-uninitialized] libzfs_dataset.c:2826:8: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized] libzfs_dataset.c:3056:35: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized] libzfs_dataset.c:1584:13: error: ‘val’ may be used 
uninitialized in this function [-Werror=maybe-uninitialized] libzfs_dataset.c:3056:35: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized] libzfs_dataset.c:1792:66: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized] libzfs_dataset.c:3986:35: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized] [GCC 6.1.1] Resolved in PR #4907 Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4937	2016-08-09 14:40:35 -07:00
Chunwei Chen	afb6c031e8	Linux 4.7 compat: fix zpl_get_acl returns invalid acl pointer Starting from Linux 4.7, get_acl will set the acl cache pointer to a temporary sentinel value before calling i_op->get_acl. Therefore we can't compare against ACL_NOT_CACHED and return. Since Linux 3.14, get_acl already checks the cache for us, so we disable this check in zpl_get_acl. Linux 4.7 also does set_cached_acl for us so we disable it in zpl_get_acl. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4944 Closes #4946	2016-08-09 10:03:04 -07:00
Brian Behlendorf	4b908d3220	Linux 4.8 compat: posix_acl_valid() The posix_acl_valid() function has been updated to require a user namespace. Filesystem callers should normally provide the user_ns from the super block associated with the ACL; the zpl_posix_acl_valid() wrapper has been added for this purpose. See https://github.com/torvalds/linux/commit/0d4d717f for complete details. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #4922	2016-08-08 11:46:40 -07:00
Brian Behlendorf	e85a6396b0	Retire HAVE_CURRENT_UMASK and HAVE_POSIX_ACL_CACHING Remove ZFS_AC_KERNEL_CURRENT_UMASK and ZFS_AC_KERNEL_POSIX_ACL_CACHING configure checks, all supported kernels provide this functionality. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #4922	2016-08-08 11:46:32 -07:00
Nikolay Borisov	64aefee1b8	Fix interaction between userns uid/gid and SA * When the uid/gid change is handled in zfs_setattr we want to actually adjust the user passed uid to a KUID and write that to disk. * In trace points use the i_uid member without doing translation, since it has already been performed. * Use kuid in zfs_aclset_common Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4928	2016-08-08 10:47:43 -07:00
Gaurav Kumar	cf2731e65b	arc_meta_limit should be updated when arc_max is changed. When arc_max is increased, arc_meta_limit will not be updated to 3/4 of the new arc_c_max value. This was done originally to preserve any existing maximum value. This turned out to be counterintuitive to users and this fix changes that behavior. If zfs_arc_meta_limit is non-default, it will be picked up later in the ARC tuning function. Signed-off-by: Gaurav Kumar <gaurav.kumar@nutanix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4893	2016-08-02 13:43:36 -07:00
Brian Behlendorf	efe7978d89	Fix gcc self-comparison warning As of gcc 6.1.1 20160621 (Red Hat 6.1.1-3) a self-comparison is detected by gcc in metaslab_alloc(). Resolve the warning by passing a physical size of 0 to BP_SET_BIRTH() as is done by other callers. module/zfs/metaslab.c: In function ‘metaslab_alloc’: module/zfs/metaslab.c:2575:184: error: self-comparison always evaluates to true [-Werror=tautological-compare] Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Issue #4907	2016-08-02 13:14:18 -07:00
Tony Hutter	4eb0db42d3	Fix possible VDEV stats array overflow Fix a possible VDEV statistics array overflow when ZIOs with ZIO_PRIORITY_NOW complete. Signed-off-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4883 Closes #4917	2016-08-02 08:45:24 -07:00
Nikolay Borisov	ba2fe6affb	Move assignment of i_blkbits field Currently i_blkbits is always set to SPA_MINBLOCKSHIFT every time zfs_inode_update_impl is called. Since this value never changes, move its assignment to inode creation time. Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4906	2016-07-29 15:34:12 -07:00
heary-cao	9f3d1407dc	Fix zfs_allow_log_destroy() NULL dereference In zfs_ioc_log_history() function the tsd_set() function is called with NULL which causes the zfs_allow_log_destroy() to be run. In this case the passed value will be NULL. This is normally entirely safe because strfree() maps directly to kfree() which may be passed a NULL. However, since alternate implementations of strfree() may not handle this gracefully add a check for NULL. Observed under an embedded Linux 2.6.32.41 kernel running the automated testing while running the ZFS Test Suite. Signed-off-by: caoxuewen <cao.xuewen@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4872	2016-07-29 15:34:12 -07:00
Chunwei Chen	3b86aeb295	Linux 4.8 compat: REQ_OP and bio_set_op_attrs() New REQ_OP_* definitions have been introduced to separate the WRITE, READ, and DISCARD operations from the flags. This included changing the encoding of bi_rw. It places REQ_OP_* in high order bits and other stuff in low order bits. This encoding is done through the new helper function bio_set_op_attrs. For complete details refer to: https://github.com/torvalds/linux/commit/f215082 https://github.com/torvalds/linux/commit/4e1b2d5 Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4892 Closes #4899	2016-07-29 14:48:19 -07:00
Brian Behlendorf	bbb1b6cea7	Linux 4.8 compat: submit_bio() The rw argument has been removed from submit_bio/submit_bio_wait. Callers are now expected to set bio->bi_rw instead of passing it in. See https://github.com/torvalds/linux/commit/4e49ea4a for complete details. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4892 Issue #4899	2016-07-29 14:48:00 -07:00
Richard Yao	f26b4b3c8a	txg visibility code should not execute under tc_open_lock The memory allocation and locking in `spa_txg_history_*()` can potentially block txg_hold_open for arbitrarily long periods of time. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4333	2016-07-27 14:11:13 -07:00
Brian Behlendorf	fcf64f45d9	Fix zdb crash with 4K-only devices Here's the problem - on 4K native devices in userland on Linux using O_DIRECT, buffers must be 4K aligned or I/O will fail with EINVAL, causing zdb (and others) to coredump. Since userland probably doesn't need optimized buffer caches, we just force 4K alignment on everything. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Closes #4479	2016-07-27 13:38:46 -07:00
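The alignment requirement described in the entry above can be sketched in userland C as follows. This is a minimal illustration, not the code from the commit: the names `DEMO_IO_ALIGN`, `demo_alloc_aligned()` and `demo_is_aligned()` are hypothetical, and it assumes a POSIX system that provides `posix_memalign()`.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical constant: alignment demanded by a 4K-native device. */
#define DEMO_IO_ALIGN 4096

/*
 * Allocate a buffer whose address is a multiple of DEMO_IO_ALIGN,
 * regardless of the requested size. Returns NULL on failure.
 * The caller releases the buffer with free().
 */
static void *
demo_alloc_aligned(size_t size)
{
	void *buf = NULL;

	if (posix_memalign(&buf, DEMO_IO_ALIGN, size) != 0)
		return (NULL);
	return (buf);
}

/* Return nonzero when buf satisfies the O_DIRECT alignment rule. */
static int
demo_is_aligned(const void *buf)
{
	return (((uintptr_t)buf % DEMO_IO_ALIGN) == 0);
}
```

Forcing the same alignment on every allocation trades a little memory for simplicity, which matches the reasoning in the commit message that userland does not need optimized buffer caches.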
Colin Ian King	bf18fd89f9	Avoid integer overflow on computation of refquota_slack DMU_MAX_ACCESS should be cast to a uint64_t otherwise the multiplication of DMU_MAX_ACCESS with spa_asize_inflation will be 32 bit and may lead to an overflow. Currently DMU_MAX_ACCESS is 64 * 1024 * 1024, so spa_asize_inflation being 64 or more will lead to an overflow. Found by static analysis with CoverityScan 0.8.5 CID 150942 (#1 of 1): Unintentional integer overflow (OVERFLOW_BEFORE_WIDEN) overflow_before_widen: Potentially overflowing expression 67108864 * spa_asize_inflation with type int (32 bits, signed) is evaluated using 32-bit arithmetic, and then used in a context that expects an expression of type uint64_t (64 bits, unsigned). Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4889	2016-07-27 13:38:46 -07:00
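The overflow-before-widen pattern reported above can be illustrated with a small sketch. `DEMO_MAX_ACCESS` and both helper functions are hypothetical stand-ins, not the ZFS definitions. The original expression multiplied signed `int` values, where overflow is undefined behaviour; the narrow variant here uses `uint32_t` so that the wraparound is well defined and can be observed safely.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the constant in the message: type int, 2^26. */
#define DEMO_MAX_ACCESS (64 * 1024 * 1024)

/*
 * Multiply in 32 bits and widen afterwards. The product is truncated
 * before it ever reaches the 64-bit return type.
 */
static uint64_t
demo_slack_narrow(uint32_t inflation)
{
	uint32_t product = (uint32_t)((uint32_t)DEMO_MAX_ACCESS * inflation);

	return (product);
}

/*
 * Widen first, then multiply. The cast makes the whole multiplication
 * happen in 64-bit arithmetic.
 */
static uint64_t
demo_slack_wide(uint32_t inflation)
{
	return ((uint64_t)DEMO_MAX_ACCESS * inflation);
}
```

With an inflation factor of 64 the narrow variant wraps to 0 while the wide variant yields 2^32, which is the threshold the message describes.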
Tim Chase	25458cbef9	Limit the amount of dnode metadata in the ARC Metadata-intensive workloads can cause the ARC to become permanently filled with dnode_t objects as they're pinned by the VFS layer. Subsequent data-intensive workloads may only benefit from about 25% of the potential ARC (arc_c_max - arc_meta_limit). In order to help track metadata usage more precisely, the other_size metadata arcstat has been replaced with dbuf_size, dnode_size and bonus_size. The new zfs_arc_dnode_limit tunable, which defaults to 10% of zfs_arc_meta_limit, defines the minimum number of bytes which is desirable to be consumed by dnodes. Attempts to evict non-metadata will trigger async prune tasks if the space used by dnodes exceeds this limit. The new zfs_arc_dnode_reduce_percent tunable specifies the amount by which the excess dnode space is attempted to be pruned as a percentage of the amount by which zfs_arc_dnode_limit is being exceeded. By default, it tries to unpin 10% of the dnodes. The problem of dnode metadata pinning was observed with the following testing procedure (in this example, zfs_arc_max is set to 4GiB): - Create a large number of small files until arc_meta_used exceeds arc_meta_limit (3GiB with default tuning) and arc_prune starts increasing. - Create a 3GiB file with dd. Observe arc_meta_used. It will still be around 3GiB. - Repeatedly read the 3GiB file and observe arc_meta_used as before. It will continue to stay around 3GiB. With this modification, space for the 3GiB file is gradually made available as subsequent demands on the ARC are made. The previous behavior can be restored by setting zfs_arc_dnode_limit to the same value as the zfs_arc_meta_limit. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4345 Issue #4512 Issue #4773 Closes #4858	2016-07-25 15:26:38 -07:00
Tim Chase	e6603b7c1f	Fix sync behavior for disk vdevs Prior to `b39c22b`, which was first generally available in the 0.6.5 release as `b39c22b`, ZoL never actually submitted synchronous read or write requests to the Linux block layer. This means the vdev_disk_dio_is_sync() function had always returned false and, therefore, the completion in dio_request_t.dr_comp was never actually used. In `b39c22b`, synchronous ZIO operations were translated to synchronous BIO requests in vdev_disk_io_start(). The follow-on commits `5592404` and `aa159af` fixed several problems introduced by `b39c22b`. In particular, `5592404` introduced the new flag parameter "wait" to __vdev_disk_physio() but under ZoL, since vdev_disk_physio() is never actually used, the wait flag was always zero so the new code had no effect other than to cause a bug in the use of the dio_request_t.dr_comp which was fixed by `aa159af`. The original rationale for introducing synchronous operations in `b39c22b` was to hurry certain requests through the BIO layer which would have otherwise been subject to its unplug timer which would increase the latency. This behavior of the unplug timer, however, went away during the transition of the plug/unplug system between kernels 2.6.32 and 2.6.39. To handle the unplug timer behavior on 2.6.32-2.6.35 kernels the BIO_RW_UNPLUG flag is used as a hint to suppress the plugging behavior. For kernels 2.6.36-2.6.38, the REQ_UNPLUG macro will be available and is used for the same purpose. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4858	2016-07-25 14:24:47 -07:00
Nikolay Borisov	2c6abf15ff	Remove znode's z_uid/z_gid member Remove duplicate z_uid/z_gid members which are also held in the generic vfs inode struct. This is done by first removing the members from struct znode and then using the KUID_TO_SUID/KGID_TO_SGID macros to access the respective member from struct inode. In cases where the uid/gids are being marshalled from/to disk, use the newly introduced zfs_(uid|gid)_(read|write) functions to properly save the uids rather than the internal kernel representation. Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4685 Issue #227	2016-07-25 13:21:49 -07:00
Tom Caputi	77943bc1dc	Fix for metaslab_fastwrite_unmark() assert failure Currently there is an issue where metaslab_fastwrite_unmark() unmarks fastwrites on vdev_t's that have never had fastwrites marked on them. The 'fastwrite mark' is essentially a count of outstanding bytes that will be written to a vdev and is used in syncing context. The problem stems from the fact that the vdev_pending_fastwrite field is not being transferred over when replacing a top-level vdev. As a result, the metaslab is marked for fastwrite on the old vdev and unmarked on the new one, which brings the fastwrite count below zero. This fix simply assigns vdev_pending_fastwrite from the old vdev to the new one so this count is not lost. Signed-off-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4267	2016-07-25 13:21:43 -07:00
Chunwei Chen	be88e733a6	Fix NULL pointer in zfs_preumount from `1d9b3bd` When zfs_domount fails zsb will be freed, and its caller mount_nodev/get_sb_nodev will do deactivate_locked_super and calls into zfs_preumount. In order to make sure we don't touch any nonexistent stuff, we must make sure s_fs_info is NULL in the fail path so zfs_preumount can easily check that. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4867 Issue #4854	2016-07-20 09:05:43 -07:00
Gvozden Neskovic	26a08b5ca9	RAIDZ parity kstat rework Print table with speed of methods for each implementation. Last line describes contents of [fastest] selection. Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4860	2016-07-19 16:43:07 -07:00
Gvozden Neskovic	c9187d867f	Fixes and enhancements of SIMD raidz parity - Implementation lock replaced with atomic variable - Trailing whitespace is removed from user specified parameter, to enhance experience when using commands that add newline, e.g. `echo` - raidz_test: remove dependency on `getrusage()` and RUSAGE_THREAD, Issue #4813 - silence `cppcheck` in vdev_raidz, partial solution of Issue #1392 - Minor fixes and cleanups - Enable use of original parity methods in [fastest] configuration. New opaque original ops structure, representing native methods, is added to supported raidz methods. Original parity methods are executed if selected implementation has NULL fn pointer. Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4813 Issue #1392	2016-07-19 16:43:07 -07:00
Chunwei Chen	1d9b3bd8fb	Wait iput_async before evict_inodes to prevent race Wait for iput_async before entering evict_inodes in generic_shutdown_super. The reason we must finish before evict_inodes is when lazytime is on, or when zfs_purgedir calls zfs_zget, iput would bump i_count from 0 to 1. This would race with the i_count check in evict_inodes. This means it could destroy the inode while we are still using it. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4854	2016-07-19 09:23:58 -07:00
Roman Strashkin	1b87e0f532	Fix filesystem destroy with receive_resume_token It is possible that the given DS may have hidden child (%recv) datasets - "leftovers" resulting from the previously interrupted 'zfs receive'. Try to remove the hidden child (%recv) and after that try to remove the target dataset. If the hidden child (%recv) does not exist the original error (EEXIST) will be returned. Signed-off-by: Roman Strashkin <roman.strashkin@nexenta.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4818	2016-07-15 15:34:46 -07:00
Chris Dunlop	dfbc86309f	Use native inode->i_nlink instead of znode->z_links A mostly mechanical change, taking into account i_nlink is 32 bits vs ZFS's 64 bit on-disk link count. We revert "xattr dir doesn't get purged during iput" (`ddae16a`) as this is a more Linux-integrated fix for the same issue. In addition, setting the initial link count on a new node has been changed from setting one less than required in zfs_mknode() then incrementing to the correct count in zfs_link_create() (which was somewhat bizarre in the first place), to setting the correct count in zfs_mknode() and not incrementing it in zfs_link_create(). This both means we no longer set the link count in sa_bulk_update() twice (once for the initial incorrect count then again for the correct count), as well as adhering to the Linux requirement of not incrementing a zero link count without I_LINKABLE (see linux commit f4e0c30c). Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #4838 Issue #227	2016-07-14 16:25:34 -07:00
Chunwei Chen	02de3e3c5d	Fix dbuf_stats_hash_table_data race Dropping DBUF_HASH_MUTEX when walking the hash list is unsafe. The dbuf can be freed at any time. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4846	2016-07-14 16:25:04 -07:00
Tim Chase	8887c7d778	Prevent null dereferences when accessing dbuf kstat In arc_buf_info(), the arc_buf_t may have no header. If not, don't try to fetch the arc buffer stats and instead just zero them. The null dereferences were observed while accessing the dbuf kstat with awk on a system in which millions of small files were being created in order to overflow the system's metadata limit. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #4837	2016-07-14 16:25:03 -07:00
Gvozden Neskovic	ae25d22235	Add RAID-Z routines for SSE2 instruction set, in x86_64 mode. The patch covers low-end and older x86 CPUs. Parity generation is equivalent to SSSE3 implementation, but reconstruction is somewhat slower. Previous 'sse' implementation is renamed to 'ssse3' to indicate highest instruction set used. Benchmark results: scalar_rec_p 4 720476442 scalar_rec_q 4 187462804 scalar_rec_r 4 138996096 scalar_rec_pq 4 140834951 scalar_rec_pr 4 129332035 scalar_rec_qr 4 81619194 scalar_rec_pqr 4 53376668 sse2_rec_p 4 2427757064 sse2_rec_q 4 747120861 sse2_rec_r 4 499871637 sse2_rec_pq 4 522403710 sse2_rec_pr 4 464632780 sse2_rec_qr 4 319124434 sse2_rec_pqr 4 205794190 ssse3_rec_p 4 2519939444 ssse3_rec_q 4 1003019289 ssse3_rec_r 4 616428767 ssse3_rec_pq 4 706326396 ssse3_rec_pr 4 570493618 ssse3_rec_qr 4 400185250 ssse3_rec_pqr 4 377541245 original_rec_p 4 691658568 original_rec_q 4 195510948 original_rec_r 4 26075538 original_rec_pq 4 103087368 original_rec_pr 4 15767058 original_rec_qr 4 15513175 original_rec_pqr 4 10746357 Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4783	2016-07-13 10:24:55 -07:00
Gvozden Neskovic	1bf3bf0e29	Fix handling of errors nvlist in zfs_ioc_recv_new() zfs_ioc_recv_impl() is changed to always allocate the 'errors' nvlist, its callers are responsible for freeing it. Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4829	2016-07-13 10:20:08 -07:00
Peng	81edd3e834	Fix PANIC: metaslab_free_dva(): bad DVA X:Y:Z The following scenario can result in garbage in the dn_spill field. The db->db_blkptr must be set to NULL when DNODE_FLAG_SPILL_BLKPTR is clear to ensure the dn_spill field is cleared. Current txg = A. * A new spill buffer is created. Its dbuf is initialized with db_blkptr = NULL and it's dirtied. Current txg = B. * The spill buffer is modified. It's marked as dirty in this txg. * Additional changes make the spill buffer unnecessary because the xattr fits into the bonus buffer, so it's removed. The dbuf is undirtied in this txg, but it's still referenced and cannot be destroyed. Current txg = C. * Starts syncing of txg A * dbuf_sync_leaf() is called for the spill buffer. Since db_blkptr is NULL, dbuf_check_blkptr() is called. * The dbuf starts being written and it reaches the ready state (not done yet). * A new change makes the spill buffer necessary again. sa_build_layouts() ends up calling dbuf_find() to locate the dbuf. It finds the old dbuf because it has not been destroyed yet (it will be destroyed when the previous write is done and there are no more references). The old dbuf has db_blkptr != NULL. * txg A write is complete and the dbuf released. However it's still referenced, so it's not destroyed. Current txg = D. * Starts syncing of txg B * dbuf_sync_leaf() is called for the bonus buffer. Its contents are directly copied into the dnode, overwriting the blkptr area because, in txg B, the bonus buffer was big enough to hold the entire xattr. * At this point, the db_blkptr of the spill buffer used in txg C gets corrupted. Signed-off-by: Peng <peng.hse@xtaotech.com> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3937	2016-07-12 16:47:44 -07:00
Chunwei Chen	31b6111fd9	Kill zp->z_xattr_parent to prevent pinning zp->z_xattr_parent will pin the parent. This causes a serious problem when unlinking a file with xattrs. Because the unlinked file is pinned, it will never get purged immediately. And because of that, the xattr stuff will never be marked as unlinked. So the whole unlinked stuff will stay there until shrink cache or umount. This change partially reverts `e89260a`. This is safe because only the zp->z_xattr_parent optimization is removed, zpl_xattr_security_init() is still called from the zpl outside the inode lock. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Issue #4359 Issue #3508 Issue #4413 Issue #4827	2016-07-12 14:18:10 -07:00
Chunwei Chen	ddae16a9cf	xattr dir doesn't get purged during iput We need to set inode->i_nlink to zero so iput will purge it. Without this, it will get purged during shrink cache or umount, which would likely result in deadlock due to zfs_zget waiting forever on its children which are in the dispose_list of the same thread. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Issue #4359 Issue #3508 Issue #4413 Issue #4827	2016-07-12 14:04:30 -07:00
Chunwei Chen	6c2530647c	fh_to_dentry should return ESTALE when generation mismatch A generation mismatch usually means the file pointed to by the file handle was deleted. We should return ESTALE to indicate this. We return ENOENT in zfs_vget since zpl_fh_to_dentry will convert it to ESTALE. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4828	2016-07-12 13:34:15 -07:00
Chunwei Chen	bffb68a2b8	Fix Large kmem_alloc in vdev_metaslab_init This allocation can go way over 1MB, so we should use vmem_alloc instead of kmem_alloc. Large kmem_alloc(1430784, 0x1000), please file an issue... Call Trace: [<ffffffffa0324aff>] ? spl_kmem_zalloc+0xef/0x160 [spl] [<ffffffffa17d0c8d>] ? vdev_metaslab_init+0x9d/0x1f0 [zfs] [<ffffffffa17d46d0>] ? vdev_load+0xc0/0xd0 [zfs] [<ffffffffa17d4643>] ? vdev_load+0x33/0xd0 [zfs] [<ffffffffa17c0004>] ? spa_load+0xfc4/0x1b60 [zfs] [<ffffffffa17c1838>] ? spa_tryimport+0x98/0x430 [zfs] [<ffffffffa17f28b1>] ? zfs_ioc_pool_tryimport+0x41/0x80 [zfs] [<ffffffffa17f5669>] ? zfsdev_ioctl+0x4a9/0x4e0 [zfs] [<ffffffff811bacdf>] ? do_vfs_ioctl+0x2cf/0x4b0 [<ffffffff811baf41>] ? SyS_ioctl+0x81/0xa0 Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4752	2016-07-12 13:34:15 -07:00
Chunwei Chen	7938c2aca7	Don't allow accessing XATTR via export handle Allowing access to XATTR through an export handle is a very bad idea. It would allow a user to write whatever they want in fields where they otherwise could not. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4828	2016-07-12 13:34:14 -07:00
Chunwei Chen	061460dfe2	Fix get_zfs_sb race with concurrent umount Certain ioctl operations will call get_zfs_sb, which will hold an active count on sb without checking whether it's active or not. This will result in use-after-free. We fix this by using atomic_inc_not_zero to make sure we got an active sb. P1 P2 --- --- deactivate_locked_super(): s_active = 0 zfs_sb_hold() ->get_zfs_sb(): s_active = 1 ->zpl_kill_sb() -->zpl_put_super() --->zfs_umount() ---->zfs_sb_free(zsb) zfs_sb_rele(zsb) Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-07-12 13:34:14 -07:00
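The "increment only if not already zero" idea behind the fix above can be sketched in portable C11. This is an illustration of the technique, not the kernel's atomic_inc_not_zero implementation; `demo_inc_not_zero()` is a hypothetical name and the sketch relies only on `<stdatomic.h>`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Take a reference only if the object is still live. Returns true and
 * bumps the count when it was nonzero; returns false and leaves the
 * count untouched when it had already dropped to zero.
 */
static bool
demo_inc_not_zero(atomic_int *count)
{
	int old = atomic_load(count);

	while (old != 0) {
		/* On failure, old is refreshed with the current value. */
		if (atomic_compare_exchange_weak(count, &old, old + 1))
			return (true);
	}
	return (false);
}
```

In the race shown in the message, P2 would see a count of zero and get `false` back, so it would never reach the point of using a superblock that P1 is about to free.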
Gvozden Neskovic	590c9a0994	Allow building with `CFLAGS="-O0"` If compiled with -O0, gcc doesn't do any stack frame coalescing and -Wframe-larger-than=1024 is triggered in debug mode. Starting with gcc 4.8, new opt level -Og is introduced for debugging, which does not trigger this warning. Fix bench zio size, using SPA_OLD_MAXBLOCKSHIFT Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4799	2016-07-11 16:53:02 -07:00
Paul Dagnelie	d1d19c7854	OpenZFS 6876 - Stack corruption after importing a pool with a too-long name Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Calling dsl_dataset_name on a dataset with a 256 byte buffer is asking for trouble. We should check every dataset on import, using a 1024 byte buffer and checking each time to see if the dataset's new name is longer than 256 bytes. OpenZFS-issue: https://www.illumos.org/issues/6876 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ca8674e	2016-06-28 13:47:04 -07:00
Igor Kozhukhov	eca7b76001	OpenZFS 6314 - buffer overflow in dsl_dataset_name Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6314 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d6160ee	2016-06-28 13:47:03 -07:00
Brian Behlendorf	43e52eddb1	Implement zfs_ioc_recv_new() for OpenZFS 2605 Adds ZFS_IOC_RECV_NEW for resumable streams and preserves the legacy ZFS_IOC_RECV user/kernel interface. The new interface supports all stream options but is currently only used for resumable streams. This way updated user space utilities will interoperate with older kernel modules. ZFS_IOC_RECV_NEW is modeled after the existing ZFS_IOC_SEND_NEW handler. Non-Linux OpenZFS platforms have opted to change the legacy interface in an incompatible fashion instead of adding a new ioctl. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-06-28 13:47:03 -07:00
Dan McDonald	8c62a0d0f3	OpenZFS 6562 - Refquota on receive doesn't account for overage Authored by: Dan McDonald <danmcd@omniti.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Toomas Soome <tsoome@me.com> Approved by: Gordon Ross <gwr@nexenta.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6562 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5f7a8e6	2016-06-28 13:47:03 -07:00
Dan McDonald	671c93546c	OpenZFS 4986 - receiving replication stream fails if any snapshot exceeds refquota Authored by: Dan McDonald <danmcd@omniti.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Gordon Ross <gordon.ross@nexenta.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/4986 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5878fad	2016-06-28 13:47:03 -07:00
Eli Rosenthal	f8866f8ae3	OpenZFS 6738 - zfs send stream padding needs documentation Authored by: Eli Rosenthal <eli.rosenthal@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6738 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c20404ff	2016-06-28 13:47:03 -07:00
Andrew Stormont	b607405fea	OpenZFS 6536 - zfs send: want a way to disable setting of DRR_FLAG_FREERECORDS Authored by: Andrew Stormont <astormont@racktopsystems.com> Reviewed by: Anil Vijarnia <avijarnia@racktopsystems.com> Reviewed by: Kim Shrier <kshrier@racktopsystems.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6536 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/880094b	2016-06-28 13:47:03 -07:00
Paul Dagnelie	e6d3a843d6	OpenZFS 6393 - zfs receive a full send as a clone Authored by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Approved by: Dan McDonald <danmcd@omniti.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6394 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/68ecb2e	2016-06-28 13:47:03 -07:00
Matthew Ahrens	47dfff3b86	OpenZFS 2605, 6980, 6902 2605 want to resume interrupted zfs send Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed by: Xin Li <delphij@freebsd.org> Reviewed by: Arne Jansen <sensille@gmx.net> Approved by: Dan McDonald <danmcd@omniti.com> Ported-by: kernelOfTruth <kerneloftruth@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/2605 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9c3fd12 6980 6902 causes zfs send to break due to 32-bit/64-bit struct mismatch Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6980 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ea4a67f Porting notes: - All rsend and snapshot tests enabled and updated for Linux. - Fix misuse of input argument in traverse_visitbp(). - Fix ISO C90 warnings and errors. - Fix gcc 'missing braces around initializer' in 'struct send_thread_arg to_arg =' warning. - Replace 4 argument fletcher_4_native() with 3 argument version, this change was made in OpenZFS 4185 which has not been ported. - Part of the sections for 'zfs receive' and 'zfs send' was rewritten and reordered to approximate upstream. - Fix mktree xattr creation, 'user.' prefix required. - Minor fixes to newly enabled test cases - Long holds for volumes allowed during receive for minor registration.	2016-06-28 13:47:02 -07:00
Ned Bass	50c957f702	Implement large_dnode pool feature Justification ------------- This feature adds support for variable length dnodes. Our motivation is to eliminate the overhead associated with using spill blocks. Spill blocks are used to store system attribute data (i.e. file metadata) that does not fit in the dnode's bonus buffer. By allowing a larger bonus buffer area the use of a spill block can be avoided. Spill blocks potentially incur an additional read I/O for every dnode in a dnode block. As a worst case example, reading 32 dnodes from a 16k dnode block and all of the spill blocks could issue 33 separate reads. Now suppose those dnodes have size 1024 and therefore don't need spill blocks. Then the worst case number of blocks read is reduced from 33 to two--one per dnode block. In practice spill blocks may tend to be co-located on disk with the dnode blocks so the reduction in I/O would not be this drastic. In a badly fragmented pool, however, the improvement could be significant. ZFS-on-Linux systems that make heavy use of extended attributes would benefit from this feature. In particular, ZFS-on-Linux supports the xattr=sa dataset property which allows file extended attribute data to be stored in the dnode bonus buffer as an alternative to the traditional directory-based format. Workloads such as SELinux and the Lustre distributed filesystem often store enough xattr data to force spill blocks when xattr=sa is in effect. Large dnodes may therefore provide a performance benefit to such systems. Other use cases that may benefit from this feature include files with large ACLs and symbolic links with long target names. Furthermore, this feature may be desirable on other platforms in case future applications or features are developed that could make use of a larger bonus buffer area. Implementation -------------- The size of a dnode may be a multiple of 512 bytes up to the size of a dnode block (currently 16384 bytes). 
A dn_extra_slots field was added to the current on-disk dnode_phys_t structure to describe the size of the physical dnode on disk. The 8 bits for this field were taken from the zero filled dn_pad2 field. The field represents how many "extra" dnode_phys_t slots a dnode consumes in its dnode block. This convention results in a value of 0 for 512 byte dnodes which preserves on-disk format compatibility with older software. Similarly, the in-memory dnode_t structure has a new dn_num_slots field to represent the total number of dnode_phys_t slots consumed on disk. Thus dn->dn_num_slots is 1 greater than the corresponding dnp->dn_extra_slots. This difference in convention was adopted because, unlike on-disk structures, backward compatibility is not a concern for in-memory objects, so we used a more natural way to represent size for a dnode_t. The default size for newly created dnodes is determined by the value of a new "dnodesize" dataset property. By default the property is set to "legacy" which is compatible with older software. Setting the property to "auto" will allow the filesystem to choose the most suitable dnode size. Currently this just sets the default dnode size to 1k, but future code improvements could dynamically choose a size based on observed workload patterns. Dnodes of varying sizes can coexist within the same dataset and even within the same dnode block. For example, to enable automatically-sized dnodes, run # zfs set dnodesize=auto tank/fish The user can also specify literal values for the dnodesize property. These are currently limited to powers of two from 1k to 16k. The power-of-2 limitation is only for simplicity of the user interface. Internally the implementation can handle any multiple of 512 up to 16k, and consumers of the DMU API can specify any legal dnode value. The size of a new dnode is determined at object allocation time and stored as a new field in the znode in-memory structure. 
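The slot and size conventions described above can be summarised in a short sketch. All names here are hypothetical helpers written for illustration, not the functions or macros added by the commit; only the numeric rules (512-byte slots, a 16384-byte dnode block, num_slots being one more than extra_slots, and the power-of-two restriction on literal property values) are taken from the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_DNODE_MIN_SIZE	512	/* one dnode_phys_t slot */
#define DEMO_DNODE_BLOCK_SIZE	16384	/* current dnode block size */

/* In-memory convention: total slots is one more than the on-disk value. */
static unsigned
demo_num_slots(uint8_t extra_slots)
{
	return (extra_slots + 1u);
}

/* Physical dnode size in bytes for a given on-disk extra-slot count. */
static unsigned
demo_size_from_extra(uint8_t extra_slots)
{
	return (demo_num_slots(extra_slots) * DEMO_DNODE_MIN_SIZE);
}

/* Internal rule: any multiple of 512 from 512 up to 16k. */
static bool
demo_size_valid_internal(unsigned size)
{
	return (size >= DEMO_DNODE_MIN_SIZE &&
	    size <= DEMO_DNODE_BLOCK_SIZE &&
	    size % DEMO_DNODE_MIN_SIZE == 0);
}

/* Property rule for literal values: powers of two from 1k to 16k. */
static bool
demo_size_valid_property(unsigned size)
{
	return (size >= 1024 && size <= DEMO_DNODE_BLOCK_SIZE &&
	    (size & (size - 1)) == 0);
}
```

An on-disk value of 0 maps to a single 512-byte slot, which is why older software that always wrote zero into the former padding field still reads as legacy-sized dnodes. A size such as 1536 passes the internal check but would be rejected as a literal property value.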
New DMU interfaces are added to allow the consumer to specify the dnode size that a newly allocated object should use. Existing interfaces are unchanged to avoid having to update every call site and to preserve compatibility with external consumers such as Lustre. The new interfaces names are given below. The versions of these functions that don't take a dnodesize parameter now just call the _dnsize() versions with a dnodesize of 0, which means use the legacy dnode size. New DMU interfaces: dmu_object_alloc_dnsize() dmu_object_claim_dnsize() dmu_object_reclaim_dnsize() New ZAP interfaces: zap_create_dnsize() zap_create_norm_dnsize() zap_create_flags_dnsize() zap_create_claim_norm_dnsize() zap_create_link_dnsize() The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The spa_maxdnodesize() function should be used to determine the maximum bonus length for a pool. These are a few noteworthy changes to key functions: * The prototype for dnode_hold_impl() now takes a "slots" parameter. When the DNODE_MUST_BE_FREE flag is set, this parameter is used to ensure the hole at the specified object offset is large enough to hold the dnode being created. The slots parameter is also used to ensure a dnode does not span multiple dnode blocks. In both of these cases, if a failure occurs, ENOSPC is returned. Keep in mind, these failure cases are only possible when using DNODE_MUST_BE_FREE. If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0. dnode_hold_impl() will check if the requested dnode is already consumed as an extra dnode slot by a large dnode, in which case it returns ENOENT. * The function dmu_object_alloc() advances to the next dnode block if dnode_hold_impl() returns an error for a requested object. This is because the beginning of the next dnode block is the only location it can safely assume to either be a hole or a valid starting point for a dnode. 
* dnode_next_offset_level() and other functions that iterate through dnode blocks may no longer use a simple array indexing scheme. These now use the current dnode's dn_num_slots field to advance to the next dnode in the block. This is to ensure we properly skip the current dnode's bonus area and don't interpret it as a valid dnode. zdb --- The zdb command was updated to display a dnode's size under the "dnsize" column when the object is dumped. For ZIL create log records, zdb will now display the slot count for the object. ztest ----- Ztest chooses a random dnodesize for every newly created object. The random distribution is more heavily weighted toward small dnodes to better simulate real-world datasets. Unused bonus buffer space is filled with non-zero values computed from the object number, dataset id, offset, and generation number. This helps ensure that the dnode traversal code properly skips the interior regions of large dnodes, and that these interior regions are not overwritten by data belonging to other dnodes. A new test visits each object in a dataset. It verifies that the actual dnode size matches what was stored in the ztest block tag when it was created. It also verifies that the unused bonus buffer space is filled with the expected data patterns. ZFS Test Suite -------------- Added six new large dnode-specific tests, and integrated the dnodesize property into existing tests for zfs allow and send/recv. Send/Receive ------------ ZFS send streams for datasets containing large dnodes cannot be received on pools that don't support the large_dnode feature. A send stream with large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be unrecognized by an incompatible receiving pool so that the zfs receive will fail gracefully. While not implemented here, it may be possible to generate a backward-compatible send stream from a dataset containing large dnodes. 
The implementation may be tricky, however, because the send object record for a large dnode would need to be resized to a 512 byte dnode, possibly kicking in a spill block in the process. This means we would need to construct a new SA layout and possibly register it in the SA layout object. The SA layout is normally just sent as an ordinary object record. But if we are constructing new layouts while generating the send stream we'd have to build the SA layout object dynamically and send it at the end of the stream. For sending and receiving between pools that do support large dnodes, the drr_object send record type is extended with a new field to store the dnode slot count. This field was repurposed from unused padding in the structure. ZIL Replay ---------- The dnode slot count is stored in the uppermost 8 bits of the lr_foid field. The bits were unused as the object id is currently capped at 48 bits. Resizing Dnodes --------------- It should be possible to resize a dnode when it is dirtied if the current dnodesize dataset property differs from the dnode's size, but this functionality is not currently implemented. Clearly a dnode can only grow if there are sufficient contiguous unused slots in the dnode block, but it should always be possible to shrink a dnode. Growing dnodes may be useful to reduce fragmentation in a pool with many spill blocks in use. Shrinking dnodes may be useful to allow sending a dataset to a pool that doesn't support the large_dnode feature. Feature Reference Counting -------------------------- The reference count for the large_dnode pool feature tracks the number of datasets that have ever contained a dnode of size larger than 512 bytes. The first time a large dnode is created in a dataset the dataset is converted to an extensible dataset. This is a one-way operation and the only way to decrement the feature count is to destroy the dataset, even if the dataset no longer contains any large dnodes. 
The complexity of reference counting on a per-dnode basis was too high, so we chose to track it on a per-dataset basis similarly to the large_block feature. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3542	2016-06-24 13:13:21 -07:00
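The slot conventions described in the large dnode commit above can be sketched in a few lines of C. This is only an illustration under stated assumptions: every `sketch_` name is hypothetical and is not the ZFS API. It assumes 512-byte slots, an on-disk "extra slots" count that is one less than the in-memory total, and a slot count carried in the uppermost 8 bits of a 64-bit `lr_foid`-style value whose object id fits in 48 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants for illustration; not taken from the ZFS headers. */
#define SKETCH_SLOT_SHIFT   9   /* one dnode_phys_t slot is 512 bytes */
#define SKETCH_OBJ_BITS     48  /* object ids are capped at 48 bits */
#define SKETCH_SLOTS_SHIFT  56  /* uppermost 8 bits of a 64-bit field */

/* On-disk convention: 0 extra slots means a legacy 512-byte dnode. */
static inline unsigned
sketch_num_slots(uint8_t extra_slots)
{
	return ((unsigned)extra_slots + 1);
}

/* Total on-disk size of a dnode, given its extra-slot count. */
static inline unsigned
sketch_dnode_bytes(uint8_t extra_slots)
{
	return (sketch_num_slots(extra_slots) << SKETCH_SLOT_SHIFT);
}

/* Pack a slot count into the top 8 bits of an object-id field. */
static inline uint64_t
sketch_pack_foid(uint64_t obj, uint8_t slots)
{
	assert((obj >> SKETCH_OBJ_BITS) == 0);	/* id must fit in 48 bits */
	return (obj | ((uint64_t)slots << SKETCH_SLOTS_SHIFT));
}

/* Recover the object id (low 48 bits). */
static inline uint64_t
sketch_foid_obj(uint64_t foid)
{
	return (foid & ((1ULL << SKETCH_OBJ_BITS) - 1));
}

/* Recover the slot count (top 8 bits). */
static inline uint8_t
sketch_foid_slots(uint64_t foid)
{
	return ((uint8_t)(foid >> SKETCH_SLOTS_SHIFT));
}
```

Because a zero extra-slot count maps to one slot, a record written by older software (with the padding bits zero filled) reads back as a legacy 512-byte dnode, which is the compatibility property the commit message describes.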
Ned Bass	68cbd56e18	Backfill metadnode more intelligently Only attempt to backfill lower metadnode object numbers if at least 4096 objects have been freed since the last rescan, and at most once per transaction group. This avoids a pathology in dmu_object_alloc() that caused O(N^2) behavior for create-heavy workloads and substantially improves object creation rates. As summarized by @mahrens in #4636: "Normally, the object allocator simply checks to see if the next object is available. The slow calls happened when dmu_object_alloc() checks to see if it can backfill lower object numbers. This happens every time we move on to a new L1 indirect block (i.e. every 32 * 128 = 4096 objects). When re-checking lower object numbers, we use the on-disk fill count (blkptr_t:blk_fill) to quickly skip over indirect blocks that don’t have enough free dnodes (defined as an L2 with at least 393,216 of 524,288 dnodes free). Therefore, we may find that a block of dnodes has a low (or zero) fill count, and yet we can’t allocate any of its dnodes, because they've been allocated in memory but not yet written to disk. In this case we have to hold each of the dnodes and then notice that it has been allocated in memory. The end result is that allocating N objects in the same TXG can require CPU usage proportional to N^2." Add a tunable dmu_rescan_dnode_threshold to define the number of objects that must be freed before a rescan is performed. Don't bother to export this as a module option because testing doesn't show a compelling reason to change it. The vast majority of the performance gain comes from limiting the rescan to at most once per TXG. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-06-24 13:13:12 -07:00
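The rescan policy in the commit above (at least 4096 objects freed since the last rescan, and at most one rescan per transaction group) can be sketched as a small gate function. This is a hedged illustration, not the actual dmu_object_alloc() logic: the struct, its fields, and the function name are all hypothetical, and only the threshold value of 4096 comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bookkeeping for illustration; not the ZFS objset state. */
struct sketch_alloc_state {
	uint64_t freed_since_rescan;	/* objects freed since last rescan */
	uint64_t last_rescan_txg;	/* txg in which we last rescanned */
};

/* Threshold from the commit message: 4096 freed objects. */
static const uint64_t sketch_rescan_threshold = 4096;

/*
 * Return true if a backfill rescan should run in transaction group txg.
 * On true, the bookkeeping is reset so that no further rescan happens
 * in the same txg.
 */
static bool
sketch_should_rescan(struct sketch_alloc_state *st, uint64_t txg)
{
	assert(st != NULL);

	if (st->freed_since_rescan < sketch_rescan_threshold)
		return (false);		/* not enough holes to be worth it */
	if (st->last_rescan_txg == txg)
		return (false);		/* already rescanned in this txg */

	st->freed_since_rescan = 0;
	st->last_rescan_txg = txg;
	return (true);
}
```

The second check is the one the commit message credits with most of the gain: even if many objects are freed, the expensive walk over lower object numbers happens at most once per TXG.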
smh	d14fa5dba1	FreeBSD rS271776 - Persist vdev_resilver_txg changes Persist vdev_resilver_txg changes to avoid panic caused by validation vs a vdev_resilver_txg value from a previous resilver. Authored-by: smh <smh@FreeBSD.org> Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/5154 FreeBSD-issue: https://reviews.freebsd.org/rS271776 FreeBSD-commit: https://github.com/freebsd/freebsd/commit/c3c60bf Closes #4790	2016-06-24 13:02:42 -07:00
Nav Ravindranath	784d15c14c	OpenZFS 6878 - Add scrub completion info to "zpool history" Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Authored by: Nav Ravindranath <nav@delphix.com> Ported-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6878 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1825bc5 Closes #4787	2016-06-24 13:01:31 -07:00
Paul Dagnelie	bc77ba73fe	OpenZFS 6513 - partially filled holes lose birth time Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Boris Protopopov <bprotopopov@hotmail.com> Approved by: Richard Lowe <richlowe@richlowe.net> Ported by: Boris Protopopov <bprotopopov@actifio.com> Signed-off-by: Boris Protopopov <bprotopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6513 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8df0bcf0 If a ZFS object contains a hole at level one, and then a data block is created at level 0 underneath that l1 block, l0 holes will be created. However, these l0 holes do not have the birth time property set; as a result, incremental sends will not send those holes. Fix is to modify the dbuf_read code to fill in birth time data.	2016-06-21 10:55:13 -07:00
Chunwei Chen	100a91aa3e	Fix NFS credential The commit `f74b821` caused a regression where creating a file through NFS always creates a file owned by root. This is because the patch enables the KSID code in zfs_acl_ids_create, which uses the euid and egid of the current process. However, on Linux, we should use fsuid and fsgid for file operations, which is the original behaviour. So we revert this part of the code. The patch also enables secpolicy_vnode_*; since they are also used in file operations, we change them to use fsuid and fsgid. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4772 Closes #4758	2016-06-21 09:58:37 -07:00
Gvozden Neskovic	ab9f4b0b82	SIMD implementation of vdev_raidz generate and reconstruct routines This is a new implementation of RAIDZ1/2/3 routines using x86_64 scalar, SSE, and AVX2 instruction sets. Included are 3 parity generation routines (P, PQ, and PQR) and 7 reconstruction routines, for all RAIDZ levels. On module load, a quick benchmark of supported routines will select the fastest for each operation and they will be used at runtime. Original implementation is still present and can be selected via module parameter. Patch contains: - specialized gen/rec routines for all RAIDZ levels, - new scalar raidz implementation (unrolled), - two x86_64 SIMD implementations (SSE and AVX2 instruction sets), - fastest routines selected on module load (benchmark). - cmd/raidz_test - verify and benchmark all implementations - added raidz_test to the ZFS Test Suite New zfs module parameters: - zfs_vdev_raidz_impl (str): selects the implementation to use. On module load, the parameter will only accept the first 3 options, and the other implementations can be set once the module has finished loading. Possible values for this option are: "fastest" - use the fastest math available "original" - use the original raidz code "scalar" - new scalar impl "sse" - new SSE impl if available "avx2" - new AVX2 impl if available See contents of `/sys/module/zfs/parameters/zfs_vdev_raidz_impl` to get the list of supported values. If an implementation is not supported on the system, it will not be shown. Currently selected option is enclosed in `[]`. Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4328	2016-06-21 09:27:26 -07:00
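The commit above states that the currently selected zfs_vdev_raidz_impl value is enclosed in `[]` when the parameter file is read. A minimal sketch of extracting that value from an already-read string follows. The function name is hypothetical, and the exact layout of the parameter file (assumed here to be space-separated names on one line) is an assumption, not something the commit message specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the token enclosed in the first "[...]" pair of list into out.
 * Returns 0 on success, or -1 if no bracketed token is found or it
 * does not fit in out (including the terminating NUL).
 */
static int
sketch_selected_impl(const char *list, char *out, size_t outlen)
{
	assert(list != NULL && out != NULL && outlen > 0);

	const char *open = strchr(list, '[');
	if (open == NULL)
		return (-1);

	const char *close = strchr(open + 1, ']');
	if (close == NULL)
		return (-1);

	size_t len = (size_t)(close - open - 1);
	if (len >= outlen)
		return (-1);

	memcpy(out, open + 1, len);
	out[len] = '\0';
	return (0);
}
```

A caller would typically read the contents of `/sys/module/zfs/parameters/zfs_vdev_raidz_impl` into a buffer first and then pass that buffer to a helper like this one.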
Tim Chase	09fb30e5e9	Linux 4.6 compat: Fall back to d_prune_aliases() if necessary As of 4.6, the icache and dcache LRUs are memcg aware insofar as the kernel's per-superblock shrinker is concerned. The effect is that dcache or icache entries added by a task in a non-root memcg won't be scanned by the shrinker in the context of the root (or NULL) memcg. This defeats the attempts by zfs_sb_prune() to unpin buffers and can allow metadata to grow uncontrollably. This patch reverts to the d_prune_aliases() method in case the kernel's per-superblock shrinker is not able to free anything. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Closes: #4726	2016-06-17 13:33:49 -07:00
Brian Behlendorf	f74b821a66	Add `zfs allow` and `zfs unallow` support ZFS allows for specific permissions to be delegated to normal users with the `zfs allow` and `zfs unallow` commands. In addition, non-privileged users should be able to run all of the following commands: * zpool [list | iostat | status | get] * zfs [list | get] Historically this functionality was not available on Linux. In order to add it the secpolicy_* functions needed to be implemented and mapped to the equivalent Linux capability. Only then could the permissions on `/dev/zfs` be relaxed and the internal ZFS permission checks used. Even with this change some limitations remain. Under Linux only the root user is allowed to modify the namespace (unless it's a private namespace). This means the mount, mountpoint, canmount, unmount, and remount delegations cannot be supported with the existing code. It may be possible to add this functionality in the future. This functionality was validated with the cli_user and delegation test cases from the ZFS Test Suite. These tests exhaustively verify each of the supported permissions which can be delegated and ensure only an authorized user can perform it. Two minor bug fixes were required for test-running.py. First, the Timer() object cannot be safely created in a `try:` block when there is an unconditional `finally` block which references it. Second, when running as a normal user also check for scripts using both the .ksh and .sh suffixes. Finally, existing users who are simulating delegations by setting group permissions on the /dev/zfs device should revert that customization when updating to a version with this change. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #362 Closes #434 Closes #4100 Closes #4394 Closes #4410 Closes #4487	2016-06-07 09:16:52 -07:00
Chunwei Chen	6a79672530	Fix memleak in vdev_config_generate_stats fnvlist_add_nvlist will copy the contents of nvx, so we need to free it here. unreferenced object 0xffff8800a6934e80 (size 64): comm "zpool", pid 3398, jiffies 4295007406 (age 214.180s) hex dump (first 32 bytes): 60 06 c2 73 00 88 ff ff 00 7c 8c 73 00 88 ff ff `..s.....|.s.... 00 00 00 00 00 00 00 00 40 b0 70 c0 ff ff ff ff ........@.p..... backtrace: [<ffffffff81810c4e>] kmemleak_alloc+0x4e/0xb0 [<ffffffff811fac7d>] __kmalloc_node+0x17d/0x310 [<ffffffffc065528c>] spl_kmem_alloc_impl+0xac/0x180 [spl] [<ffffffffc0657379>] spl_vmem_alloc+0x19/0x20 [spl] [<ffffffffc07056cf>] nv_alloc_sleep_spl+0x1f/0x30 [znvpair] [<ffffffffc07006b7>] nvlist_xalloc.part.13+0x27/0xc0 [znvpair] [<ffffffffc07007ad>] nvlist_alloc+0x3d/0x40 [znvpair] [<ffffffffc0703abc>] fnvlist_alloc+0x2c/0x80 [znvpair] [<ffffffffc07b1783>] vdev_config_generate_stats+0x83/0x370 [zfs] [<ffffffffc07b1f53>] vdev_config_generate+0x4e3/0x650 [zfs] [<ffffffffc07996db>] spa_config_generate+0x20b/0x4b0 [zfs] [<ffffffffc0794f64>] spa_tryimport+0xc4/0x430 [zfs] [<ffffffffc07d11d8>] zfs_ioc_pool_tryimport+0x68/0x110 [zfs] [<ffffffffc07d4fc6>] zfsdev_ioctl+0x646/0x7a0 [zfs] [<ffffffff81232e31>] do_vfs_ioctl+0xa1/0x5b0 [<ffffffff812333b9>] SyS_ioctl+0x79/0x90 Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4707 Issue #4708	2016-05-31 16:05:21 -07:00
Chunwei Chen	06ee0031a6	Fix memleak in zpl_parse_options strsep() will advance tmp_mntopts, and will change it to NULL on last iteration. This will cause strfree(tmp_mntopts) to not free anything. unreferenced object 0xffff8800883976c0 (size 64): comm "mount.zfs", pid 3361, jiffies 4294931877 (age 1482.408s) hex dump (first 32 bytes): 72 77 00 73 74 72 69 63 74 61 74 69 6d 65 00 7a rw.strictatime.z 66 73 75 74 69 6c 00 6d 6e 74 70 6f 69 6e 74 3d fsutil.mntpoint= backtrace: [<ffffffff81810c4e>] kmemleak_alloc+0x4e/0xb0 [<ffffffff811f9cac>] __kmalloc+0x16c/0x250 [<ffffffffc065ce9b>] strdup+0x3b/0x60 [spl] [<ffffffffc080fad6>] zpl_parse_options+0x56/0x300 [zfs] [<ffffffffc080fe46>] zpl_mount+0x36/0x80 [zfs] [<ffffffff81222dc8>] mount_fs+0x38/0x160 [<ffffffff81240097>] vfs_kern_mount+0x67/0x110 [<ffffffff812428e0>] do_mount+0x250/0xe20 [<ffffffff812437d5>] SyS_mount+0x95/0xe0 [<ffffffff8181aff6>] entry_SYSCALL_64_fastpath+0x1e/0xa8 [<ffffffffffffffff>] 0xffffffffffffffff Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4706 Issue #4708	2016-05-31 16:04:26 -07:00
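The leak described in the commit above follows a common strsep() pitfall: the cursor passed to strsep() is advanced and ends up NULL, so freeing the cursor frees nothing. A minimal userspace sketch of the safe pattern is below. It is not the zpl_parse_options() code: the function name is hypothetical, it uses libc strdup()/free() in place of the kernel helpers, and the comma separator is an assumption about the option string format.

```c
#define _DEFAULT_SOURCE	/* strsep() and strdup() on glibc */
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/*
 * Count the non-empty comma-separated options in mntopts.
 * Returns -1 if the working copy cannot be allocated.
 */
static int
sketch_count_options(const char *mntopts)
{
	assert(mntopts != NULL);

	/* strsep() modifies its input, so work on a private copy. */
	char *copy = strdup(mntopts);
	if (copy == NULL)
		return (-1);

	/*
	 * Keep 'copy' untouched for free(); let strsep() advance a
	 * separate cursor, which is NULL after the last token.
	 */
	char *cursor = copy;
	char *tok;
	int count = 0;

	while ((tok = strsep(&cursor, ",")) != NULL) {
		if (*tok != '\0')
			count++;
	}

	free(copy);	/* free the original allocation, not the cursor */
	return (count);
}
```

Passing `cursor` to free() after the loop would be a no-op, since free(NULL) does nothing, and the allocation would leak on every call, which matches the kmemleak report quoted in the commit message.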
Chunwei Chen	540c392793	Fix out-of-bound access in zfs_fillpage The original code will do an out-of-bound access on pl[] during last iteration. ================================================================== BUG: KASAN: stack-out-of-bounds in zfs_getpage+0x14c/0x2d0 [zfs] Read of size 8 by task tmpfile/7850 page:ffffea00017c6dc0 count:0 mapcount:0 mapping: (null) index:0x0 flags: 0xffff8000000000() page dumped because: kasan: bad access detected CPU: 3 PID: 7850 Comm: tmpfile Tainted: G OE 4.6.0+ #3 ffff88005f1b7678 0000000006dbe035 ffff88005f1b7508 ffffffff81635618 ffff88005f1b7678 ffff88005f1b75a0 ffff88005f1b7590 ffffffff81313ee8 ffffea0001ae8dd0 ffff88005f1b7670 0000000000000246 0000000041b58ab3 Call Trace: [<ffffffff81635618>] dump_stack+0x63/0x8b [<ffffffff81313ee8>] kasan_report_error+0x528/0x560 [<ffffffff81278f20>] ? filemap_map_pages+0x5f0/0x5f0 [<ffffffff813144b8>] kasan_report+0x58/0x60 [<ffffffffc12250dc>] ? zfs_getpage+0x14c/0x2d0 [zfs] [<ffffffff81312e4e>] __asan_load8+0x5e/0x70 [<ffffffffc12250dc>] zfs_getpage+0x14c/0x2d0 [zfs] [<ffffffffc1252131>] zpl_readpage+0xd1/0x180 [zfs] [<ffffffff81353c3a>] SyS_execve+0x3a/0x50 [<ffffffff810058ef>] do_syscall_64+0xef/0x180 [<ffffffff81d0ee25>] entry_SYSCALL64_slow_path+0x25/0x25 Memory state around the buggy address: ffff88005f1b7500: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ffff88005f1b7580: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 >ffff88005f1b7600: 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 f4 ^ ffff88005f1b7680: f4 f4 f3 f3 f3 f3 00 00 00 00 00 00 00 00 00 00 ffff88005f1b7700: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ================================================================== Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4705 Issue #4708	2016-05-31 16:01:27 -07:00
GeLiXin	b7faa7aabd	Fix self-healing IO prior to dsl_pool_init() completion Async writes triggered by a self-healing IO may be issued before the pool finishes the process of initialization. This results in a NULL dereference of `spa->spa_dsl_pool` in vdev_queue_max_async_writes(). George Wilson recommended addressing this issue by initializing the passed `dsl_pool_t **` prior to dmu_objset_open_impl(). Since the caller is passing the `spa->spa_dsl_pool` this has the effect of ensuring it's initialized. However, since this depends on the caller knowing they must pass the `spa->spa_dsl_pool` an additional NULL check was added to vdev_queue_max_async_writes(). This guards against any future restructuring of the code which might result in dsl_pool_init() being called differently. Signed-off-by: GeLiXin <47034221@qq.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4652	2016-05-27 14:11:25 -07:00
Tony Hutter	26ef0cc7db	OpenZFS 6531 - Provide mechanism to artificially limit disk performance Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Ported by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6531 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/97e8130 Porting notes: - Added new IO delay tracepoints, and moved common ZIO tracepoint macros to a new trace_common.h file. - Used zio_delay_taskq() in place of OpenZFS's timeout_generic() function. - Updated zinject man page - Updated zpool_scrub test files	2016-05-26 10:11:51 -07:00
Tony Hutter	7e945072d1	Add request size histograms (-r) to zpool iostat, minor man page fix Add -r option to "zpool iostat" to print request size histograms for the leaf ZIOs. This includes histograms of individual ZIOs ("ind") and aggregate ZIOs ("agg"). These stats can be useful for seeing how well the ZFS IO aggregator is working. $ zpool iostat -r mypool sync_read sync_write async_read async_write scrub req_size ind agg ind agg ind agg ind agg ind agg ---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- 512 0 0 0 0 0 0 530 0 0 0 1K 0 0 260 0 0 0 116 246 0 0 2K 0 0 0 0 0 0 0 431 0 0 4K 0 0 0 0 0 0 3 107 0 0 8K 15 0 35 0 0 0 0 6 0 0 16K 0 0 0 0 0 0 0 39 0 0 32K 0 0 0 0 0 0 0 0 0 0 64K 20 0 40 0 0 0 0 0 0 0 128K 0 0 20 0 0 0 0 0 0 0 256K 0 0 0 0 0 0 0 0 0 0 512K 0 0 0 0 0 0 0 0 0 0 1M 0 0 0 0 0 0 0 0 0 0 2M 0 0 0 0 0 0 0 0 0 0 4M 0 0 0 0 0 0 155 19 0 0 8M 0 0 0 0 0 0 0 811 0 0 16M 0 0 0 0 0 0 0 68 0 0 -------------------------------------------------------------------------------- Also rename the stray "-G" in the man page to be "-w" for latency histograms. Signed-off-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #4659	2016-05-25 15:49:35 -07:00
Chunwei Chen	4442f60d8e	Fix arc_prune_task use-after-free arc_prune_task uses a refcount to protect arc_prune_t, but it doesn't prevent the underlying zsb from disappearing if there's a concurrent umount. We fix this by forcing the caller of arc_remove_prune_callback to wait for arc_prune_taskq to finish. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4687 Closes #4690	2016-05-25 14:11:53 -07:00
Chunwei Chen	cbecb4fb22	Skip ctldir znode in zfs_rezget to fix snapdir issues Skip ctldir in zfs_rezget, otherwise they will always get invalidated. This causes odd behaviour for mounted snapdirs. In particular, on Linux >= 3.18, d_invalidate will detach the mountpoint and prevent anyone from automounting it again as long as someone is still using the detached mount. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4514 Closes #4661 Closes #4672	2016-05-23 11:06:56 -07:00
Chunwei Chen	9baaa7deae	Linux 4.7 compat: use iterate_shared for concurrent readdir Register iterate_shared if it exists so the kernel will use a shared lock, allowing concurrent readdir. Also, use a shared lock when doing llseek with SEEK_DATA or SEEK_HOLE to allow concurrent seeking. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4664 Closes #4665	2016-05-20 11:09:16 -07:00
Chunwei Chen	68e8f59afb	Linux 4.7 compat: replace blk_queue_flush with blk_queue_write_cache Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4665	2016-05-20 11:08:55 -07:00
Nikolay Borisov	278f223668	Kill znode->z_gen field This field is a duplicate of the inode->i_generation, so just kill it. Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4538 Closes #4654	2016-05-19 13:06:14 -07:00
Brian Behlendorf	ada8258141	Revert "zhack: Add 'feature disable' command" This reverts commit `8302528617` and `ebecfcd699` which broke the build. While these patches do apply cleanly and passed previous test runs they need to be updated to account for the changes made in commit `241b541574`. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3878	2016-05-17 11:52:07 -07:00

1754 Commits