mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-03-22 08:51:30 +03:00

Author	SHA1	Message	Date
Brian Behlendorf	2db28197fe	Fix cppcheck warning in buf_init() Cppcheck 1.63 erroneously complains about an uninitialized value in buf_init(). Newer versions of cppcheck (1.72) handle this correctly but we'll initialize the value anyway to silence the warning. Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5203	2016-09-30 15:04:21 -07:00
Gvozden Neskovic	6ca636a152	Avoid undefined shift overflow in fzap_cursor_retrieve() Avoid calculating (1<<64) if lh_prefix_len == 0. Semantics of the method remain the same. Assert (lh_prefix_len > 0) in zap_expand_leaf() to detect possibly the same problem. Issue #4883 Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>	2016-09-29 15:55:41 -07:00
Gvozden Neskovic	4ca9c1de12	Explicit integer promotion for bit shift operations Explicitly promote variables to correct type. Undefined behavior is reported because length of int is not well defined by C standard. Issue #4883 Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>	2016-09-29 15:55:41 -07:00
Gvozden Neskovic	031d7c2fe6	fix: Shift exponent too large Undefined operation is reported by running ztest (or zloop) compiled with GCC UndefinedBehaviorSanitizer. Error only happens on top level of dnode indirection with large enough offset values. Logically, left shift operation would work, but bit shift semantics in C, and limitation of uint64_t, do not produce desired result. Issue #5059, #4883 Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>	2016-09-29 15:55:41 -07:00
Isaac Huang	e8ac4557af	Explicit block device plugging when submitting multiple BIOs Without plugging, the default 'noop' scheduler will not merge the BIOs which are part of a large ZIO. Reviewed-by: Andreas Dilger <andreas.dilger@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Isaac Huang <he.huang@intel.com> Closes #5181	2016-09-29 13:13:31 -07:00
cao	c9d61adbf8	Fix coverity defects: 147658, 147652, 147651 coverity scan CID:147658, Type:copy into fixed size buffer. coverity scan CID:147652, Type:copy into fixed size buffer. coverity scan CID:147651, Type:copy into fixed size buffer. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5160	2016-09-29 12:06:14 -07:00
lorddoskias	12fa7f3436	Refactor inode->i_mode management Refactor the code in such a way so that inode->i_mode is being set at the same time zp->z_mode is being changed. This has the effect of keeping both in sync without relying on zfs_inode_update. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Closes #5158	2016-09-27 14:08:52 -07:00
cao	680eada9b0	Fix coverity defects: CID 147650, 147649, 147647, 147646 coverity scan CID:147650, Type:copy into fixed size buffer. coverity scan CID:147649, Type:copy into fixed size buffer. coverity scan CID:147647, Type:copy into fixed size buffer. coverity scan CID:147646, Type:copy into fixed size buffer. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Closes #5161	2016-09-25 15:08:28 -07:00
Brian Behlendorf	7571033285	Fix multilist_create() memory leak In arc_state_fini() the `arc_l2c_only->arcs_list[*]` multilists must be destroyed. This accidentally regressed in `d3c2ae1c`. Reviewed by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #5151 Closes #5152	2016-09-23 10:55:10 -07:00
tuxoko	d5b897a6a1	Linux 4.7 compat: Fix deadlock during lookup on case-insensitive We must not use d_add_ci if the dentry already has the real name. Otherwise, d_add_ci()->d_alloc_parallel() will find itself on the lookup hash and wait on itself causing deadlock. Tested-by: satmandu Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #5124 Closes #5141 Closes #5147 Closes #5148	2016-09-22 19:09:16 -07:00
kernelOfTruth aka. kOT, Gentoo user	51907a31bc	OpenZFS 7230 - add assertions to dmu_send_impl() to verify that stream includes BEGIN and END records Authored by: Matt Krantz <matt.krantz@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: kernelOfTruth <kerneloftruth@gmail.com> OpenZFS-issue: https://www.illumos.org/issues/7230 OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/12b90ee2 Closes #5112	2016-09-22 16:01:19 -07:00
luozhengzheng	160987b576	Fix coverity defects coverity scan CID:147633,type: sizeof not portable coverity scan CID:147637,type: sizeof not portable coverity scan CID:147638,type: sizeof not portable coverity scan CID:147640,type: sizeof not portable In these particular cases sizeof (XX *) happens to be equal to sizeof (X ), but this is not a portable assumption. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Closes #5144	2016-09-21 18:09:00 -07:00
Isaac Huang	da8d57488b	Reduce noise in tracing logs dbuf_read_impl() returns (SET_ERROR(err)) when err can be 0, which adds lots of noise in tracing logs. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Isaac Huang <he.huang@intel.com> Closes #4430 Closes #5146	2016-09-21 13:37:20 -07:00
BearBabyLiu	609603a5d3	Fix coverity defects coverity scan CID:147504 Type: Explicit null dereferenced Reason: passing null pointer dl to zfs_dirent_unlock Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: BearBabyLiu <liu.huang@zte.com.cn> Closes #5131	2016-09-20 19:09:22 -07:00
Tim Chase	25e2ab16be	Fix arc_adjust_meta_balanced() The type of "adjustmnt" was erroneously changed to unsigned when the compressed ARC code was ported in `d3c2ae1c08`. As a result of it being unsigned, the balanced metadata eviction logic would evict all of the non-metadata. Reviewed-by: Chris Severance <github.severach@spamgourmet.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: David Quigley <david.quigley@intel.com> Signed-off-by: Tim Chase <tim@onlight.com> Closes #5128 Closes #5129	2016-09-19 09:28:35 -07:00
luozhengzheng	30f3f2e13c	Fix Coverity defects CID 147659, 150952 and 147645 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Closes #5103	2016-09-17 15:08:54 -07:00
Brian Behlendorf	9ea9e0b9a1	Enable ignore_hole_birth module option by default Enable ignore_hole_birth by default until all known hole birth bugs have been resolved and relevant test cases added. Reviewed-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4809 Closes #5099	2016-09-16 14:05:30 -07:00
Nikolay Borisov	87f9371aef	Simplify time handling logic in zfs_settattr Simplify time handling in zfs_setattr by mimicking the logic in setattr_copy from the linux kernel. In order to achieve this in the case when ZFS' log is being replayed it is necessary to unconditionally set the ctime in zfs_replay_setattr. Also use the timespec_trunc function when assigning values to the generic inode struct. This is currently a noop since zfs sets s_time_gran to 1, however in the future rules about precision might change. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Closes #4916	2016-09-13 12:00:18 -07:00
Nikolay Borisov	9f5f0019ab	Refactor generic inode time updating ZFS doesn't provide a custom update_time method meaning it delegates this job to the generic VFS layer. The only time when it needs to set the various *time values is when the inode is being marshalled to/from the disk. Do this by moving the relevant code from zfs_inode_update_impl to zfs_node_alloc and zfs_rezget. As a result from this change it is no longer necessary to have multiple versions of the zfs_inode_update function - so just nuke them and leave only one. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Issue #227 Closes #4916	2016-09-13 11:57:37 -07:00
Dan Kimmel	524b4217b8	DLPX-44733 combine arc_buf_alloc_impl() with arc_buf_clone() Authored by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Ported by: David Quigley <david.quigley@intel.com> Issue #5078	2016-09-13 09:59:13 -07:00
Tom Caputi	c17bcf83da	Enable raw writes to perform dedup with verification Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: David Quigley <david.quigley@intel.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Issue #5078	2016-09-13 09:59:04 -07:00
Dan Kimmel	2aa34383b9	DLPX-40252 integrate EP-476 compressed zfs send/receive Authored by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Ported by: David Quigley <david.quigley@intel.com> Issue #5078	2016-09-13 09:58:58 -07:00
George Wilson	d3c2ae1c08	OpenZFS 6950 - ARC should cache compressed data Authored by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Ported by: David Quigley <david.quigley@intel.com> This review covers the reading and writing of compressed arc headers, sharing data between the arc_hdr_t and the arc_buf_t, and the implementation of a new dbuf cache to keep frequently access data uncompressed. I've added a new member to l1 arc hdr called b_pdata. The b_pdata always hangs off the arc_buf_hdr_t (if an L1 hdr is in use) and points to the physical block for that DVA. The physical block may or may not be compressed. If compressed arc is enabled and the block on-disk is compressed, then the b_pdata will match the block on-disk and remain compressed in memory. If the block on disk is not compressed, then neither will the b_pdata. Lastly, if compressed arc is disabled, then b_pdata will always be an uncompressed version of the on-disk block. Typically the arc will cache only the arc_buf_hdr_t and will aggressively evict any arc_buf_t's that are no longer referenced. This means that the arc will primarily have compressed blocks as the arc_buf_t's are considered overhead and are always uncompressed. When a consumer reads a block we first look to see if the arc_buf_hdr_t is cached. If the hdr is cached then we allocate a new arc_buf_t and decompress the b_pdata contents into the arc_buf_t's b_data. If the hdr already has a arc_buf_t, then we will allocate an additional arc_buf_t and bcopy the uncompressed contents from the first arc_buf_t to the new one. Writing to the compressed arc requires that we first discard the b_pdata since the physical block is about to be rewritten. The new data contents will be passed in via an arc_buf_t (uncompressed) and during the I/O pipeline stages we will copy the physical block contents to a newly allocated b_pdata. When an l2arc is inuse it will also take advantage of the b_pdata. Now the l2arc will always write the contents of b_pdata to the l2arc. This means that when compressed arc is enabled that the l2arc blocks are identical to those stored in the main data pool. This provides a significant advantage since we can leverage the bp's checksum when reading from the l2arc to determine if the contents are valid. If the compressed arc is disabled, then we must first transform the read block to look like the physical block in the main data pool before comparing the checksum and determining it's valid. OpenZFS-issue: https://www.illumos.org/issues/6950 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7fc10f0 Issue #5078	2016-09-13 09:58:33 -07:00
Tim Chase	43924bfeaa	Remove redundant assignments to arc_c Several assignments to arc_c had no effect because it is ultimately initialized to arc_c_max. This aligns ZoL better with the upstream code which removed these assignments some time ago. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@onlight.com> Closes #5081	2016-09-12 12:40:30 -07:00
Nikolay Borisov	67d6082494	Refactor spa_load_l2cache to make build happy In case sav->sav_config was NULL the body of the function would skip the iteration of the l2 cache devices and will just cleanup the old devices. However, this wasn't very obvious since the null check was performed after the loop body and after the old devices were cleaned. Refactor the code so that it's now obvious when the iteration of the l2cache devices is skipped. This fixes the following cppcheck warning: [module/zfs/spa.c:1552]: (error) Possible null pointer dereference: newvdevs Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Closes #5087	2016-09-12 12:40:03 -07:00
Tim Chase	20aa7a4e31	Free property names with spa_strfree() rather than strfree() Since they're allocated with spa_strdup(), they should be freed with spa_strfree() so the proper length buffer is freed. Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #5082 Closes #5086	2016-09-12 09:45:26 -07:00
Don Brady	d02ca37979	Bring over illumos ZFS FMA logic -- phase 1 This first phase brings over the ZFS SLM module, zfs_mod.c, to handle auto operations in response to disk events. Disk event monitoring is provided from libudev and generates the expected payload schema for zfs_mod. This work leverages the recently added devid and phys_path strings in the vdev label. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@intel.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #4673	2016-09-01 11:39:45 -07:00
luozhengzheng	0b284702b7	Delete unreferenced function zfs_ereport_send_interim_checksum Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5055	2016-09-01 11:39:45 -07:00
luozhengzheng	ca8587a517	kmem_zalloc with KM_SLEEP will never return NULL These allocations can never fail. Leaving the error handling code here gives the impression they can so it has been removed. Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5048	2016-09-01 11:39:45 -07:00
Gvozden Neskovic	ee36c709c3	Performance optimization of AVL tree comparator functions perf: 2.75x faster ddt_entry_compare() First 256bits of ddt_key_t is a block checksum, which are expected to be close to random data. Hence, on average, comparison only needs to look at first few bytes of the keys. To reduce number of conditional jump instructions, the result is computed as: sign(memcmp(k1, k2)). Sign of an integer 'a' can be obtained as: `(0 < a) - (a < 0)` := {-1, 0, 1} , which is computed efficiently. Synthetic performance evaluation of original and new algorithm over 1G random keys on 2.6GHz Intel(R) Xeon(R) CPU E5-2660 v3: old 6.85789 s new 2.49089 s perf: 2.8x faster vdev_queue_offset_compare() and vdev_queue_timestamp_compare() Compute the result directly instead of using conditionals perf: zfs_range_compare() Speedup between 1.1x - 2.5x, depending on compiler version and optimization level. perf: spa_error_entry_compare() `bcmp()` is not suitable for comparator use. Use `memcmp()` instead. perf: 2.8x faster metaslab_compare() and metaslab_rangesize_compare() perf: 2.8x faster zil_bp_compare() perf: 2.8x faster mze_compare() perf: faster dbuf_compare() perf: faster compares in spa_misc perf: 2.8x faster layout_hash_compare() perf: 2.8x faster space_reftree_compare() perf: libzfs: faster avl tree comparators perf: guid_compare() perf: dsl_deadlist_compare() perf: perm_set_compare() perf: 2x faster range_tree_seg_compare() perf: faster unique_compare() perf: faster vdev_cache _compare() perf: faster vdev_uberblock_compare() perf: faster fuid _compare() perf: faster zfs_znode_hold_compare() Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Richard Elling <richard.elling@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5033	2016-08-31 14:35:34 -07:00
Hajo Möller	82ab6848cc	Fix "zpool get guid,freeing,leaked" source `zpool get guid,freeing,leaked` shows SOURCE as `default`, it should be `-` as those props are not editable. Changed code to not overwrite `src` for `ZPOOL_PROP_VERSION`, so it stays `ZPROP_SRC_NONE`. Make src const to avoid future mistakes Signed-off-by: Hajo Möller <dasjoe@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4170	2016-08-30 15:57:15 -07:00
cao	8f50bafb04	Delete unused zfsctl_snapdir_inactive declaration zfsctl_snapdir_inactive is defined in zfs-0.6.3. In zfs-0.6.5.7 this is declaration remains even though the implementation was removed in commit `278bee93`. Removed fastreboot_disable_highpil which is also unused. Signed-off-by: caoxuewen cao.xuewen@zte.com.cn Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5042	2016-08-30 14:33:40 -07:00
Simon Klinkert	db707ad094	OpenZFS 6940 - Cannot unlink directories when over quota From user perspective, I would expect that ZFS is always able to remove files and directories even when the quota is exceeded. Authored by: Simon Klinkert <simon.klinkert@gmail.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6940 OpenZFS-issue: https://www.illumos.org/issues/6334 OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/9918916 Closes #5044	2016-08-30 14:33:04 -07:00
Alexander Motin	755065f3dc	OpenZFS 6322 - ZFS indirect block predictive prefetch For quite some time I was thinking about possibility to prefetch ZFS indirection tables while doing sequential reads or writes. Recent changes in predictive prefetcher made that much easier to do. My tests on zvol with 16KB block size on 5x striped and 2x mirrored pool of 10 disks show almost double throughput on sequential read, and almost tripple on sequential rewrite. While for read alike effect can be received from increasing maximal prefetch distance (though at higher memory cost), for rewrite there is no other solution so far. Authored by: Alexander Motin <mav@freebsd.org> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6322 OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/cb92f413 Closes #5040 Porting notes: - Change from upstream in module/zfs/dbuf.c in 'int dbuf_read' due to commit `5f6d0b6` 'Handle block pointers with a corrupt logical size' - Difference from upstream in module/zfs/dmu_zfetch.c, uint32_t zfetch_max_idistance -> unsigned int zfetch_max_idistance - Variables have been initialized at the beginning of the function (void dmu_zfetch) to resemble the order of occurrence and account for C99, C11 mode errors.	2016-08-30 14:26:55 -07:00
Matthew Ahrens	98ace739bd	OpenZFS 7086 - ztest attempts dva_get_dsize_sync on an embedded blockpointer In dbuf_dirty(), we need to grab the dn_struct_rwlock before looking at the db_blkptr, to prevent it from being changed by syncing context. Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/7086 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/98fa317 Closes #5039	2016-08-30 14:25:50 -07:00
GeLiXin	c40db193a5	Fix: Build warnings with different gcc optimization levels in debug mode This fix resolves warnings reported during compiling with different gcc optimization levels in debug mode, Test tools: gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC) Linux version: 2.6.32-573.18.1.el6.x86_64, Red Hat Enterprise Linux Server release 6.1 (Santiago) List of warnings: CFLAGS=-O1 ./configure --enable-debug ;make ../../module/icp/core/kcf_sched.c: In function ‘kcf_aop_done’: ../../module/icp/core/kcf_sched.c:499: error: ‘fg’ may be used uninitialized in this function ../../module/icp/core/kcf_sched.c:499: note: ‘fg’ was declared here CFLAGS=-Os ./configure --enable-debug ; make libzfs_dataset.c: In function ‘zfs_prop_set_list’: libzfs_dataset.c:1575: error: ‘nvl_len’ may be used uninitialized in this function Signed-off-by: GeLiXin <ge.lixin@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5022	2016-08-29 12:46:18 -07:00
GeLiXin	9907cc1cc8	Add zfs_arc_meta_limit_percent tunable ARC will evict meta buffers that exceed the arc_meta_limit. Before a further investigating on whether we should take special protection on meta buffers, this tunable make arc_meta_limit adjustable for different workloads. People can set zfs_arc_meta_limit_percent to any value while insmod zfs.ko, so some range check is added to guarantee a suitable arc_meta_limit. Suggested by Tim Chase, zfs_arc_dnode_limit is changed to a percent-style tunable as well. Signed-off-by: GeLiXin <ge.lixin@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4957	2016-08-23 13:03:01 -07:00
Tim Chase	3e635ac15c	Prevent reclaim in send_traverse_thread() As is the case with traverse_prefetch_thread(), the deep stacks caused by traversal require disabling reclaim in the send traverse thread. Also, do the same for receive_writer_thread() in which similar problems have been observed. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4912 Closes #4998	2016-08-22 16:12:05 -07:00
Gvozden Neskovic	9cc1844a1d	Linux compat: Grsecurity kernel API Change: Module parameter set/get methods take const parameter in Grsecurity kernel v4.7.1 Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Jason Zaman <jason@perfinion.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4997 Closes #5001	2016-08-22 10:05:45 -07:00
Matthew Ahrens	2bce8049c3	OpenZFS 7004 - dmu_tx_hold_zap() does dnode_hold() 7x on same object Using a benchmark which has 32 threads creating 2 million files in the same directory, on a machine with 16 CPU cores, I observed poor performance. I noticed that dmu_tx_hold_zap() was using about 30% of all CPU, and doing dnode_hold() 7 times on the same object (the ZAP object that is being held). dmu_tx_hold_zap() keeps a hold on the dnode_t the entire time it is running, in dmu_tx_hold_t:txh_dnode, so it would be nice to use the dnode_t that we already have in hand, rather than repeatedly calling dnode_hold(). To do this, we need to pass the dnode_t down through all the intermediate calls that dmu_tx_hold_zap() makes, making these routines take the dnode_t* rather than an objset_t* and a uint64_t object number. In particular, the following routines will need to have analogous *_by_dnode() variants created: dmu_buf_hold_noread() dmu_buf_hold() zap_lookup() zap_lookup_norm() zap_count_write() zap_lockdir() zap_count_write() This can improve performance on the benchmark described above by 100%, from 30,000 file creations per second to 60,000. (This improvement is on top of that provided by working around the object allocation issue. Peak performance of ~90,000 creations per second was observed with 8 CPUs; adding CPUs past that decreased performance due to lock contention.) The CPU used by dmu_tx_hold_zap() was reduced by 88%, from 340 CPU-seconds to 40 CPU-seconds. Sponsored by: Intel Corp. Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/7004 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/109 Closes #4641 Closes #4972	2016-08-19 12:48:03 -07:00
Matthew Ahrens	8bea981504	OpenZFS 7003 - zap_lockdir() should tag hold zap_lockdir() / zap_unlockdir() should take a "void *tag" argument which tags the hold on the zap. This will help diagnose programming errors which misuse the hold on the ZAP. Sponsored by: Intel Corp. Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Pavel Zakharov <pavel.zakha@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/7003 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/108 Closes #4972	2016-08-19 12:35:23 -07:00
heary-cao	ee6370a7a4	Fix spa config generate memory leak in spa_load_best function When spa retry load succeeds and spa recovery is requested it may leak in spa_load_best function. Always free the generated config when it is not assigned to the spa. Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4940	2016-08-19 11:17:12 -07:00
Paul Dagnelie	32d41fb73a	OpenZFS 7176 - Yet another hole birth issue This is another bug in the long line of hole-birth related issues. In this particular case, it was discovered that a previous hole-birth fix (illumos bug 6513, commit `bc77ba73`) did not cover as many cases as we thought it did. While the issue worked in the case of hole-punching (writing zeroes to a large part of a file), it did not deal with truncation, and then writing beyond the new end of the file. The problem is that dbuf_findbp will return ENOENT if the block it's trying to find is beyond the end of the file. If that happens, we assume there is no birth time, and so we lose that information when we write out new blkptrs. We should teach dbuf_findbp to look for things that are beyond the current end, but not beyond the absolute end of the file. Authored by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matthew Ahrens mahrens@delphix.com Reviewed by: George Wilson george.wilson@delphix.com Ported-by: kernelOfTruth <kerneloftruth@gmail.com> Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/7176 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/173/commits/8b9f3ad Upstream-bugs: DLPX-46009 Porting notes: - Fix ISO C90 mixed declaration error in dbuf.c ( int nlevels, epbs; ) ; keep previous position of the initialization	2016-08-18 09:26:44 -07:00
Matthew Ahrens	d9eea113f8	It is not necessary to zero struct dbuf_hold_impl_data Under a workload which makes heavy use of `dbuf_hold()`, I noticed that a considerable amount of time was spent in `dbuf_hold_impl()`, due to its call to `kmem_zalloc(sizeof (struct dbuf_hold_impl_data) * DBUF_HOLD_IMPL_MAX_DEPTH)`, which is around 2KiB. This structure is used as a stack, to limit the size of the C stack as dbuf_hold() calls itself recursively. We make a recursive call to hold the parent's dbuf when the requested dbuf is not found. The vast majority of the time, the parent or grandparent indirect dbuf is cached, so the number of recursive calls is very low. However, we initialize this entire array for every call to dbuf_hold(). To improve performance, this commit changes `dbuf_hold()` to use `kmem_alloc()` instead of `kmem_zalloc()`. __dbuf_hold_impl_init is changed to initialize all members of the struct before they are used. I observed ~5% performance improvement on a workload which creates many files. Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4974	2016-08-16 15:27:17 -07:00
Gvozden Neskovic	fc897b24b2	Rework of fletcher_4 module - Benchmark memory block is increased to 128kiB to reflect real block sizes more accurately. Measurements include all three stages needed for checksum generation, i.e. `init()/compute()/fini()`. The inner loop is repeated multiple times to offset overhead of time function. - Fastest implementation selects native and byteswap methods independently in benchmark. To support this new function pointers `init_byteswap()/fini_byteswap()` are introduced. - Implementation mutex lock is replaced by atomic variable. - To save time, benchmark is not executed in userspace. Instead, highest supported implementation is used for fastest. Default userspace selector is still 'cycle'. - `fletcher_4_native/byteswap()` methods use incremental methods to finish calculation if data size is not multiple of vector stride (currently 64B). - Added `fletcher_4_native_varsize()` special purpose method for use when buffer size is not known in advance. The method does not enforce 4B alignment on buffer size, and will ignore last (size % 4) bytes of the data buffer. - Benchmark `kstat` is changed to match the one of vdev_raidz. It now shows throughput for all supported implementations (in B/s), native and byteswap, as well as the code [fastest] is running. Example of `fletcher_4_bench` running on `Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz`: implementation native byteswap scalar 4768120823 3426105750 sse2 7947841777 4318964249 ssse3 7951922722 6112191941 avx2 13269714358 11043200912 fastest avx2 avx2 Example of `fletcher_4_bench` running on `Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz`: implementation native byteswap scalar 1291115967 1031555336 sse2 2539571138 1280970926 ssse3 2537778746 1080016762 avx2 4950749767 1078493449 avx512f 9581379998 4010029046 fastest avx512f avx512f Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4952	2016-08-16 14:11:55 -07:00
Gvozden Neskovic	70b258fc96	Fletcher4 implementation using avx512f instruction set Algorithm runs 8 parallel sums, consuming 8x uint32_t elements per loop iteration. Size alignment of main fletcher4 methods is adjusted accordingly. New implementation is called 'avx512f'. Note: byteswap method can be implemented more efficiently when avx512bw hardware becomes available. Currently, it is ~ 2x slower than native method. Table shows result of full (native) fletcher4 calculation for different buffer size: fletcher4 4KB 16KB 64KB 128KB 256KB 1MB 16MB -------------------------------------------------------------------- [scalar] 1213 1228 1231 1231 1225 1200 1160 [sse2] 2374 2442 2459 2456 2462 2250 2220 [avx2] 4288 4753 4871 4893 4900 4050 3882 [avx512f] 5975 8445 9196 9221 9262 6307 5620 Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4952	2016-08-16 14:11:14 -07:00
Rich Ercolani	6d836e6f8b	Add tunable to ignore hole_birth Adds a module option which disables the hole_birth optimization which has been responsible for several recent bugs, including issue #4050. Original-patch: https://gist.github.com/pcd1193182/2c0cd47211f3aee623958b4698836c48 Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4833	2016-08-15 09:52:56 -07:00
GeLiXin	e35c5a8265	Fix incorrect pool state after import Import a raidz pool which has a vdev with a bad label, zpool status shows the right state of the dev, but the wrong state of the pool. The pool state should be DEGRADED, not ONLINE. We examine the label in vdev_validate while in spa_load_impl, the bad label can be detected but doesn't propagate its state to the parent. There are other chances to propagate state in the following vdev_load if we failed to load DTL, but our pool is raidz1 which can tolerate a faulted disk. So we lost the last chance to correct the pool state. Propagate the leaf vdev's state to parent if its label was corrupted, as is done elsewhere in vdev_validate. Signed-off-by: GeLiXin <ge.lixin@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@intel.com> Closes #4948	2016-08-12 13:46:51 -07:00
Hans Rosenfeld	fb390aafc8	OpenZFS 5997 - FRU field not set during pool creation and never updated Authored by: Hans Rosenfeld <hans.rosenfeld@nexenta.com> Reviewed by: Dan Fields <dan.fields@nexenta.com> Reviewed by: Josef Sipek <josef.sipek@nexenta.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Signed-off-by: Don Brady <don.brady@intel.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/5997 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1437283 Porting Notes: In addition to the OpenZFS changes this patch realigns the events with those found in OpenZFS. Events which would be logged as sysevents on illumos have been been mapped to the 'sysevent' class for Linux. In addition, several subclass names have been changed to match what is used in OpenZFS. In all cases this means a '.' was changed to an '_' in the subclass. The scripts provided by ZoL have been updated, however users which provide scripts for any of the following events will need to rename them based on the new subclass names. ereport.fs.zfs.config.sync sysevent.fs.zfs.config_sync ereport.fs.zfs.zpool.destroy sysevent.fs.zfs.pool_destroy ereport.fs.zfs.zpool.reguid sysevent.fs.zfs.pool_reguid ereport.fs.zfs.vdev.remove sysevent.fs.zfs.vdev_remove ereport.fs.zfs.vdev.clear sysevent.fs.zfs.vdev_clear ereport.fs.zfs.vdev.check sysevent.fs.zfs.vdev_check ereport.fs.zfs.vdev.spare sysevent.fs.zfs.vdev_spare ereport.fs.zfs.vdev.autoexpand sysevent.fs.zfs.vdev_autoexpand ereport.fs.zfs.resilver.start sysevent.fs.zfs.resilver_start ereport.fs.zfs.resilver.finish sysevent.fs.zfs.resilver_finish ereport.fs.zfs.scrub.start sysevent.fs.zfs.scrub_start ereport.fs.zfs.scrub.finish sysevent.fs.zfs.scrub_finish ereport.fs.zfs.bootfs.vdev.attach sysevent.fs.zfs.bootfs_vdev_attach	2016-08-12 13:06:48 -07:00
Jason Zaman	a3600a106d	icp: mark asm files with noexec stack If there is no explicit note in the .S files, the obj file will mark it as requiring an executable stack. This is unneeded and causes issues on hardened systems. More info: https://wiki.gentoo.org/wiki/Hardened/GNU_stack_quickstart Signed-off-by: Jason Zaman <jason@perfinion.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4947 Closes #4962	2016-08-12 09:51:26 -07:00
Jason Zaman	a9947ce771	icp: add no_const for PaX Compat The constify plugin will automatically constify a class of types that contain only function pointers. The icp structs fail to build if this is enabled with the following error. The no_const attribute makes the plugin skip those structs. module/icp/spi/kcf_spi.c: In function ‘copy_ops_vector_v1’: module/icp/spi/kcf_spi.c:61:16: error: assignment of read-only location ‘dst_ops->cou.cou_v1.co_control_ops’ ((dst)->ops) = *((src)->ops); ^ module/icp/spi/kcf_spi.c:74:2: note: in expansion of macro ‘KCF_SPI_COPY_OPS’ KCF_SPI_COPY_OPS(src_ops, dst_ops, co_control_ops); ^ Signed-off-by: Jason Zaman <jason@perfinion.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4947 Closes #4962	2016-08-12 09:51:19 -07:00
Matthew Ahrens	169ab07cc8	OpenZFS 7263 - deeply nested nvlist can overflow stack nvlist_pack() and nvlist_unpack are implemented recursively, which can cause the stack to overflow with a deeply nested nvlist; i.e. an nvlist which contains an nvlist, which contains an nvlist, which... Unprivileged users can pass an nvlist to the kernel via certain ioctls on /dev/zfs, which the kernel will unpack without additional permission checking or validation. Therefore, an unprivileged user can cause the kernel's stack to overflow and panic. Ideally, these functions would be implemented non-recursively. As a quick fix, this patch limits the depth of the recursion and returns an error when attempting to pack and unpack a deeply-nested nvlist. Signed-off-by: Adam Leventhal <ahl@delphix.com> Signed-off-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Prakash Surya <prakash.surya@delphix.com> OpenZFS-issue: https://www.illumos.org/issues/7263 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0511d6d -	2016-08-11 15:58:03 -07:00
Chen Haiquan	d9c97ec08b	Use file_dentry and file_inode wrappers Fix bugs due to kernel change in torvalds/linux@4bacc9c923 ("overlayfs: Make f_path always point to the overlay and f_inode to the underlay"). This problem crashes system when use zfs as a layer of overlayfs. Signed-off-by: Chen Haiquan <oc@yunify.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4914 Closes #4935	2016-08-11 12:06:37 -07:00
GeLiXin	d5884c3453	Fix indefinite article The indefinite article before nvlist should be "an", not "a". We have 27 "an nvlist" and 7 "a nvlist" in our comment, they should stay the same as we are such a strict filesystem. Signed-off-by: GeLiXin <ge.lixin@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4941	2016-08-11 11:23:49 -07:00
Brian Behlendorf	e5fe9ddeec	Remove custom root pool import code Non-Linux OpenZFS implementations require additional support to be used a root pool. This code should simply be removed to avoid confusion and improve readability. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #4951	2016-08-11 11:19:34 -07:00
Brian Behlendorf	cf41432c70	Linux 4.8 compat: Fix removal of bio->bi_rw member All users of bio->bi_rw have been replaced with compatibility wrappers. This allows the kernel specific logic to be abstracted away, and for each of the supported cases to be documented with the wrapper. The updated interfaces are as follows: * void blk_queue_set_write_cache(struct request_queue , bool, bool) boolean_t bio_is_flush(struct bio ) boolean_t bio_is_fua(struct bio ) boolean_t bio_is_discard(struct bio ) boolean_t bio_is_secure_erase(struct bio ) VDEV_WRITE_FLUSH_FUA Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #4951	2016-08-11 11:19:34 -07:00
Gvozden Neskovic	689f093ebc	Build user-space with different gcc optimization levels This fix resolves warnings reported during compiling of user-space libraries with different gcc optimization levels. Tested with gcc versions: 4.9.2 (Debian), and 6.1.1 (Fedora). The patch enables use of following opt levels: O0, O1, O2, O3, Og, Os, Ofast. List of warnings: [GCC 4.9.2 -Os] libzfs_sendrecv.c:3726:26: error: 'clp' may be used uninitialized in this function [-Werror=maybe-uninitialized] [GCC 4.9.2 -Og] fs_fletcher.c:323:26: error: 'idx' may be used uninitialized in this function [-Werror=maybe-uninitialized] dsl_dataset.c:1290:12: error: 'atp' may be used uninitialized in this function [-Werror=maybe-uninitialized] [GCC 4.9.2 -Ofast] u8_textprep.c:1310:9: error: 'tc[3ul]' may be used uninitialized in this function [-Werror=maybe-uninitialized] u8_textprep.c:177:23: error: 'u8t[0ul]' may be used uninitialized in this function [-Werror=maybe-uninitialized] dsl_dataset.c:2089:37: error: ‘hds’ may be used uninitialized in this function [-Werror=maybe-uninitialized] dsl_dataset.c:3216:2: error: ‘ds’ may be used uninitialized in this function [-Werror=maybe-uninitialized] dsl_dataset.c:1591:2: error: ‘ds’ may be used uninitialized in this function [-Werror=maybe-uninitialized] dsl_dataset.c:3341:2: error: ‘ds’ may be used uninitialized in this function [-Werror=maybe-uninitialized] vdev_raidz.c:1153:8: error: 'dcount[2]' may be used uninitialized in this function [-Werror=maybe-uninitialized] vdev_raidz.c:1167:17: error: 'dst[2]' may be used uninitialized in this function [-Werror=maybe-uninitialized] kernel.c:1005:2: error: ‘resid’ may be used uninitialized in this function [-Werror=maybe-uninitialized] libzfs_dataset.c:2826:8: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized] libzfs_dataset.c:3056:35: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized] libzfs_dataset.c:1584:13: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized] libzfs_dataset.c:3056:35: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized] libzfs_dataset.c:1792:66: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized] libzfs_dataset.c:3986:35: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized] [GCC 6.1.1] Resolved in PR #4907 Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4937	2016-08-09 14:40:35 -07:00
Chunwei Chen	afb6c031e8	Linux 4.7 compat: fix zpl_get_acl returns invalid acl pointer Starting from Linux 4.7, get_acl will set acl cache pointer to temporary sentinel value before calling i_op->get_acl. Therefore we can't compare against ACL_NOT_CACHED and return. Since from Linux 3.14, get_acl already check the cache for us, so we disable this in zpl_get_acl. Linux 4.7 also does set_cached_acl for us so we disable it in zpl_get_acl. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4944 Closes #4946	2016-08-09 10:03:04 -07:00
Brian Behlendorf	4b908d3220	Linux 4.8 compat: posix_acl_valid() The posix_acl_valid() function has been updated to require a user namespace. Filesystem callers should normally provide the user_ns from the super block associcated with the ACL; the zpl_posix_acl_valid() wrapper has been added for this purpose. See https://github.com/torvalds/linux/commit/0d4d717f for complete details. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #4922	2016-08-08 11:46:40 -07:00
Brian Behlendorf	e85a6396b0	Retire HAVE_CURRENT_UMASK and HAVE_POSIX_ACL_CACHING Remove ZFS_AC_KERNEL_CURRENT_UMASK and ZFS_AC_KERNEL_POSIX_ACL_CACHING configure checks, all supported kernel provide this functionality. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #4922	2016-08-08 11:46:32 -07:00
Nikolay Borisov	64aefee1b8	Fix interaction between userns uid/gid and SA * When the uid/gid change is handled in zfs_setattr we want to actually adjust the user passed uid to a KUID and write that to disk. * In trace points use the i_uid member without doing translation, since it has already been performed. * Use kuid in zfs_aclset_common Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4928	2016-08-08 10:47:43 -07:00
Gaurav Kumar	cf2731e65b	arc_meta_limit should be updated when arc_max is changed. When arc_max is increased, arc_meta_limit will not be updated to 3/4 of the new arc_c_max value. This was done originally to preserve any existing maximum value. This turned out to be counter intuitive to users and this fix changes that behavior. If zfs_arc_meta_limit is non-default, it will be picked up later in the ARC tuning function. Signed-off-by: Gaurav Kumar <gaurav.kumar@nutanix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4893	2016-08-02 13:43:36 -07:00
Brian Behlendorf	efe7978d89	Fix gcc self-comparison warning As of gcc 6.1.1 20160621 (Red Hat 6.1.1-3) a self-comparison is detected by gcc in metaslab_alloc(). Resolve the warning by passing a physical size of 0 to BP_SET_BIRTH() as it done by other callers. module/zfs/metaslab.c: In function ‘metaslab_alloc’: module/zfs/metaslab.c:2575:184: error: self-comparison always evaluates to true [-Werror=tautological-compare] Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Issue #4907	2016-08-02 13:14:18 -07:00
Tony Hutter	4eb0db42d3	Fix possible VDEV stats array overflow Fix a possible VDEV statistics array overflow when ZIOs with ZIO_PRIORITY_NOW complete. Signed-off-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4883 Closes #4917	2016-08-02 08:45:24 -07:00
Nikolay Borisov	ba2fe6affb	Move assignment of i_blkbits field Currently i_blkbits is always set to SPA_MINBLOCKSHIFT every time zfs_inode_update_impl is called. Since this value never changes move its assignment to at inode creation time. Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4906	2016-07-29 15:34:12 -07:00
Nikolay Borisov	e334e828a6	Unify license of icp module with the rest of zfs The newly added icp module uses a hardcoded value of CDDL for the license, however in local development one might want to change that to something else in order to facilitate compiling against lock debugging enabled kernel. All modules of the zfs use the ZFS_META_LICNSE string which is replaced with the value held in the META file. One can modify the value in the META file once and then rerun the configure to have all modules' licenses changed. Change the icp module license string to be ZFS_META_LICENSE so that it falls under the same paradigm. Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4905	2016-07-29 15:34:12 -07:00
heary-cao	9f3d1407dc	Fix zfs_allow_log_destroy() NULL dereference In zfs_ioc_log_history() function the tsd_set() function is called with NULL which causes the zfs_allow_log_destroy() to be run. In this case the passed value will be NULL. This is normally entirely safe because strfree() maps directly to kfree() which may be passed a NULL. However, since alternate implementations of strfree() may not handle this gracefully add a check for NULL. Observed under an embedded Linux 2.6.32.41 kernel running the automated testing while running the ZFS Test Suite. Signed-off-by: caoxuewen <cao.xuewen@zte.com.cn> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4872	2016-07-29 15:34:12 -07:00
Chunwei Chen	3b86aeb295	Linux 4.8 compat: REQ_OP and bio_set_op_attrs() New REQ_OP_* definitions have been introduced to separate the WRITE, READ, and DISCARD operations from the flags. This included changing the encoding of bi_rw. It places REQ_OP_* in high order bits and other stuff in low order bits. This encoding is done through the new helper function bio_set_op_attrs. For complete details refer to: https://github.com/torvalds/linux/commit/f215082 https://github.com/torvalds/linux/commit/4e1b2d5 Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4892 Closes #4899	2016-07-29 14:48:19 -07:00
Brian Behlendorf	bbb1b6cea7	Linux 4.8 compat: submit_bio() The rw argument has been removed from submit_bio/submit_bio_wait. Callers are now expected to set bio->bi_rw instead of passing it in. See https://github.com/torvalds/linux/commit/4e49ea4a for complete details. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4892 Issue #4899	2016-07-29 14:48:00 -07:00
Richard Yao	f26b4b3c8a	txg visibility code should not execute under tc_open_lock The memory allocation and locking in `spa_txg_history_*()` can potentially block txg_hold_open for arbitrarily long periods of time. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4333	2016-07-27 14:11:13 -07:00
Brian Behlendorf	fcf64f45d9	Fix zdb crash with 4K-only devices Here's the problem - on 4K native devices in userland on Linux using O_DIRECT, buffers must be 4K aligned or I/O will fail with EINVAL, causing zdb (and others) to coredump. Since userland probably doesn't need optimized buffer caches, we just force 4K alignment on everything. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Closes #4479	2016-07-27 13:38:46 -07:00
Colin Ian King	bf18fd89f9	void integer overflow on computation of refquota_slack DMU_MAX_ACCESS should be cast to a uint64_t otherwise the multiplication of DMU_MAX_ACCESS with spa_asize_inflation will be 32 bit and may lead to an overflow. Currently DMU_MAX_ACCESS is 64 * 1024 * 1024, so spa_asize_inflation being 64 or more will lead to an overflow. Found by static analysis with CoverityScan 0.8.5 CID 150942 (#1 of 1): Unintentional integer overflow (OVERFLOW_BEFORE_WIDEN) overflow_before_widen: Potentially overflowing expression 67108864 * spa_asize_inflation with type int (32 bits, signed) is evaluated using 32-bit arithmetic, and then used in a context that expects an expression of type uint64_t (64 bits, unsigned). Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4889	2016-07-27 13:38:46 -07:00
Gvozden Neskovic	a64f903b06	Fixes for issues found with cppcheck tool The patch fixes small number of errors/false positives reported by `cppcheck`, static analysis tool for C/C++. cppcheck 1.72 $ cppcheck . --force --quiet [cmd/zfs/zfs_main.c:4444]: (error) Possible null pointer dereference: who_perm [cmd/zfs/zfs_main.c:4445]: (error) Possible null pointer dereference: who_perm [cmd/zfs/zfs_main.c:4446]: (error) Possible null pointer dereference: who_perm [cmd/zpool/zpool_iter.c:317]: (error) Uninitialized variable: nvroot [cmd/zpool/zpool_vdev.c:1526]: (error) Memory leak: child [lib/libefi/rdwr_efi.c:1118]: (error) Memory leak: efi_label [lib/libuutil/uu_misc.c:207]: (error) va_list 'args' was opened but not closed by va_end(). [lib/libzfs/libzfs_import.c:1554]: (error) Dangerous usage of 'diskname' (strncpy doesn't always null-terminate it). [lib/libzfs/libzfs_sendrecv.c:3279]: (error) Dereferencing 'cp' after it is deallocated / released [tests/zfs-tests/cmd/file_write/file_write.c:154]: (error) Possible null pointer dereference: operation [tests/zfs-tests/cmd/randfree_file/randfree_file.c:90]: (error) Memory leak: buf [cmd/zinject/zinject.c:1068]: (error) Uninitialized variable: dataset [module/icp/io/sha2_mod.c:698]: (error) Uninitialized variable: blocks_per_int64 Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1392	2016-07-27 13:31:22 -07:00
Tim Chase	25458cbef9	Limit the amount of dnode metadata in the ARC Metadata-intensive workloads can cause the ARC to become permanently filled with dnode_t objects as they're pinned by the VFS layer. Subsequent data-intensive workloads may only benefit from about 25% of the potential ARC (arc_c_max - arc_meta_limit). In order to help track metadata usage more precisely, the other_size metadata arcstat has replaced with dbuf_size, dnode_size and bonus_size. The new zfs_arc_dnode_limit tunable, which defaults to 10% of zfs_arc_meta_limit, defines the minimum number of bytes which is desirable to be consumed by dnodes. Attempts to evict non-metadata will trigger async prune tasks if the space used by dnodes exceeds this limit. The new zfs_arc_dnode_reduce_percent tunable specifies the amount by which the excess dnode space is attempted to be pruned as a percentage of the amount by which zfs_arc_dnode_limit is being exceeded. By default, it tries to unpin 10% of the dnodes. The problem of dnode metadata pinning was observed with the following testing procedure (in this example, zfs_arc_max is set to 4GiB): - Create a large number of small files until arc_meta_used exceeds arc_meta_limit (3GiB with default tuning) and arc_prune starts increasing. - Create a 3GiB file with dd. Observe arc_mata_used. It will still be around 3GiB. - Repeatedly read the 3GiB file and observe arc_meta_limit as before. It will continue to stay around 3GiB. With this modification, space for the 3GiB file is gradually made available as subsequent demands on the ARC are made. The previous behavior can be restored by setting zfs_arc_dnode_limit to the same value as the zfs_arc_meta_limit. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4345 Issue #4512 Issue #4773 Closes #4858	2016-07-25 15:26:38 -07:00
Tim Chase	e6603b7c1f	Fix sync behavior for disk vdevs Prior to `b39c22b`, which was first generally available in the 0.6.5 release as `b39c22b`, ZoL never actually submitted synchronous read or write requests to the Linux block layer. This means the vdev_disk_dio_is_sync() function had always returned false and, therefore, the completion in dio_request_t.dr_comp was never actually used. In `b39c22b`, synchronous ZIO operations were translated to synchronous BIO requests in vdev_disk_io_start(). The follow-on commits `5592404` and `aa159af` fixed several problems introduced by `b39c22b`. In particular, `5592404` introduced the new flag parameter "wait" to __vdev_disk_physio() but under ZoL, since vdev_disk_physio() is never actually used, the wait flag was always zero so the new code had no effect other than to cause a bug in the use of the dio_request_t.dr_comp which was fixed by `aa159af`. The original rationale for introducing synchronous operations in `b39c22b` was to hurry certains requests through the BIO layer which would have otherwise been subject to its unplug timer which would increase the latency. This behavior of the unplug timer, however, went away during the transition of the plug/unplug system between kernels 2.6.32 and 2.6.39. To handle the unplug timer behavior on 2.6.32-2.6.35 kernels the BIO_RW_UNPLUG flag is used as a hint to suppress the plugging behavior. For kernels 2.6.36-2.6.38, the REQ_UNPLUG macro will be available and ise used for the same purpose. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4858	2016-07-25 14:24:47 -07:00
Brian Behlendorf	273ff9b5cc	Fix uninitialized variable in avl_add() Silence the following warning when compiling with gcc 5.4.0. Specifically gcc (Ubuntu 5.4.0-6ubuntu1~16.04.1) 5.4.0 20160609. module/avl/avl.c: In function ‘avl_add’: module/avl/avl.c:647:2: warning: ‘where’ may be used uninitialized in this function [-Wmaybe-uninitialized] avl_insert(tree, new_node, where); Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-07-25 14:21:34 -07:00
Nikolay Borisov	2c6abf15ff	Remove znode's z_uid/z_gid member Remove duplicate z_uid/z_gid member which are also held in the generic vfs inode struct. This is done by first removing the members from struct znode and then using the KUID_TO_SUID/KGID_TO_SGID macros to access the respective member from struct inode. In cases where the uid/gids are being marshalled from/to disk, use the newly introduced zfs_(uid\|gid)_(read\|write) functions to properly save the uids rather than the internal kernel representation. Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4685 Issue #227	2016-07-25 13:21:49 -07:00
Tom Caputi	77943bc1dc	Fix for metaslab_fastwrite_unmark() assert failure Currently there is an issue where metaslab_fastwrite_unmark() unmarks fastwrites on vdev_t's that have never had fastwrites marked on them. The 'fastwrite mark' is essentially a count of outstanding bytes that will be written to a vdev and is used in syncing context. The problem stems from the fact that the vdev_pending_fastwrite field is not being transferred over when replacing a top-level vdev. As a result, the metaslab is marked for fastwrite on the old vdev and unmarked on the new one, which brings the fastwrite count below zero. This fix simply assigns vdev_pending_fastwrite from the old vdev to the new one so this count is not lost. Signed-off-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4267	2016-07-25 13:21:43 -07:00
Tom Caputi	f4bc1bbe11	Fix for compilation error when using the kernel's CONFIG_LOCKDEP Signed-off-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4329	2016-07-21 18:49:50 -07:00
Tom Caputi	0b04990a5d	Illumos Crypto Port module added to enable native encryption in zfs A port of the Illumos Crypto Framework to a Linux kernel module (found in module/icp). This is needed to do the actual encryption work. We cannot use the Linux kernel's built in crypto api because it is only exported to GPL-licensed modules. Having the ICP also means the crypto code can run on any of the other kernels under OpenZFS. I ended up porting over most of the internals of the framework, which means that porting over other API calls (if we need them) should be fairly easy. Specifically, I have ported over the API functions related to encryption, digests, macs, and crypto templates. The ICP is able to use assembly-accelerated encryption on amd64 machines and AES-NI instructions on Intel chips that support it. There are place-holder directories for similar assembly optimizations for other architectures (although they have not been written). Signed-off-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4329	2016-07-20 10:43:30 -07:00
Chunwei Chen	be88e733a6	Fix NULL pointer in zfs_preumount from `1d9b3bd` When zfs_domount fails zsb will be freed, and its caller mount_nodev/get_sb_nodev will do deactivate_locked_super and calls into zfs_preumount. In order to make sure we don't touch any nonexistent stuff, we must make sure s_fs_info is NULL in the fail path so zfs_preumount can easily check that. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4867 Issue #4854	2016-07-20 09:05:43 -07:00
Gvozden Neskovic	26a08b5ca9	RAIDZ parity kstat rework Print table with speed of methods for each implementation. Last line describes contents of [fastest] selection. Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4860	2016-07-19 16:43:07 -07:00
Gvozden Neskovic	c9187d867f	Fixes and enhancements of SIMD raidz parity - Implementation lock replaced with atomic variable - Trailing whitespace is removed from user specified parameter, to enhance experience when using commands that add newline, e.g. `echo` - raidz_test: remove dependency on `getrusage()` and RUSAGE_THREAD, Issue #4813 - silence `cppcheck` in vdev_raidz, partial solution of Issue #1392 - Minor fixes and cleanups - Enable use of original parity methods in [fastest] configuration. New opaque original ops structure, representing native methods, is added to supported raidz methods. Original parity methods are executed if selected implementation has NULL fn pointer. Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4813 Issue #1392	2016-07-19 16:43:07 -07:00
Chunwei Chen	1d9b3bd8fb	Wait iput_async before evict_inodes to prevent race Wait for iput_async before entering evict_inodes in generic_shutdown_super. The reason we must finish before evict_inodes is when lazytime is on, or when zfs_purgedir calls zfs_zget, iput would bump i_count from 0 to 1. This would race with the i_count check in evict_inodes. This means it could destroy the inode while we are still using it. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4854	2016-07-19 09:23:58 -07:00
Tyler J. Stachecki	3d11ecbddd	Prevent segfaults in SSE optimized Fletcher-4 In some cases, the compiler was not respecting the GNU aligned attribute for stack variables in `35a76a0`. This was resulting in a segfault on CentOS 6.7 hosts using gcc 4.4.7-17. This issue was fixed in gcc 4.6. To prevent this from occurring, use unaligned loads and stores for all stack and global memory references in the SSE optimized Fletcher-4 code. Disable zimport testing against master where this flaw exists: TEST_ZIMPORT_VERSIONS="installed" Signed-off-by: Tyler J. Stachecki <stachecki.tyler@gmail.com> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4862	2016-07-19 09:03:44 -07:00
Roman Strashkin	1b87e0f532	Fix filesystem destroy with receive_resume_token It is possible that the given DS may have hidden child (%recv) datasets - "leftovers" resulting from the previously interrupted 'zfs receieve'. Try to remove the hidden child (%recv) and after that try to remove the target dataset. If the hidden child (%recv) does not exist the original error (EEXIST) will be returned. Signed-off-by: Roman Strashkin <roman.strashkin@nexenta.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4818	2016-07-15 15:34:46 -07:00
Tyler J. Stachecki	35a76a0366	Implementation of SSE optimized Fletcher-4 Builds off of `1eeb4562` (Implementation of AVX2 optimized Fletcher-4) This commit adds another implementation of the Fletcher-4 algorithm. It is automatically selected at module load if it benchmarks higher than all other available implementations. The module benchmark was also amended to analyze the performance of the byteswap-ed version of Fletcher-4, as well as the non-byteswaped version. The average performance of the two is used to select the the fastest implementation available on the host system. Adds a pair of fields to an existing zcommon module parameter: - zfs_fletcher_4_impl (str) "sse2" - new SSE2 implementation if available "ssse3" - new SSSE3 implementation if available Signed-off-by: Tyler J. Stachecki <stachecki.tyler@gmail.com> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4789	2016-07-15 10:42:35 -07:00
Chris Dunlop	dfbc86309f	Use native inode->i_nlink instead of znode->z_links A mostly mechanical change, taking into account i_nlink is 32 bits vs ZFS's 64 bit on-disk link count. We revert "xattr dir doesn't get purged during iput" (`ddae16a`) as this is a more Linux-integrated fix for the same issue. In addition, setting the initial link count on a new node has been changed from setting one less than required in zfs_mknode() then incrementing to the correct count in zfs_link_create() (which was somewhat bizarre in the first place), to setting the correct count in zfs_mknode() and not incrementing it in zfs_link_create(). This both means we no longer set the link count in sa_bulk_update() twice (once for the initial incorrect count then again for the correct count), as well as adhering to the Linux requirement of not incrementing a zero link count without I_LINKABLE (see linux commit f4e0c30c). Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #4838 Issue #227	2016-07-14 16:25:34 -07:00
Chunwei Chen	02de3e3c5d	Fix dbuf_stats_hash_table_data race Dropping DBUF_HASH_MUTEX when walking the hash list is unsafe. The dbuf can be freed at any time. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4846	2016-07-14 16:25:04 -07:00
Tim Chase	8887c7d778	Prevent null dereferences when accessing dbuf kstat In arc_buf_info(), the arc_buf_t may have no header. If not, don't try to fetch the arc buffer stats and instead just zero them. The null dereferences were observed while accessing the dbuf kstat with awk on a system in which millions of small files were being created in order to overflow the system's metadata limit. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #4837	2016-07-14 16:25:03 -07:00
Gvozden Neskovic	ae25d22235	Add RAID-Z routines for SSE2 instruction set, in x86_64 mode. The patch covers low-end and older x86 CPUs. Parity generation is equivalent to SSSE3 implementation, but reconstruction is somewhat slower. Previous 'sse' implementation is renamed to 'ssse3' to indicate highest instruction set used. Benchmark results: scalar_rec_p 4 720476442 scalar_rec_q 4 187462804 scalar_rec_r 4 138996096 scalar_rec_pq 4 140834951 scalar_rec_pr 4 129332035 scalar_rec_qr 4 81619194 scalar_rec_pqr 4 53376668 sse2_rec_p 4 2427757064 sse2_rec_q 4 747120861 sse2_rec_r 4 499871637 sse2_rec_pq 4 522403710 sse2_rec_pr 4 464632780 sse2_rec_qr 4 319124434 sse2_rec_pqr 4 205794190 ssse3_rec_p 4 2519939444 ssse3_rec_q 4 1003019289 ssse3_rec_r 4 616428767 ssse3_rec_pq 4 706326396 ssse3_rec_pr 4 570493618 ssse3_rec_qr 4 400185250 ssse3_rec_pqr 4 377541245 original_rec_p 4 691658568 original_rec_q 4 195510948 original_rec_r 4 26075538 original_rec_pq 4 103087368 original_rec_pr 4 15767058 original_rec_qr 4 15513175 original_rec_pqr 4 10746357 Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4783	2016-07-13 10:24:55 -07:00
Gvozden Neskovic	1bf3bf0e29	Fix handling of errors nvlist in zfs_ioc_recv_new() zfs_ioc_recv_impl() is changed to always allocate the 'errors' nvlist, its callers are responsible for freeing it. Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4829	2016-07-13 10:20:08 -07:00
Peng	81edd3e834	Fix PANIC: metaslab_free_dva(): bad DVA X:Y:Z The following scenario can result in garbage in the dn_spill field. The db->db_blkptr must be set to NULL when DNODE_FLAG_SPILL_BLKPTR is clear to ensure the dn_spill field is cleared. Current txg = A. * A new spill buffer is created. Its dbuf is initialized with db_blkptr = NULL and it's dirtied. Current txg = B. * The spill buffer is modified. It's marked as dirty in this txg. * Additional changes make the spill buffer unnecessary because the xattr fits into the bonus buffer, so it's removed. The dbuf is undirtied in this txg, but it's still referenced and cannot be destroyed. Current txg = C. * Starts syncing of txg A * dbuf_sync_leaf() is called for the spill buffer. Since db_blkptr is NULL, dbuf_check_blkptr() is called. * The dbuf starts being written and it reaches the ready state (not done yet). * A new change makes the spill buffer necessary again. sa_build_layouts() ends up calling dbuf_find() to locate the dbuf. It finds the old dbuf because it has not been destroyed yet (it will be destroyed when the previous write is done and there are no more references). The old dbuf has db_blkptr != NULL. * txg A write is complete and the dbuf released. However it's still referenced, so it's not destroyed. Current txg = D. * Starts syncing of txg B * dbuf_sync_leaf() is called for the bonus buffer. Its contents are directly copied into the dnode, overwriting the blkptr area because, in txg B, the bonus buffer was big enough to hold the entire xattr. * At this point, the db_blkptr of the spill buffer used in txg C gets corrupted. Signed-off-by: Peng <peng.hse@xtaotech.com> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3937	2016-07-12 16:47:44 -07:00
Chunwei Chen	31b6111fd9	Kill zp->z_xattr_parent to prevent pinning zp->z_xattr_parent will pin the parent. This will cause huge issue when unlink a file with xattr. Because the unlinked file is pinned, it will never get purged immediately. And because of that, the xattr stuff will never be marked as unlinked. So the whole unlinked stuff will stay there until shrink cache or umount. This change partially reverts `e89260a`. This is safe because only the zp->z_xattr_parent optimization is removed, zpl_xattr_security_init() is still called from the zpl outside the inode lock. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Issue #4359 Issue #3508 Issue #4413 Issue #4827	2016-07-12 14:18:10 -07:00
Chunwei Chen	ddae16a9cf	xattr dir doesn't get purged during iput We need to set inode->i_nlink to zero so iput will purge it. Without this, it will get purged during shrink cache or umount, which would likely result in deadlock due to zfs_zget waiting forever on its children which are in the dispose_list of the same thread. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Issue #4359 Issue #3508 Issue #4413 Issue #4827	2016-07-12 14:04:30 -07:00
Chunwei Chen	6c2530647c	fh_to_dentry should return ESTALE when generation mismatch When generation mismatch, it usually means the file pointed by the file handle was deleted. We should return ESTALE to indicate this. We return ENOENT in zfs_vget since zpl_fh_to_dentry will convert it to ESTALE. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4828	2016-07-12 13:34:15 -07:00
Chunwei Chen	bffb68a2b8	Fix Large kmem_alloc in vdev_metaslab_init This allocation can go way over 1MB, so we should use vmem_alloc instead of kmem_alloc. Large kmem_alloc(1430784, 0x1000), please file an issue... Call Trace: [<ffffffffa0324aff>] ? spl_kmem_zalloc+0xef/0x160 [spl] [<ffffffffa17d0c8d>] ? vdev_metaslab_init+0x9d/0x1f0 [zfs] [<ffffffffa17d46d0>] ? vdev_load+0xc0/0xd0 [zfs] [<ffffffffa17d4643>] ? vdev_load+0x33/0xd0 [zfs] [<ffffffffa17c0004>] ? spa_load+0xfc4/0x1b60 [zfs] [<ffffffffa17c1838>] ? spa_tryimport+0x98/0x430 [zfs] [<ffffffffa17f28b1>] ? zfs_ioc_pool_tryimport+0x41/0x80 [zfs] [<ffffffffa17f5669>] ? zfsdev_ioctl+0x4a9/0x4e0 [zfs] [<ffffffff811bacdf>] ? do_vfs_ioctl+0x2cf/0x4b0 [<ffffffff811baf41>] ? SyS_ioctl+0x81/0xa0 Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4752	2016-07-12 13:34:15 -07:00
Chunwei Chen	7938c2aca7	Don't allow accessing XATTR via export handle Allow accessing XATTR through export handle is a very bad idea. It would allow user to write whatever they want in fields where they otherwise could not. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4828	2016-07-12 13:34:14 -07:00
Chunwei Chen	061460dfe2	Fix get_zfs_sb race with concurrent umount Certain ioctl operations will call get_zfs_sb, which will holds an active count on sb without checking whether it's active or not. This will result in use-after-free. We fix this by using atomic_inc_not_zero to make sure we got an active sb. P1 P2 --- --- deactivate_locked_super(): s_active = 0 zfs_sb_hold() ->get_zfs_sb(): s_active = 1 ->zpl_kill_sb() -->zpl_put_super() --->zfs_umount() ---->zfs_sb_free(zsb) zfs_sb_rele(zsb) Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-07-12 13:34:14 -07:00
Gvozden Neskovic	590c9a0994	Allow building with `CFLAGS="-O0"` If compiled with -O0, gcc doesn't do any stack frame coalescing and -Wframe-larger-than=1024 is triggered in debug mode. Starting with gcc 4.8, new opt level -Og is introduced for debugging, which does not trigger this warning. Fix bench zio size, using SPA_OLD_MAXBLOCKSHIFT Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4799	2016-07-11 16:53:02 -07:00

1 2 3 4 5 ...

1318 Commits