mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-04-17 08:54:52 +03:00

Author	SHA1	Message	Date
Giuseppe Di Natale	8d7f17798d	Linux 4.16 compat: get_disk_and_module() As of https://github.com/torvalds/linux/commit/fb6d47a, get_disk() is now get_disk_and_module(). Add a configure check to determine if we need to use get_disk_and_module(). Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov> Closes #7264	2018-03-14 16:10:38 -07:00
Tony Hutter	6dc40e2ada	Change checksum & IO delay ratelimit values Change checksum & IO delay ratelimit thresholds from 5/sec to 20/sec. This allows zed to actually trigger if a bunch of these events arrive in a short period of time (zed has a threshold of 10 events in 10 sec). Previously, if you had, say, 100 checksum errors in 1 sec, it would get ratelimited to 5/sec which wouldn't trigger zed to fault the drive. Also, convert the checksum and IO delay thresholds to module params for easy testing. Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #7252	2018-03-14 16:10:38 -07:00
chrisrd	792f88131c	Increment zil_itx_needcopy_bytes properly In zil_lwb_commit() with TX_WRITE, we copy the log write record (lrw) into the log write block (lwb) and send it off using zil_lwb_add_txg(). If we also have WR_NEED_COPY, we additionally copy the lwr's data into the lwb to be sent off. If the lwr + data doesn't fit into the lwb, we send the lrw and as much data as will fit (dnow bytes), then go back and do the same with the remaining data. Each time through this loop we're sending dnow data bytes. I.e. zil_itx_needcopy_bytes should be incremented by dnow. Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Closes #6988 Closes #7176	2018-03-14 16:10:38 -07:00
chrisrd	338523dd6e	Fix free memory calculation on v3.14+ Provide infrastructure to auto-configure to enum and API changes in the global page stats used for our free memory calculations. arc_free_memory has been broken since an API change in Linux v3.14: 2016-07-28 v4.8 599d0c95 mm, vmscan: move LRU lists to node 2016-07-28 v4.8 75ef7184 mm, vmstat: add infrastructure for per-node vmstats These commits moved some of global_page_state() into global_node_page_state(). The API change was particularly egregious as, instead of breaking the old code, it silently did the wrong thing and we continued using global_page_state() where we should have been using global_node_page_state(), thus indexing into the wrong array via NR_SLAB_RECLAIMABLE et al. There have been further API changes along the way: 2017-07-06 v4.13 385386cf mm: vmstat: move slab statistics from zone to node counters 2017-09-06 v4.14 c41f012a mm: rename global_page_state to global_zone_page_state ...and various (incomplete, as it turns out) attempts to accomodate these changes in ZoL: 2017-08-24 `2209e409` Linux 4.8+ compatibility fix for vm stats 2017-09-16 `787acae0` Linux 3.14 compat: IO acct, global_page_state, etc 2017-09-19 `661907e6` Linux 4.14 compat: IO acct, global_page_state, etc The config infrastructure provided here resolves these issues going back to the original API change in v3.14 and is robust against further Linux changes in this area. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Closes #7170	2018-03-14 16:10:37 -07:00
Olaf Faaland	2644784f49	Report duration and error in mmp_history entries After an MMP write completes, update the relevant mmp_history entry with the time between submission and completion, and the error status of the write. [faaland1@toss3a zfs]$ cat /proc/spl/kstat/zfs/pool/multihost 39 0 0x01 100 8800 69147946270893 72723903122926 id txg timestamp error duration mmp_delay vdev_guid 10607 1166 1518985089 0 138301 637785455 4882... 10608 1166 1518985089 0 136154 635407747 1151... 10609 1166 1518985089 0 803618560 633048078 9740... 10610 1166 1518985090 0 144826 633048078 4882... 10611 1166 1518985090 0 164527 666187671 1151... Where duration = gethrtime_in_done_fn - gethrtime_at_submission, and error = zio->io_error. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7190	2018-03-14 16:10:37 -07:00
Olaf Faaland	b1f61f05b4	Do not initiate MMP writes while pool is suspended While the pool is suspended on host A, it may be imported on host B. If host A continued to write MMP blocks, it would be blindly overwriting MMP blocks written by host B, and the blocks written by host A would have outdated txg information. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7182	2018-03-14 16:10:37 -07:00
George Wilson	07ce5d7390	OpenZFS 8857 - zio_remove_child() panic due to already destroyed parent zio PROBLEM ======= It's possible for a parent zio to complete even though it has children which have not completed. This can result in the following panic: > $C ffffff01809128c0 vpanic() ffffff01809128e0 mutex_panic+0x58(fffffffffb94c904, ffffff597dde7f80) ffffff0180912950 mutex_vector_enter+0x347(ffffff597dde7f80) ffffff01809129b0 zio_remove_child+0x50(ffffff597dde7c58, ffffff32bd901ac0, ffffff3373370908) ffffff0180912a40 zio_done+0x390(ffffff32bd901ac0) ffffff0180912a70 zio_execute+0x78(ffffff32bd901ac0) ffffff0180912b30 taskq_thread+0x2d0(ffffff33bae44140) ffffff0180912b40 thread_start+8() > ::status debugging crash dump vmcore.2 (64-bit) from batfs0390 operating system: 5.11 joyent_20170911T171900Z (i86pc) image uuid: (not set) panic message: mutex_enter: bad mutex, lp=ffffff597dde7f80 owner=ffffff3c59b39480 thread=ffffff0180912c40 dump content: kernel pages only The problem is that dbuf_prefetch along with l2arc can create a zio tree which confuses the parent zio and allows it to complete with while children still exist. Here's the scenario: zio tree: pio \|--- lio The parent zio, pio, has entered the zio_done stage and begins to check its children to see there are still some that have not completed. In zio_done(), the children are checked in the following order: zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE) zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_DONE) zio_wait_for_children(zio, ZIO_CHILD_DDT, ZIO_WAIT_DONE) zio_wait_for_children(zio, ZIO_CHILD_LOGICAL, ZIO_WAIT_DONE) If pio, finds any child which has not completed then it stops executing and goes to sleep. Each call to zio_wait_for_children() will grab the io_lock while checking the particular child. In this scenario, the pio has completed the first call to zio_wait_for_children() to check for any ZIO_CHILD_VDEV children. Since the only zio in the zio tree right now is the logical zio, lio, then it completes that call and prepares to check the next child type. In the meantime, the lio completes and in its callback creates a child vdev zio, cio. The zio tree looks like this: zio tree: pio \|--- lio \|--- cio The lio then grabs the parent's io_lock and removes itself. zio tree: pio \|--- cio The pio continues to run but has already completed its check for ZIO_CHILD_VDEV and will erroneously complete. When the child zio, cio, completes it will panic the system trying to reference the parent zio which has been destroyed. SOLUTION ======== The fix is to rework the zio_wait_for_children() logic to accept a bitfield for all the children types that it's interested in checking. The io_lock will is held the entire time we check all the children types. Since the function now accepts a bitfield, a simple ZIO_CHILD_BIT() macro is provided to allow for the conversion between a ZIO_CHILD type and the bitfield used by the zio_wiat_for_children logic. Authored by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Youzhong Yang <youzhong@gmail.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Dan McDonald <danmcd@omniti.com> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/8857 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/862ff6d99c Issue #5918 Closes #7168	2018-03-14 16:10:37 -07:00
sanjeevbagewadi	d85011ed69	mmp should use a fixed tag for spa_config locks mmp_write_uberblock() and mmp_write_done() should the same tag for spa_config_locks. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com> Closes #6530 Closes #7155	2018-03-14 16:10:37 -07:00
sanjeevbagewadi	b3da003ebf	Handle zap_add() failures in mixed case mode With "casesensitivity=mixed", zap_add() could fail when the number of files/directories with the same name (varying in case) exceed the capacity of the leaf node of a Fatzap. This results in a ASSERT() failure as zfs_link_create() does not expect zap_add() to fail. The fix is to handle these failures and rollback the transactions. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Chunwei Chen <david.chen@nutanix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com> Closes #7011 Closes #7054	2018-03-14 16:10:37 -07:00
Chunwei Chen	5e566c5772	Fix zle_decompress out of bound access Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #7099	2018-03-14 16:10:36 -07:00
Chunwei Chen	23227313a2	Fix zdb -c traverse stop on damaged objset root If a corruption happens to be on a root block of an objset, zdb -c will not correctly report the error, and it will not traverse the datasets that come after. This is because traverse_visitbp, which does the callback and reset error for TRAVERSE_HARD, is skipped when traversing zil is failed in traverse_impl. Here's example of what 'zdb -eLcc' command looks like on a pool with damaged objset root: == before patch: Traversing all blocks to verify checksums ... Error counts: errno count block traversal size 379392 != alloc 33987072 (unreachable 33607680) bp count: 172 ganged count: 0 bp logical: 1678336 avg: 9757 bp physical: 130560 avg: 759 compression: 12.85 bp allocated: 379392 avg: 2205 compression: 4.42 bp deduped: 0 ref>1: 0 deduplication: 1.00 SPA allocated: 33987072 used: 0.80% additional, non-pointer bps of type 0: 71 Dittoed blocks on same vdev: 101 == after patch: Traversing all blocks to verify checksums ... zdb_blkptr_cb: Got error 52 reading <54, 0, -1, 0> -- skipping Error counts: errno count 52 1 block traversal size 33963520 != alloc 33987072 (unreachable 23552) bp count: 447 ganged count: 0 bp logical: 36093440 avg: 80745 bp physical: 33699840 avg: 75391 compression: 1.07 bp allocated: 33963520 avg: 75981 compression: 1.06 bp deduped: 0 ref>1: 0 deduplication: 1.00 SPA allocated: 33987072 used: 0.80% additional, non-pointer bps of type 0: 76 Dittoed blocks on same vdev: 115 == Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #7099	2018-03-14 16:10:36 -07:00
Brian Behlendorf	310e63dfd1	Linux 4.16 compat: inode_set_iversion() A new interface was added to manipulate the version field of an inode. Add a inode_set_iversion() wrapper for older kernels and use the new interface when available. The i_version field was dropped from the trace point due to the switch to an atomic64_t i_version type. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Chunwei Chen <david.chen@nutanix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7148	2018-03-14 16:10:36 -07:00
WHR	a196b3bc3d	OpenZFS 8966 - Source file zfs_acl.c, function zfs_aclset_common contains a use after end of the lifetime of a local variable Authored by: WHR <msl0000023508@gmail.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: George Melikov <mail@gmelikov.ru> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Richard Lowe <richlowe@richlowe.net> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/8966 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c95549fcdc Closes #7141	2018-03-14 16:10:36 -07:00
Richard Elling	a58e1284d8	Remove deprecated zfs_arc_p_aggressive_disable zfs_arc_p_aggressive_disable is no more. This PR removes docs and module parameters for zfs_arc_p_aggressive_disable. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Richard Elling <Richard.Elling@RichardElling.com> Closes #7135	2018-03-14 16:10:36 -07:00
wli5	5f38142e7b	Bug fix in qat_compress.c for vmalloc addr check Remove the unused vmalloc address check, and function mem_to_page will handle the non-vmalloc address when map it to a physical address. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Weigang Li <weigang.li@intel.com> Closes #7125	2018-03-14 16:10:36 -07:00
John L. Hammond	ecc972c7f0	Emit an error message before MMP suspends pool In mmp_thread(), emit an MMP specific error message before calling zio_suspend() so that the administrator will understand why the pool is being suspended. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: John L. Hammond <john.hammond@intel.com> Closes #7048	2018-03-14 16:10:36 -07:00
Brian Behlendorf	5d62588032	Remove vn_rename and vn_remove dependency The only place vn_rename and vn_remove are used is when writing out an updated pool configuration file. By truncating the file instead of renaming and removing it we can avoid having to implement these interfaces entirely. Functionally an empty cache file is treated the same as a missing cache file. This is particularly advantageous because the Linux kernel has never provided a way to reliably implement vn_rename and vn_remove. The cachefile_004_pos.ksh test case was updated to understand that an empty cache file is the same as a missing one. The zfs-import-* systemd service files were not updated to use ConditionFileNotEmpty in place of ConditionPathExists. This means that after exporting all pools and rebooting new pools will not the scanned for on the next boot. This small change should not impact normal usage since pools are not exported as part of a normal shutdown. Documentation was updated accordingly. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Arkadiusz Bubała <arkadiusz.bubala@open-e.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes zfsonlinux/spl#648 Closes #6753	2018-03-14 16:10:36 -07:00
Alexander Motin	701ebd014a	OpenZFS 8835 - Speculative prefetch in ZFS not working for misaligned reads In case of misaligned I/O sequential requests are not detected as such due to overlaps in logical block sequence: dmu_zfetch(fffff80198dd0ae0, 27347, 9, 1) dmu_zfetch(fffff80198dd0ae0, 27355, 9, 1) dmu_zfetch(fffff80198dd0ae0, 27363, 9, 1) dmu_zfetch(fffff80198dd0ae0, 27371, 9, 1) dmu_zfetch(fffff80198dd0ae0, 27379, 9, 1) dmu_zfetch(fffff80198dd0ae0, 27387, 9, 1) This patch makes single block overlap to be counted as a stream hit, improving performance up to several times. Authored by: Alexander Motin <mav@FreeBSD.org> Approved by: Gordon Ross <gwr@nexenta.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Allan Jude <allanjude@freebsd.org> Reviewed by: Gvozden Neskovic <neskovic@gmail.com> Reviewed by: George Melikov <mail@gmelikov.ru> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/8835 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/aab6dd482a Closes #7062	2018-01-30 10:27:31 -06:00
Alex Zhuravlev	129e3e8dc3	Use zap_count instead of cached z_size for unlink As a performance optimization Lustre does not strictly update the SA_ZPL_SIZE when adding/removing from non-directory entries. This results in entries which cannot be removed through the ZPL layer even though the ZAP is empty and safe to remove. Resolve this issue by checking the zap_count() directly instead on relying on the cached SA_ZPL_SIZE. Micro-benchmarks show no significant performance impact due to the additional overhead of using zap_count(). Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Alex Zhuravlev <alexey.zhuravlev@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7019	2018-01-30 10:27:31 -06:00
Brian Behlendorf	9c1a8eaa51	Fix ARC hit rate When the compressed ARC feature was added in commit `d3c2ae1` the method of reference counting in the ARC was modified. As part of this accounting change the arc_buf_add_ref() function was removed entirely. This would have be fine but the arc_buf_add_ref() function served a second undocumented purpose of updating the ARC access information when taking a hold on a dbuf. Without this logic in place a cached dbuf would not migrate its associated arc_buf_hdr_t to the MFU list. This would negatively impact the ARC hit rate, particularly on systems with a small ARC. This change reinstates the missing call to arc_access() from dbuf_hold() by implementing a new arc_buf_access() function. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6171 Closes #6852 Closes #6989	2018-01-30 10:27:31 -06:00
lidongyang	8d82a19def	Call commit callbacks from the tail of the list Our zfs backed Lustre MDT had soft lockups while under heavy metadata workloads while handling transaction callbacks from osd_zfs. The problem is zfs is not taking advantage of the fast path in Lustre's trans callback handling, where Lustre will skip the calls to ptlrpc_commit_replies() when it already saw a higher transaction number. This patch corrects this, it also has a positive impact on metadata performance on Lustre with osd_zfs, plus some cleanup in the headers. A similar issue for ext4/ldiskfs is described on: https://jira.hpdd.intel.com/browse/LU-6527 Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Li Dongyang <dongyang.li@anu.edu.au> Closes #6986	2018-01-30 10:27:31 -06:00
Brian Behlendorf	aebc5df418	Update for cppcheck v1.80 Resolve new warnings and errors from cppcheck v1.80. * [lib/libshare/libshare.c:543]: (warning) Possible null pointer dereference: protocol * [lib/libzfs/libzfs_dataset.c:2323]: (warning) Possible null pointer dereference: srctype * [lib/libzfs/libzfs_import.c:318]: (error) Uninitialized variable: link * [module/zfs/abd.c:353]: (error) Uninitialized variable: sg * [module/zfs/abd.c:353]: (error) Uninitialized variable: i * [module/zfs/abd.c:385]: (error) Uninitialized variable: sg * [module/zfs/abd.c:385]: (error) Uninitialized variable: i * [module/zfs/abd.c:553]: (error) Uninitialized variable: i * [module/zfs/abd.c:553]: (error) Uninitialized variable: sg * [module/zfs/abd.c:763]: (error) Uninitialized variable: i * [module/zfs/abd.c:763]: (error) Uninitialized variable: sg * [module/zfs/abd.c:305]: (error) Uninitialized variable: tmp_page * [module/zfs/zpl_xattr.c:342]: (warning) Possible null pointer dereference: value * [module/zfs/zvol.c:208]: (error) Uninitialized variable: p Convert the following suppression to inline. * [module/zfs/zfs_vnops.c:840]: (error) Possible null pointer dereference: aiov Exclude HAVE_UIO_ZEROCOPY and HAVE_DNLC from analysis since these macro's will never be defined until this functionality is implemented. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6879	2018-01-30 10:27:31 -06:00
Gvozden Neskovic	a94447ddf3	dmu_objset: release bonus buffer in failure path Reported by kmemleak during testing of a new patch: ``` unreferenced object 0xffff9f1c12e38800 (size 1024): comm "z_upgrade", pid 17842, jiffies 4296870904 (age 8746.268s) backtrace: kmemleak_alloc+0x7a/0x100 __kmalloc_node+0x26c/0x510 range_tree_create+0x39/0xa0 [zfs] dmu_zfetch_init+0x73/0xe0 [zfs] dnode_create+0x12c/0x3b0 [zfs] dnode_hold_impl+0x1096/0x1130 [zfs] dnode_hold+0x23/0x30 [zfs] dmu_bonus_hold_impl+0x6b/0x370 [zfs] dmu_bonus_hold+0x1e/0x30 [zfs] dmu_objset_space_upgrade+0x114/0x310 [zfs] dmu_objset_userobjspace_upgrade_cb+0xd8/0x150 [zfs] dmu_objset_upgrade_task_cb+0x136/0x1e0 [zfs] kthread+0x119/0x150 ``` Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Closes #6575	2018-01-30 10:27:30 -06:00
Chunwei Chen	7192ec7942	Fix zfs_ioc_pool_sync should not use fnvlist Use fnvlist on user input would allow user to easily panic zfs. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #6529	2018-01-30 10:27:30 -06:00
Gvozden Neskovic	06acbbc429	vdev_mirror: load balancing fixes vdev_queue: - Track the last position of each vdev, including the io size, in order to detect linear access of the following zio. - Remove duplicate `vq_lastoffset` vdev_mirror: - Correctly calculate the zio offset (signedness issue) - Deprecate `vdev_queue_register_lastoffset()` - Add `VDEV_LABEL_START_SIZE` to zio offset of leaf vdevs Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Closes #6461	2018-01-30 10:27:30 -06:00
Brian Behlendorf	504bfc8b49	Fix multihost stale cache file import When the multihost property is enabled it should be impossible to import an active pool even using the force (-f) option. This patch prevents a forced import from succeeding when importing with a stale cache file. The root cause of the problem is that the kernel modules trusted the hostid provided in configuration. This is always correct when the configuration is generated by scanning for the pool. However, when using an existing cache file the hostid could be stale which would result in the activity check being skipped. Resolve the issue by always using the hostid read from the label configuration where the best uberblock was found. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6933 Closes #6971	2017-12-18 10:31:01 -08:00
Tony Hutter	36e0ddb744	Revert "Long hold the dataset during upgrade" This reverts commit `a5c8119eba`. The commit (which was modified to remove encryption) was hitting ASSERT(dsl_pool_config_held(dmu_objset_pool(os))) in dmu_objset_upgrade() during automated testing. Signed-off-by: Tony Hutter <hutter2@llnl.gov>	2017-12-06 13:25:40 -06:00
Brian Behlendorf	1030f807ba	Fix NFS sticky bit permission denied error When zfs_sticky_remove_access() was originally adapted for Linux a typo was made which altered the intended behavior. As described in the block comment, the intended behavior is that permission should be granted when the entry is a regular file and you have write access. That is, S_ISREG should have been used instead of S_ISDIR. Restricting permission to regular files made good sense for older systems where setting the bit on executable files would instruct the system to save the program's text segment on the swap device. On modern systems this behavior has been replaced by the sticky bit acting as a restricted deletion flag and the plain file restriction has been relaxed. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6889 Closes #6910	2017-12-04 17:22:47 -08:00
Brian Behlendorf	4a98780933	Preserve itx alloc size for zio_data_buf_free() Using zio_data_buf_alloc() to allocate the itx's may be unsafe because the itx->itx_lr.lrc_reclen field is not constant from allocation to free. Using a different itx->itx_lr.lrc_reclen size in zio_data_buf_free() can result in the allocation being returned to the wrong kmem cache. This issue can be avoided entirely by storing the allocation size in itx->itx_size and using that for zio_data_buf_free(). Reviewed by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6912	2017-12-04 17:21:39 -08:00
LOLi	6db8f1a0d1	Fix 'zfs get {user\|group}objused@' functionality Fix a regression accidentally introduced in `1b81ab4` that prevents 'zfs get {user\|group}objused@' from correctly reporting the requested value. Update "userspace_003_pos.ksh" and "groupspace_003_pos.ksh" to verify this functionality. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6908	2017-12-04 17:21:39 -08:00
Mark Wright	e06711412b	Linux 4.14 compat: CONFIG_GCC_PLUGIN_RANDSTRUCT Fix build errors with gcc 7.2.0 on Gentoo with kernel 4.14 built with CONFIG_GCC_PLUGIN_RANDSTRUCT=y such as: module/nvpair/nvpair.c:2810:2:error: positional initialization of field in ?struct? declared with 'designated_init' attribute [-Werror=designated-init] nvs_native_nvlist, ^~~~~~~~~~~~~~~~~ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mark Wright <gienah@gentoo.org> Closes #5390 Closes #6903	2017-12-04 17:21:39 -08:00
Brian Behlendorf	954516cec1	Emit history events for 'zpool create' History commands and events were being suppressed for the 'zpool create' command since the history object did not yet exist. Create the object earlier so this history doesn't get lost. Split the pool_destroy event in to pool_destroy and pool_export so they may be distinguished. Updated events_001_pos and events_002_pos test cases. They now check for the expected history events and were reworked to be more reliable. Reviewed-by: Nathaniel Clark <nathaniel.l.clark@intel.com> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6712 Closes #6486 Conflicts: tests/zfs-tests/tests/functional/events/events_002_pos.ksh	2017-12-04 17:21:03 -08:00
Brian Behlendorf	841cb5ee2a	Fix dirty check in dmu_offset_next() The correct way to determine if a dnode is dirty is to check if any of the dn->dn_dirty_link's are active. Relying solely on the dn->dn_dirtyctx can result in the dnode being mistakenly reported as clean. Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3125 Closes #6867	2017-11-21 13:11:29 -06:00
LOLi	fedc1d96a8	Fix truncate(2) mtime and ctime handling On Linux, ftruncate(2) always changes the file timestamps, even if the file size is not changed. However, in case of a successfull truncate(2), the timestamps are updated only if the file size changes. This translates to the VFS calling the ZFS Posix Layer "setattr" function (zpl_setattr) with ATTR_MTIME and ATTR_CTIME unconditionally set on the iattr mask only when doing a ftruncate(2), while the truncate(2) is left to the filesystem implementation to be dealt with. This behaviour is consistent with POSIX:2004/SUSv3 specifications where there's no explicit requirement for file size changes to update the timestamps only for ftruncate(2): http://pubs.opengroup.org/onlinepubs/009695399/functions/truncate.html http://pubs.opengroup.org/onlinepubs/009695399/functions/ftruncate.html This has been later updated in POSIX:2008/SUSv4 where, for both truncate(2)/ftruncate(2), there's no mention of this size change requirement: http://austingroupbugs.net/view.php?id=489 http://pubs.opengroup.org/onlinepubs/9699919799/functions/truncate.html http://pubs.opengroup.org/onlinepubs/9699919799/functions/ftruncate.html Unfortunately the Linux VFS is still calling into the ZPL without ATTR_MTIME/ATTR_CTIME set in the truncate(2) case: we fix this by explicitly updating the timestamps when detecting the ATTR_SIZE bit, which is always set in do_truncate(), on the iattr mask. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6811 Closes #6819	2017-11-21 13:11:29 -06:00
benrubson	59511072b4	OpenZFS 7531 - Assign correct flags to prefetched buffers Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Authored by: abraunegg <alex.braunegg@gmail.com> Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/7531 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/468008cb	2017-11-21 13:11:29 -06:00
Arkadiusz Bubała	a5c8119eba	Long hold the dataset during upgrade If the receive or rollback is performed while filesystem is upgrading the objset may be evicted in `dsl_dataset_clone_swap_sync_impl`. This will lead to NULL pointer dereference when upgrade tries to access evicted objset. This commit adds long hold of dataset during whole upgrade process. The receive and rollback will return an EBUSY error until the upgrade is not finished. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Arkadiusz Bubała <arkadiusz.bubala@open-e.com> Closes #5295 Closes #6837	2017-11-21 13:03:21 -06:00
Tim Chase	d7881a6dca	Handle compressed buffers in __dbuf_hold_impl() In __dbuf_hold_impl(), if a buffer is currently syncing and is still referenced from db_data, a copy is made in case it is dirtied again in the txg. Previously, the buffer for the copy was simply allocated with arc_alloc_buf() which doesn't handle compressed or encrypted buffers (which are a special case of a compressed buffer). The result was typically an invalid memory access because the newly-allocated buffer was of the uncompressed size. This commit fixes the problem by handling the 2 compressed cases, encrypted and unencrypted, respectively, with arc_alloc_raw_buf() and arc_alloc_compressed_buf(). Although using the proper allocation functions fixes the invalid memory access by allocating a buffer of the compressed size, another unrelated issue made it impossible to properly detect compressed buffers in the first place. The header's compression flag was set to ZIO_COMPRESS_OFF in arc_write() when it was possible that an attached buffer was actually compressed. This commit adds logic to only set ZIO_COMPRESS_OFF in the non-ZIO_RAW case which wil handle both cases of compressed buffers (encrypted or unencrypted). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #5742 Closes #6797	2017-11-21 13:01:30 -06:00
wli5	9add19b37d	Bug fix in qat_compress.c when compressed size is < 4KB When the 128KB block is compressed to less than 4KB, the pointer to the Footer is not in the end of the compressed buffer, that's because the Header offset was added twice for this case. So there is a gap between the Footer and the compressed buffer. 1. Always compute the Footer pointer address from the start of the last page. 2. Remove the un-used workaroud code which has been verified fixed with the latest driver and this fix. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Weigang Li <weigang.li@intel.com> Closes #6827	2017-11-20 16:48:26 -06:00
wli5	318fdeb51f	Support integration with new QAT products Support integration with new QAT products: Intel(R) C62x Chipset, or Atom(R) C3000 Processor Product Family SoC: 1. Detect new file name in auto-conf. 2. Change MAX_INSTANCES to 48. 3. Change "num_inst" to U16 to clean a build warning. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Weigang Li <weigang.li@intel.com> Closes #6767	2017-11-20 16:19:23 -06:00
Olaf Faaland	d3d20bf442	Reimplement vdev_random_leaf and rename it Rename it as mmp_random_leaf() since it is defined in mmp.c. The earlier implementation could end up spinning forever if a pool had a vdev marked writeable, none of whose children were writeable. It also did not guarantee that if a writeable leaf vdev existed, it would be found. Reimplement to recursively walk the device tree to select the leaf. It searches the entire tree, so that a return value of (NULL) indicates there were no usable leaves in the pool; all were either not writeable or had pending mmp writes. It still chooses the starting child randomly at each level of the tree, so if the pool's devices are healthy, the mmp writes go to random leaves with an even distribution. This was verified by testing using zfs_multihost_history enabled. Reviewed by: Thomas Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #6631 Closes #6665	2017-11-20 16:19:23 -06:00
Fabian Grünbichler	8d688ce66a	Skip FREEOBJECTS for objects which can't exist When sending an incremental stream based on a snapshot, the receiving side must have the same base snapshot. Thus we do not need to send FREEOBJECTS records for any objects past the maximum one which exists locally. This allows us to send incremental streams (again) to older ZFS implementations (e.g. ZoL < 0.7) which actually try to free all objects in a FREEOBJECTS record, instead of bailing out early. Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> Closes #5699 Closes #6507 Closes #6616	2017-10-16 10:57:55 -07:00
Fabian Grünbichler	b544fe4123	Free objects when receiving full stream as clone All objects after the last written or freed object are not supposed to exist after receiving the stream. Free them accordingly, as if a freeobjects record for them had been included in the stream. Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> Closes #5699 Closes #6507 Closes #6616	2017-10-16 10:57:55 -07:00
Brian Behlendorf	91b2f6ab1c	Fix ARC behavior on 32-bit systems With the addition of the ABD changes consumption of the virtual address space has been greatly reduced. This exposed an issue on CONFIG_HIGHMEM systems where free memory was being calculated incorrectly. Functionally this didn't cause any major problems prior to ABD because a lack of available virtual address space was used as an indicator of low memory. This patch makes the following changes to address the issue and in the process realigns the code further with OpenZFS. There are no substantive changes in behavior for 64-bit systems. * Added CONFIG_HIGHMEM case to the arc_all_memory() and arc_free_memory() functions to only consider low memory pages on CONFIG_HIGHMEM systems. * The arc_free_memory() function was updated to return bytes instead of pages to be consistent with the other helper functions. In user space we make up some reasonable values since currently only testing is performed in this context. * Adds three new values to the arcstats kstat to provide visibility in to the ARC's assessment of the memory situation: memory_all_bytes, memory_free_bytes, and memory_available_bytes. * Added kmem_reap() call to arc_available_memory() for 32-bit builds to realign code with OpenZFS. * Reduced size of test file in /async_destroy_001_pos.ksh to speed up test case. Multiple txgs are still required. * Move vdevs used by zpool_clear_001_pos and zpool_upgrade_002_pos to TEST_BASE_DIR location to speed up test cases. Reviewed-by: David Quigley <david.quigley@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5352 Closes #6734	2017-10-16 10:57:55 -07:00
Tobin Harding	80cc2f6111	Remove unnecessary equality check Currently `if` statement includes an assignment (from a function return value) and a equality check. The parenthesis are in the incorrect place, currently the code clobbers the function return value because of this. We can fix this by simplifying the `if` statement. `if (foo != 0)` can be more succinctly expressed as `if (foo)` Remove the equality check, add parenthesis to correct the statement. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Tobin C. Harding <me@tobin.cc> Closes #6685 Close #6719	2017-10-16 10:57:55 -07:00
Isaac Huang	b97948276d	Use linear abd in vdev_copy_uberblocks() The vdev_copy_uberblocks() function should use abd_alloc_linear() to allocate ub_abd, because abd_to_buf(ub_abd)) is used later. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Isaac Huang <he.huang@intel.com> Closes #6718 Closes #6713	2017-10-16 10:57:55 -07:00
Ned Bass	4cfc086e4d	receive_freeobjects() skips freeing some objects When receiving a FREEOBJECTS record, receive_freeobjects() incorrectly skips a freed object in some cases. Specifically, this happens when the first object in the range to be freed doesn't exist, but the second object does. This leaves an object allocated on disk on the receiving side which is unallocated on the sending side, which may cause receiving subsequent incremental streams to fail. The bug was caused by an incorrect increment of the object index variable when current object being freed doesn't exist. The increment is incorrect because incrementing the object index is handled by a call to dmu_object_next() in the increment portion of the for loop statement. Add test case that exposes this bug. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #6694 Closes #6695	2017-10-16 10:57:55 -07:00
chrisrd	25d232f407	Scale the dbuf cache with arc_c Commit `d3c2ae1` introduced a dbuf cache with a default size of the minimum of 100M or 1/32 maximum ARC size. (These figures may be adjusted using dbuf_cache_max_bytes and dbuf_cache_max_shift.) The dbuf cache is counted as metadata for the purposes of ARC size calculations. On a 1GB box the ARC maximum size defaults to c_max 493M which gives a dbuf cache default minimum size of 15.4M, and the ARC metadata defaults to minimum 16M. I.e. the dbuf cache is an significant proportion of the minimum metadata size. With other overheads involved this actually means the ARC metadata doesn't get down to the minimum. This patch dynamically scales the dbuf cache to the target ARC size instead of statically scaling it to the maximum ARC size. (The scale is still set by dbuf_cache_max_shift and the maximum size is still fixed by dbuf_cache_max_bytes.) Using the target ARC size rather than the current ARC size is done to help the ARC reach the target rather than simply focusing on the current size. Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Issue #6506 Closes #6561	2017-10-16 10:57:54 -07:00
Giuseppe Di Natale	bef6a8bc3a	Correct cppcheck errors (#6662 ) ZFS buildbot STYLE builder was moved to Ubuntu 17.04 which has a newer version of cppcheck. Handle the new cppcheck errors. uu_* functions removed in this commit were unused and effectively dead code. They are now retired. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov> Closes #6653	2017-09-20 12:59:21 -07:00
Brian Behlendorf	266b181e75	Increase default arc_c_min Increase the default arc_c_min value to which whichever is larger, either 32M or 1/32 of total system memory. This is advantageous for systems with more than 1G of memory where performance issues may occur when the ARC is allowed to collapse below a minimum size. At the same time we want to use the bare minimum value which is still functional so the filesystem can be used in very low memory environments. Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6659	2017-09-20 10:25:54 -07:00
Brian Behlendorf	c474f5e9a7	Export symbol dmu_tx_mark_netfree() This symbol is needed by Lustre for the same reason it was needed by the ZPL. It should have been exported when the original patch was merged. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Alex Zhuravlev <bzzz@whamcloud.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6660	2017-09-20 10:25:54 -07:00
Brian Behlendorf	661907e6bc	Linux 4.14 compat: IO acct, global_page_state, etc (#6655 ) generic_start_io_acct/generic_end_io_acct in the master branch of the linux kernel requires that the request_queue be provided. Move the logic from freemem in the spl to arc_free_memory in arc.c. Do this so we can take advantage of global_page_state interface checks in zfs. Upstream kernel replaced struct block_device with struct gendisk in struct bio. Determine if the function bio_set_dev exists during configure and have zfs use that if it exists. bio_set_dev https://github.com/torvalds/linux/commit/74d4699 global_node_page_state https://github.com/torvalds/linux/commit/75ef718 io acct https://github.com/torvalds/linux/commit/d62e26b Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov> Closes #6635 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2017-09-19 14:24:34 -07:00
Gaurav Kumar	d3e7d981d4	Modifying XATTRs doesnt change the ctime Changing any metadata, should modify the ctime. Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: gaurkuma <gauravk.18@gmail.com> Closes #3644 Closes #6586	2017-09-13 16:05:18 -07:00
Brian Behlendorf	a2a0440918	Fix volume WR_INDIRECT log replay (#6620 ) The portion of the zvol_replay_write() handler responsible for replaying indirect log records for some reason never existed. As a result indirect log records were not being correctly replayed. This went largely unnoticed since the majority of zvol log records were of the type WR_COPIED or WR_NEED_COPY prior to OpenZFS 7578. This patch updates zvol_replay_write() to correctly handle these log records and adds a new test case which verifies volume replay to prevent any regression. The existing test case which verified replay on filesystem was renamed slog_replay_fs.ksh for clarity. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6603	2017-09-13 16:04:16 -07:00
Giuseppe Di Natale	45d1abc74d	Improved dnode allocation and dmu_hold_impl() (#6611 ) Refactor dmu_object_alloc_dnsize() and dnode_hold_impl() to simplify the code, fix errors introduced by commit `dbeb879` (PR #6117) interacting badly with large dnodes, and improve performance. * When allocating a new dnode in dmu_object_alloc_dnsize(), update the percpu object ID for the core's metadnode chunk immediately. This eliminates most lock contention when taking the hold and creating the dnode. * Correct detection of the chunk boundary to work properly with large dnodes. * Separate the dmu_hold_impl() code for the FREE case from the code for the ALLOCATED case to make it easier to read. * Fully populate the dnode handle array immediately after reading a block of the metadnode from disk. Subsequently the dnode handle array provides enough information to determine which dnode slots are in use and which are free. * Add several kstats to allow the behavior of the code to be examined. * Verify dnode packing in large_dnode_008_pos.ksh. Since the test is purely creates, it should leave very few holes in the metadnode. * Add test large_dnode_009_pos.ksh, which performs concurrent creates and deletes, to complement existing test which does only creates. With the above fixes, there is very little contention in a test of about 200,000 racing dnode allocations produced by tests 'large_dnode_008_pos' and 'large_dnode_009_pos'. name type data dnode_hold_dbuf_hold 4 0 dnode_hold_dbuf_read 4 0 dnode_hold_alloc_hits 4 3804690 dnode_hold_alloc_misses 4 216 dnode_hold_alloc_interior 4 3 dnode_hold_alloc_lock_retry 4 0 dnode_hold_alloc_lock_misses 4 0 dnode_hold_alloc_type_none 4 0 dnode_hold_free_hits 4 203105 dnode_hold_free_misses 4 4 dnode_hold_free_lock_misses 4 0 dnode_hold_free_lock_retry 4 0 dnode_hold_free_overflow 4 0 dnode_hold_free_refcount 4 57 dnode_hold_free_txg 4 0 dnode_allocate 4 203154 dnode_reallocate 4 0 dnode_buf_evict 4 23918 dnode_alloc_next_chunk 4 4887 dnode_alloc_race 4 0 dnode_alloc_next_block 4 18 The performance is slightly improved for concurrent creates with 16+ threads, and unchanged for low thread counts. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov>	2017-09-13 15:46:15 -07:00
dbavatar	89950722c6	Linux 4.8+ compatibility fix for vm stats vm_node_stat must be used instead of vm_zone_stat. Unfortunately the old code still compiles potentially leading to silent failure of arc_evictable_memory() AKAMAI: CR 3816601: Regression in zfs dropcache test Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Closes #6528	2017-09-13 14:21:59 -07:00
LOLi	ae5b4a05ff	Fix range locking in ZIL commit codepath Since OpenZFS 7578 (`1b7c1e5`) if we have a ZVOL with logbias=throughput we will force WR_INDIRECT itxs in zvol_log_write() setting itx->itx_lr offset and length to the offset and length of the BIO from zvol_write()->zvol_log_write(): these offset and length are later used to take a range lock in zillog->zl_get_data function: zvol_get_data(). Now suppose we have a ZVOL with blocksize=8K and push 4K writes to offset 0: we will only be range-locking 0-4096. This means the ASSERTion we make in dbuf_unoverride() is no longer valid because now dmu_sync() is called from zilog's get_data functions holding a partial lock on the dbuf. Fix this by taking a range lock on the whole block in zvol_get_data(). Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6238 Closes #6315 Closes #6356 Closes #6477	2017-08-21 16:46:54 -07:00
LOLi	3468fdbd34	Fix remounting snapshots read-write It's not enough to preserve/restore MS_RDONLY on the superblock flags to avoid remounting a snapshot read-write: be explicit about our intentions to the VFS layer so the readonly bit is updated correctly in do_remount_sb(). Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6510 Closes #6515	2017-08-21 16:46:52 -07:00
Chunwei Chen	aec4318870	Fix NULL pointer when O_SYNC read in snapshot When doing read on a file open with O_SYNC, it will trigger zil_commit. However for snapshot, there's no zil, so we shouldn't be doing that. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #6478 Closes #6494	2017-08-21 16:41:22 -07:00
sanjeevbagewadi	2d9b57d39f	zio_dva_throttle_done() should allow zinjected ZIO If fault injection is enabled, the ZIO_FLAG_IO_RETRY could be set by zio_handle_device_injection() to generate the FMA events and update stats. Hence, ignore the flag and process such zios. A better fix would be to add another flag in the zio_t to indicate that the zio is failed because of a zinject rule. However, considering the fact that we do this in debug bits, we could do with the crude check using the global flag zio_injection_enabled which is set to 1 when zinject records are added. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com> Closes #6383 Closes #6384	2017-08-21 16:41:22 -07:00
Brian Behlendorf	751941e248	Fix dnode allocation race When performing concurrent object allocations using the new multi-threaded allocator and large dnodes it's possible to allocate overlapping large dnodes. This case should have been handled by detecting an error returned by dnode_hold_impl(). But that logic only checked the returned dnp was not-NULL, and the dnp variable was not reset to NULL when retrying. Resolve this issue by properly checking the return value of dnode_hold_impl(). Additionally, it was possible that dnode_hold_impl() would misreport a dnode as free when it was in fact in use. This could occurs for two reasons: * The per-slot zrl_lock must be held over the entire critical section which includes the alloc/free until the new dnode is assigned to children_dnodes. Additionally, all of the zrl_lock's in the range must be held to protect moving dnodes. * The dn->dn_ot_type cannot be solely relied upon to check the type. When allocating a new dnode its type will be DMU_OT_NONE after dnode_create(). Only latter when dnode_allocate() is called will it transition to the new type. This means there's a window when allocating where it can mistaken for a free dnode. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Ned Bass <bass6@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6414 Closes #6439	2017-08-08 10:17:33 -07:00
Ned Bass	ef605a5517	Add debug log entries for failed receive records Log contents of a receive record if an error occurs while writing it out to the pool. This may help determine the cause when backup streams are rejected as invalid. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #6465	2017-08-08 10:17:23 -07:00
Tony Hutter	07cbcd5089	Only record zio->io_delay on reads and writes While investigating https://github.com/zfsonlinux/zfs/issues/6425 I noticed that ioctl ZIOs were not setting zio->io_delay correctly. They would set the start time in zio_vdev_io_start(), but never set the end time in zio_vdev_io_done(), since ioctls skip it and go straight to zio_done(). This was causing spurious "delayed IO" events to appear, which would eventually get rate-limited and displayed as "Missed events" messages in zed. To get around the problem, this patch only sets zio->io_delay for read and write ZIOs, since that's all we care about anyway. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #6425 Closes #6440	2017-08-02 11:37:18 -07:00
LOLi	20c88dc3ef	Fix volmode=none property behavior at import time At import time spa_import() calls zvol_create_minors() directly: with the current implementation we have no way to avoid device node creation when volmode=none. Fix this by enforcing volmode=none directly in zvol_alloc(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6426	2017-08-02 11:21:14 -07:00
Giuseppe Di Natale	bff245dd34	OpenZFS 8508 - Mounting a zpool on 32-bit platforms panics Authored by: Justin Hibbits <chmeeedalf@gmail.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/8508 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/15fc257 Closes #6404	2017-07-26 09:44:21 -07:00
Ned Bass	8740cf4a2f	Add line info and SET_ERROR() to ZFS debug log Redefine the SET_ERROR macro in terms of __dprintf() so the error return codes get logged as both tracepoint events (if tracepoints are enabled) and as ZFS debug log entries. This also allows us to use the same definition of SET_ERROR() in kernel and user space. Define a new debug flag ZFS_DEBUG_SET_ERROR=512 that may be bitwise or'd into zfs_flags. Setting this flag enables both dprintf() and SET_ERROR() messages in the debug log. That is, setting ZFS_DEBUG_SET_ERROR and ZFS_DEBUG_DPRINTF\|ZFS_DEBUG_SET_ERROR are equivalent (this was done for sake of simplicity). Leaving ZFS_DEBUG_SET_ERROR unset suppresses the SET_ERROR() messages which helps avoid cluttering up the logs. To enable SET_ERROR() logging, run: echo 1 > /sys/module/zfs/parameters/zfs_dbgmsg_enable echo 512 > /sys/module/zfs/parameters/zfs_flags Remove the zfs_set_error_class tracepoints event class since SET_ERROR() now uses __dprintf(). This sacrifices a bit of granularity when selecting individual tracepoint events to enable but it makes the code simpler. Include file, function, and line number information in debug log entries. The information is now added to the message buffer in __dprintf() and as a result the zfs_dprintf_class tracepoints event class was changed from a 4 parameter interface to a single parameter. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #6400	2017-07-25 23:09:48 -07:00
Ned Bass	73aac4aa41	Some additional send stream validity checking Check in the DMU whether an object record in a send stream being received contains an unsupported dnode slot count, and return an error if it does. Failure to catch an unsupported dnode slot count would result in a panic when the SPA attempts to increment the reference count for the large_dnode feature and the pool has the feature disabled. This is not normally an issue for a well-formed send stream which would have the DMU_BACKUP_FEATURE_LARGE_DNODE flag set if it contains large dnodes, so it will be rejected as unsupported if the required feature is disabled. This change adds a missing object record field validation. Add missing stream feature flag checks in dmu_recv_resume_begin_check(). Consolidate repetitive comment blocks in dmu_recv_begin_check(). Update zstreamdump to print the dnode slot count (dn_slots) for an object record when running in verbose mode. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #6396	2017-07-25 18:52:40 -07:00
Brian Behlendorf	3f759c0c73	Fix 'zpool clear' on suspended pools 'zpool clear' should be able to resume I/O on suspended, but otherwise healthy, pools. `4a283c7` accidentally introduced a new code path where we call txg_wait_synced() on the suspended pool before we had the chance to resume I/O via zio_resume(): this results in the 'zpool clear' command hanging indefinitely, waiting for a TXG that cannot be synced. Fix this by avoiding the call to txg_wait_synced(). Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6399	2017-07-25 12:20:52 -07:00
Olaf Faaland	e889f0f520	Report MMP_STATE_NO_HOSTID immediately There is no need to perform the activity check before detecting that the user must set the system hostid, because the pool's multihost property is on, but spa_get_hostid() returned 0. The initial call to vdev_uberblock_load() provided the information required. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #6388	2017-07-25 13:22:28 -04:00
Olaf Faaland	0582e40322	Add callback for zfs_multihost_interval Add a callback to wake all running mmp threads when zfs_multihost_interval is changed. This is necessary when the interval is changed from a very large value to a significantly lower one, while pools are imported that have the multihost property enabled. Without this commit, the mmp thread does not wake up and detect the new interval until after it has waited the old multihost interval time. A user monitoring mmp writes via the provided kstat would be led to believe that the changed setting did not work. Added a test in the ZTS under mmp to verify the new functionality is working. Added a test to ztest which starts and stops mmp threads, and calls into the code to signal sleeping mmp threads, to test for deadlocks or similar locking issues. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #6387	2017-07-25 13:22:20 -04:00
Olaf Faaland	ffb195c256	Release SCL_STATE in map_write_done() The config lock must be held for the duration of the MMP write. Since the I/Os are executed via map_nowait(), the done function is the only place where we know the write has completed. Since SCL_STATE is taken as reader, overlapping I/Os do not create a deadlock. The refcount is simply increased when new I/Os are queued and decreased when I/Os complete. Test case added which exercises the probe IO call path to verify the fix and prevent a regression. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #6394	2017-07-25 12:25:05 -04:00
Olaf Faaland	f43615d0cc	Revert Fix vdev_probe() call wrt SCL_STATE_ALL This reverts commit `cc9c6bc`, which has been causing intermittent test failures on buildbot. A correct fix for this locking issue has been applied in a separate patch. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov>	2017-07-25 12:24:42 -04:00
Olaf Faaland	b6e5c40382	Use correct macro for hz in mmp.c Commit `379ca9c` Multi-modifier protection (MMP) used HZ to convert nanoseconds to ticks for use with cv_timedwait() and ddi_get_lbolt(). The correct macro is hz, which is defined within the SPL for kernel space, and within zfs_context.h for user space. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #6357 Closes #6360	2017-07-24 11:22:10 -07:00
Giuseppe Di Natale	802ae562ed	Fix coverity defects: CID 165755 CID 165755: Division or modulo by zero (DIVIDE_BY_ZERO) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov> Closes #6352	2017-07-24 11:16:58 -07:00
Brian Behlendorf	36ba27e9e0	Linux 4.13 compat: bio->bi_status and blk_status_t Commit torvalds/linux@4e4cbee9. The bio->bi_error field was replaced with bio->bi_status which is an enum that describes all possible error types. Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6351	2017-07-23 19:37:12 -07:00
Brian Behlendorf	cc9c6bcb73	Fix vdev_probe() call outside SCL_STATE_ALL lock When an IO fails then zio_vdev_io_done() can call vdev_probe() to determine the health of the vdev. This is safe as long as the original zio was submitted with zio_wait() and holds the SCL_STATE_ALL lock over the operation. If zio_no_wait() was used then the done callback will submit the probe IO outside the SCL_STATE_ALL lock and hit this ASSERT in zio_create() ASSERT(!vd \|\| spa_config_held(spa, SCL_STATE_ALL, RW_READER)); Resolve the issue by only allowing vdev_probe() to be called when there's a waiter indicating the caller is using zio_wait(). This assumes that caller is still holding SCL_STATE_ALL. This issue isn't MMP specific but was surfaced when testing. Without this patch it can be reproduced by running: zpool set multihost on <pool> zinject -d <vdev> -e io -T write -f 50 <pool> -L uber Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@intel.com> Closes #745 Closes #6279	2017-07-13 13:54:10 -04:00
Olaf Faaland	379ca9cf2b	Multi-modifier protection (MMP) Add multihost=on\|off pool property to control MMP. When enabled a new thread writes uberblocks to the last slot in each label, at a set frequency, to indicate to other hosts the pool is actively imported. These uberblocks are the last synced uberblock with an updated timestamp. Property defaults to off. During tryimport, find the "best" uberblock (newest txg and timestamp) repeatedly, checking for change in the found uberblock. Include the results of the activity test in the config returned by tryimport. These results are reported to user in "zpool import". Allow the user to control the period between MMP writes, and the duration of the activity test on import, via a new module parameter zfs_multihost_interval. The period is specified in milliseconds. The activity test duration is calculated from this value, and from the mmp_delay in the "best" uberblock found initially. Add a kstat interface to export statistics about Multiple Modifier Protection (MMP) updates. Include the last synced txg number, the timestamp, the delay since the last MMP update, the VDEV GUID, the VDEV label that received the last MMP update, and the VDEV path. Abbreviated output below. $ cat /proc/spl/kstat/zfs/mypool/multihost 31 0 0x01 10 880 105092382393521 105144180101111 txg timestamp mmp_delay vdev_guid vdev_label vdev_path 20468 261337 250274925 68396651780 3 /dev/sda 20468 261339 252023374 6267402363293 1 /dev/sdc 20468 261340 252000858 6698080955233 1 /dev/sdx 20468 261341 251980635 783892869810 2 /dev/sdy 20468 261342 253385953 8923255792467 3 /dev/sdd 20468 261344 253336622 042125143176 0 /dev/sdab 20468 261345 253310522 1200778101278 2 /dev/sde 20468 261346 253286429 0950576198362 2 /dev/sdt 20468 261347 253261545 96209817917 3 /dev/sds 20468 261349 253238188 8555725937673 3 /dev/sdb Add a new tunable zfs_multihost_history to specify the number of MMP updates to store history for. By default it is set to zero meaning that no MMP statistics are stored. When using ztest to generate activity, for automated tests of the MMP function, some test functions interfere with the test. For example, the pool is exported to run zdb and then imported again. Add a new ztest function, "-M", to alter ztest behavior to prevent this. Add new tests to verify the new functionality. Tests provided by Giuseppe Di Natale. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Ned Bass <bass6@llnl.gov> Reviewed-by: Andreas Dilger <andreas.dilger@intel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #745 Closes #6279	2017-07-13 13:54:00 -04:00
Dave Eddy	12fa0466df	OpenZFS 6939 - add sysevents to zfs core for commands Authored by: Dave Eddy <dave@daveeddy.com> Reviewed by: Patrick Mooney <patrick.mooney@joyent.com> Reviewed by: Joshua M. Clulow <jmc@joyent.com> Reviewed by: Josh Wilsdon <jwilsdon@joyent.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed by: Alan Somers <asomers@gmail.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Approved by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6939 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ce1577b Closes #6328	2017-07-12 21:28:13 -07:00
LOLi	cf8738d853	Add port of FreeBSD 'volmode' property The volmode property may be set to control the visibility of ZVOL block devices. This allow switching ZVOL between three modes: full - existing fully functional behaviour (default) dev - hide partitions on ZVOL block devices none - not exposing volumes outside ZFS Additionally the new zvol_volmode module parameter can be used to control the default behaviour. This functionality can be used, for instance, on "backup" pools to avoid cluttering /dev with unneeded zd* devices. Original-patch-by: mav <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> FreeBSD-commit: https://github.com/freebsd/freebsd/commit/dd28e6bb Closes #1796 Closes #3438 Closes #6233	2017-07-12 13:05:37 -07:00
Yuri Pankov	e19572e4cc	OpenZFS 5428 - provide fts(), reallocarray(), and strtonum() Authored by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Robert Mustacchi <rm@joyent.com> Approved by: Joshua M. Clulow <josh@sysmgr.org> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Porting Notes: * All hunks unrelated to ZFS were dropped. OpenZFS-issue: https://www.illumos.org/issues/5428 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4585130 Closes #6326	2017-07-08 20:35:35 -07:00
Matthew Ahrens	2ade4a99f0	OpenZFS 8126 - ztest assertion failed in dbuf_dirty due to dn_nlevels changing The sync thread is concurrently modifying dn_phys->dn_nlevels while dbuf_dirty() is trying to assert something about it, without holding the necessary lock. We need to move this assertion further down in the function, after we have acquired the dn_struct_rwlock. Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/8126 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0ef125d Closes #6314	2017-07-07 14:58:33 -07:00
Matthew Ahrens	a896468c78	OpenZFS 8067 - zdb should be able to dump literal embedded block pointer Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Yuri Pankov <yuri.pankov@gmail.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/8067 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8173085 Closes #6319	2017-07-07 11:28:01 -07:00
LOLi	92e43c1718	Fix 'zpool clear' on readonly pools Illumos 4080 inadvertently allows 'zpool clear' on readonly pools: fix this by reintroducing a check (POOL_CHECK_READONLY) in zfs_ioc_clear registration code. Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6306	2017-07-07 10:39:53 -07:00
Alek P	0ea05c64f8	Implemented zpool scrub pause/resume Currently, there is no way to pause a scrub. Pausing may be useful when the pool is busy with other I/O to preserve bandwidth. This patch adds the ability to pause and resume scrubbing. This is achieved by maintaining a persistent on-disk scrub state. While the state is 'paused' we do not scrub any more blocks. We do however perform regular scan housekeeping such as freeing async destroyed and deadlist blocks while paused. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Thomas Caputi <tcaputi@datto.com> Reviewed-by: Serapheim Dimitropoulos <serapheimd@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alek Pinchuk <apinchuk@datto.com> Closes #6167	2017-07-06 22:16:13 -07:00
Arkadiusz Bubała	94b25662c5	Reschedule processes on -ERESTARTSYS On the single core machine the system may hang when the spa_namespare_lock acquisition fails in the zvol_first_open function. It returns -ERESTARTSYS error what causes the endless loop in __blkdev_get function. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Arkadiusz Bubała <arkadiusz.bubala@open-e.com> Closes #6283 Closes #6312	2017-07-06 08:38:24 -07:00
Matthew Ahrens	02dc43bc46	OpenZFS 8378 - crash due to bp in-memory modification of nopwrite block Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> The problem is that zfs_get_data() supplies a stale zgd_bp to dmu_sync(), which we then nopwrite against. zfs_get_data() doesn't hold any DMU-related locks, so after it copies db_blkptr to zgd_bp, dbuf_write_ready() could change db_blkptr, and dbuf_write_done() could remove the dirty record. dmu_sync() then sees the stale BP and that the dbuf it not dirty, so it is eligible for nop-writing. The fix is for dmu_sync() to copy db_blkptr to zgd_bp after acquiring the db_mtx. We could still see a stale db_blkptr, but if it is stale then the dirty record will still exist and thus we won't attempt to nopwrite. OpenZFS-issue: https://www.illumos.org/issues/8378 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3127742 Closes #6293	2017-07-04 15:41:24 -07:00
Andriy Gapon	8ca78ab002	OpenZFS 7600 - zfs rollback should pass target snapshot to kernel Authored by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> The existing kernel-side code only provides a method to rollback to a latest snapshot, whatever it happens to be at the time when the rollback is actually done. That could be unsafe or confusing in environments where concurrent DSL changes are possible as the resulting state could correspond to a newer or older snapshot than the originally requested one. This change allows to amend that method such that the rollback is performed only when the latest snapshot has a specific name. That is, if a new snapshot is concurrently created or the target snapshot is destroyed, then no rollback is done and EXDEV error is returned. New libzfs_core function lzc_rollback_to() is provided for the new functionality. libzfs is changed to use lzc_rollback_to() to implement zfs rollback command. Perhaps we should return different errors to distinguish the case where the desired snapshot exists but it's not the latest snapshot and the case where the desired snapshot does not exist. OpenZFS-issue: https://www.illumos.org/issues/7600 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3d645eb Closes #6292	2017-07-04 15:29:52 -07:00
Andriy Gapon	018503911c	OpenZFS 7910 - l2arc_write_buffers() may write beyond target_sz Authored by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/7910 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/cb6af4b Closes #6291	2017-07-04 15:28:58 -07:00
Matthew Ahrens	c6f6767eea	OpenZFS 8377 - Panic in bookmark deletion Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> The problem is that when dsl_bookmark_destroy_check() is executed from open context (the pre-check), it fills in dbda_success based on the existence of the bookmark. But the bookmark (or containing filesystem as in this case) can be destroyed before we get to syncing context. When we re-run dsl_bookmark_destroy_check() in syncing context, it will not add the deleted bookmark to dbda_success, intending for dsl_bookmark_destroy_sync() to not process it. But because the bookmark is still in dbda_success from the open-context call, we do try to destroy it. The fix is that dsl_bookmark_destroy_check() should not modify dbda_success when called from open context. OpenZFS-issue: https://www.illumos.org/issues/8377 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/b0b6fe3 Closes #6286	2017-06-30 11:11:01 -07:00
Matthew Ahrens	817b1b6e7b	Clean up large dnode code Resolves issues discovered when porting to OpenZFS. * Lint warnings. * Made dnode_move_impl() large dnode aware. This functionality is currently unused on Linux. Reviewed-by: Ned Bass <bass6@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #6262	2017-06-29 10:18:03 -07:00
chrisrd	b8a97fb101	Set arc_meta_limit, arc_dnode_limit on change Make zfs_arc_meta_limit_percent and zfs_arc_dnode_limit_percent behave as you would expect from zfs-module-parameters.5. - recalculate arc_meta_limit if zfs_arc_meta_limit_percent changes - recalculate arc_dnode_limit if zfs_arc_dnode_limit_percent changes - correctly set arc_meta_limit and arc_dnode_limit if zfs_arc_max or zfs_arc_meta_min changes Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Closes #6269	2017-06-29 09:57:27 -07:00
Tony Hutter	682ce104cd	GCC 7.1 fixes GCC 7.1 with will warn when we're not checking the snprintf() return code in cases where the buffer could be truncated. This patch either checks the snprintf return code (where applicable), or simply disables the warnings (ztest.c). Reviewed-by: Chunwei Chen <david.chen@osnexus.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #6253	2017-06-28 10:05:16 -07:00
Brian Behlendorf	2d678f779a	Cap maximum aggregate IO size Commit `8542ef8` allowed optional IOs to be aggregated beyond the specified aggregation limit. Since the aggregation limit was also used to enforce the maximum block size, setting `zfs_vdev_aggregation_limit=16777216` could result in an attempt to allocate an ABD larger than 16M. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6259 Closes #6270	2017-06-27 10:09:16 -07:00
Boris Protopopov	58404a73db	Refine use of zv_state_lock. Use zv_state_lock to protect all members of zvol_state structure, add relevant ASSERT()s. Take zv_suspend_lock before zv_state_lock, do not hold zv_state_lock across suspend/resume. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Closes #6226	2017-06-27 12:51:44 -04:00
Giuseppe Di Natale	82710e993a	OpenZFS 5220 - L2ARC does not support devices that do not provide 512B access Authored by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Approved by: Dan McDonald <danmcd@joyent.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/5220 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/403a8da Closes #6260	2017-06-26 17:32:43 -07:00
Giuseppe Di Natale	d12f91fde3	OpenZFS 8264 - want support for promoting datasets in libzfs_core Authored by: Andrew Stormont <astormont@racktopsystems.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan McDonald <danmcd@kebe.com> Approved by: Dan McDonald <danmcd@kebe.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/8264 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a4b8c9a Closes #6254	2017-06-26 16:56:09 -07:00
Boris Protopopov	03928896e1	Call cv_signal() with mutex held In bqueue_dequeue(), call cv_signal() with bq_lock held. Re-enable rsend_009_pos to test the fix. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Closes #5887	2017-06-26 14:36:49 -07:00
Morgan Jones	d9ad3fea3b	Add kpreempt_disable/enable around CPU_SEQID uses In zfs/dmu_object and icp/core/kcf_sched, the CPU_SEQID macro should be surrounded by `kpreempt_disable` and `kpreempt_enable` calls to avoid a Linux kernel BUG warning. These code paths use the cpuid to minimize lock contention and is is safe to reschedule the process to a different processor at any time. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Morgan Jones <me@numin.it> Closes #6239	2017-06-19 09:43:16 -07:00
Don Brady	0241e491a0	Inject zinject(8) a percentage amount of dev errs In the original form of device error injection, it was an all or nothing situation. To help simulate intermittent error conditions, you can now specify a real number percentage value. This is also very useful for our ZFS fault diagnosis testing and for injecting intermittent errors during load testing. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@intel.com> Closes #6227	2017-06-16 17:21:11 -07:00
Boris Protopopov	ef4be34a64	Avoid 'queue not locked' warning at pool import. Use queue_flag_set_unlocked() in zvol_alloc(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Issue #6226	2017-06-15 11:16:19 -07:00
LOLi	97f8d7961e	Fix zvol_state_t->zv_open_count race `5559ba0` added zv_state_lock to protect zvol_state_t internal data: this, however, doesn't guard zv->zv_open_count and zv->zv_disk->private_data in zvol_remove_minors_impl(). Fix this by taking zv->zv_state_lock before we check its zv_open_count. P1 (z_zvol) P2 (systemd-udevd) --- --- zvol_remove_minors_impl() : zv->zv_open_count==0 zvol_open() ->mutex_enter(zv_state_lock) : zv->zv_open_count++ ->mutex_exit(zv_state_lock) ->mutex_enter(zv->zv_state_lock) ->zvol_remove(zv) ->mutex_exit(zv->zv_state_lock) : zv->zv_disk->private_data = NULL ->zvol_free() -->ASSERT(zv->zv_open_count==0) * zvol_release() : zv = disk->private_data ->ASSERT(zv && zv->zv_open_count>0) * --- --- * ASSERT() fails Reviewed by: Boris Protopopov <bprotopopov@hotmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6213	2017-06-15 11:08:45 -07:00

1 2 3 4 5 ...

1624 Commits