mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-05-25 03:37:45 +03:00

Author	SHA1	Message	Date
Matthew Ahrens	905edb405d	Illumos 5347 - idle pool may run itself out of space 5347 idle pool may run itself out of space Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Reviewed by: Richard Elling <richard.elling@richardelling.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://github.com/illumos/illumos-gate/commit/231aab8 https://github.com/illumos/illumos-gate/commit/4a92375 3642 https://www.illumos.org/issues/5347 https://github.com/zfsonlinux/zfs/commit/89b1cd6 (partial commit & fix) https://github.com/zfsonlinux/zfs/commit/fbeddd6 Illumos 4390 https://github.com/zfsonlinux/zfs/commit/2696dfa Illumos 3642, 3643 Porting notes: This is completing the partial fix from FreeBSD Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3586	2015-07-14 10:35:21 -07:00
Alexander Eremin	1cddb8c9ff	Illumos 5610 - zfs clone from different source and target pools produces coredump 5610 zfs clone from different source and target pools produces coredump Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://github.com/illumos/illumos-gate/commit/03b1c29 https://www.illumos.org/issues/5610 https://www.illumos.org/issues/5824 https://github.com/zfsonlinux/zfs/issues/2911 https://github.com/zfsonlinux/zfs/commit/9063f65 Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3584	2015-07-14 10:27:46 -07:00
Richard Yao	0de7c552b6	Failure of userland copy should return EFAULT Many key internal functions pass system return codes that are safe to return to userland. In the case of ddi_copyin(9F), an error passes -1 and the documentation states very clearly that drivers should pass EFAULT to userland when this happens. http://illumos.org/man/9F/ddi_copyin This does not happen in the ZFS source code. I believe it should be changed to pass EFAULT. I caught this when writing man pages for the libzfs_core API. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3575	2015-07-14 10:20:35 -07:00
Boris Protopopov	b39c22b73c	Translate sync zio to sync bio Translate zio requests with ZIO_PRIORITY_SYNC_READ and ZIO_PRIORITY_SYNC_WRITE into synchronous bio requests by setting READ_SYNC and WRITE_SYNC flags. Specifically, WRITE_SYNC flag turns out to have a pronounced effect when writing to an SSD-based SLOG. When WRITE_SYNC is not set (WRITE is set instead), the block trace for a SLOG device looks as follows: ... 130,96 0 3 0.008968390 0 C W 830464 + 136 [0] 130,96 0 4 0.011999161 0 C W 830720 + 136 [0] 130,96 0 5 0.023955549 0 C W 831744 + 136 [0] 130,96 0 6 0.024337663 19775 A W 832000 + 136 <- (130,97) 829952 130,96 0 7 0.024338823 19775 Q W 832000 + 136 [z_wr_iss/6] 130,96 0 8 0.024340523 19775 G W 832000 + 136 [z_wr_iss/6] 130,96 0 9 0.024343187 19775 P N [z_wr_iss/6] 130,96 0 10 0.024344120 19775 I W 832000 + 136 [z_wr_iss/6] 130,96 0 11 0.026784405 0 UT N [swapper] 1 130,96 0 12 0.026805339 202 U N [kblockd/0] 1 130,96 0 13 0.026807199 202 D W 832000 + 136 [kblockd/0] 130,96 0 14 0.026966948 0 C W 832000 + 136 [0] 130,96 3 1 0.000449358 19788 A W 829952 + 136 <- (130,97) 827904 130,96 3 2 0.000450951 19788 Q W 829952 + 136 [z_wr_iss/19] 130,96 3 3 0.000453212 19788 G W 829952 + 136 [z_wr_iss/19] 130,96 3 4 0.000455956 19788 P N [z_wr_iss/19] 130,96 3 5 0.000457076 19788 I W 829952 + 136 [z_wr_iss/19] 130,96 3 6 0.002786349 0 UT N [swapper] 1 ... Here 130,97 is the partition created on the log device when adding it to the pool, whereas the base device is 130,96. As one can see, the writes to the SLOG are not marked synchronous (the S is missing next to W), and the queue unplugs occur based on the timer (UT event), resulting in slightly over 2 msec latency of writes. This results in sub-par performance of single stream synchronous writes (limited by latency of the SLOG). When the WRITE_SYNC is set, a similar trace looks as follows: ... 
130,96 4 1 0.000000000 70714 A WS 4280576 + 136 <- (130,97) 4278528 130,96 4 2 0.000000832 70714 Q WS 4280576 + 136 [(null)] 130,96 4 3 0.000002109 70714 G WS 4280576 + 136 [(null)] 130,96 4 4 0.000003394 70714 P N [(null)] 130,96 4 5 0.000003846 70714 I WS 4280576 + 136 [(null)] 130,96 4 6 0.000004854 70714 D WS 4280576 + 136 [(null)] 130,96 5 1 0.000354487 70713 A WS 4280832 + 136 <- (130,97) 4278784 130,96 5 2 0.000355072 70713 Q WS 4280832 + 136 [(null)] 130,96 5 3 0.000356383 70713 G WS 4280832 + 136 [(null)] 130,96 5 4 0.000357635 70713 P N [(null)] 130,96 5 5 0.000358088 70713 I WS 4280832 + 136 [(null)] 130,96 5 6 0.000359191 70713 D WS 4280832 + 136 [(null)] 130,96 0 76 0.000159539 0 C WS 4280576 + 136 [0] 130,96 16 85 0.000742108 70718 A WS 4281088 + 136 <- (130,97) 4279040 130,96 16 86 0.000743197 70718 Q WS 4281088 + 136 [z_wr_iss/15] 130,96 16 87 0.000744450 70718 G WS 4281088 + 136 [z_wr_iss/15] 130,96 16 88 0.000745817 70718 P N [z_wr_iss/15] 130,96 16 89 0.000746705 70718 I WS 4281088 + 136 [z_wr_iss/15] 130,96 16 90 0.000747848 70718 D WS 4281088 + 136 [z_wr_iss/15] 130,96 0 77 0.000604063 0 C WS 4280832 + 136 [0] 130,96 0 78 0.000899858 0 C WS 4281088 + 136 [0] As one can see, all the writes are synchronous (WS), and I/O completions (e.g. from issue I to completion C) take 160-250 usec, or about 10x faster. Since WRITE_SYNC or READ_SYNC flags are among several factors that are considered when processing bio requests, it seems prudent to mark all the zio requests of synchronous priority with the READ/WRITE_SYNC flags to make them eligible for consideration as such by the Linux block I/O layer. Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3529	2015-07-13 14:28:50 -07:00
Brian Behlendorf	2b7b78fa5d	Fix switch-bool warning As of gcc version 5.1.1 a new warning has been added to detect the use of a boolean in a switch statement (-Wswitch-bool). Resolve the warning by explicitly casting the value to an integer type. zfs-0.6.4/module/zfs/zvol.c: In function 'zvol_request': error: switch condition has boolean value [-Werror=switch-bool] Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-07-13 13:03:01 -07:00
Justin T. Gibbs	99197f034e	Illumos 5661 - ZFS: "compression = on" should use lz4 if feature is enabled 5661 ZFS: "compression = on" should use lz4 if feature is enabled Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Reviewed by: Xin LI <delphij@freebsd.org> Approved by: Robert Mustacchi <rm@joyent.com> References: https://github.com/illumos/illumos-gate/commit/db1741f https://www.illumos.org/issues/5661 Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3571	2015-07-10 12:11:45 -07:00
Tim Chase	1cd777340b	Prevent reclaim in metaslab preload threads Reclaim during metaslab preloading can cause deadlocks involving znode z_lock and ARC buffer header ht_lock. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3532.	2015-07-06 09:36:13 -07:00
Alexander Motin	e16b3fcc61	Illumos 5008 - lock contention (rrw_exit) while running a read only load 5008 lock contention (rrw_exit) while running a read only load Reviewed by: Matthew Ahrens <matthew.ahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Richard Yao <ryao@gentoo.org> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Approved by: Garrett D'Amore <garrett@damore.org> Porting notes: This patch ported perfectly cleanly to ZoL. During testing 100% cached small-block reads, extreme contention was noticed on rrl->rr_lock from rrw_exit() due to the frequent entering and leaving ZPL. Illumos picked up this patch from FreeBSD and it also helps under Linux. On a 1-minute 4K cached read test with 10 fio processes pinned to a single socket on a 4-socket (10 thread per socket) NUMA system, contentions on rrl->rr_lock were reduced from 508799 to 43085. Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3555	2015-07-06 09:34:13 -07:00
Matthew Ahrens	4bda3bd0e7	Illumos 5911 - ZFS "hangs" while deleting file 5911 ZFS "hangs" while deleting file Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Reviewed by: Alek Pinchuk <alek@nexenta.com> Reviewed by: Simon Klinkert <simon.klinkert@gmail.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/5911 https://github.com/illumos/illumos-gate/commit/46e1baa Porting notes: Resolved the "ISO C90 forbids mixed declarations and code" warning in the dnode_free_range() function. Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3554	2015-07-06 09:31:42 -07:00
Arne Jansen	5e8cd5d17f	Illumos 5981 - Deadlock in dmu_objset_find_dp 5981 Deadlock in dmu_objset_find_dp Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/5981 https://github.com/illumos/illumos-gate/commit/1d3f896 Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3553	2015-07-06 09:31:35 -07:00
Andriy Gapon	71e2fe41be	Illumos 5946, 5945 5946 zfs_ioc_space_snaps must check that firstsnap and lastsnap refer to snapshots 5945 zfs_ioc_send_space must ensure that fromsnap refers to a snapshot Reviewed by: Steven Hartland <killing@multiplay.co.uk> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Gordon Ross <gordon.ross@nexenta.com> References: https://www.illumos.org/issues/5946 https://www.illumos.org/issues/5945 https://github.com/illumos/illumos-gate/commit/24218be Ported-by: Andriy Gapon <avg@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3552	2015-07-06 09:31:30 -07:00
Andriy Gapon	b6640117f0	Illumos 5870 - dmu_recv_end_check() leaks origin_head hold if error happens in drc_force branch 5870 dmu_recv_end_check() leaks origin_head hold if error happens in drc_force branch Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5870 https://github.com/illumos/illumos-gate/commit/beddaa9 Ported-by: Andriy Gapon <avg@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3551	2015-07-06 09:22:18 -07:00
Andriy Gapon	fec417097b	Illumos 5909 - ensure that shared snap names don't become too long after promotion 5909 ensure that shared snap names don't become too long after promotion Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5909 https://github.com/illumos/illumos-gate/commit/cb5842f Ported-by: Andriy Gapon <avg@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3550	2015-07-06 09:21:30 -07:00
Andriy Gapon	cf50a2b08f	Illumos 5912 - full stream can not be force-received into a dataset if it has a snapshot 5912 full stream can not be force-received into a dataset if it has a snapshot Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5912 https://github.com/illumos/illumos-gate/commit/5bae108 Ported-by: Andriy Gapon <avg@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3549	2015-07-06 09:20:18 -07:00
Alek Pinchuk	a7b10a9319	Illumos 6033 - arc_adjust() should search MFU lists 6033 arc_adjust() should search MFU lists for oldest buffer when adjusting MFU size Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: Xin Li <delphij@delphij.net> Reviewed by: Prakash Surya <me@prakashsurya.com> Approved by: Matthew Ahrens <mahrens@delphix.com> References: https://www.illumos.org/issues/6033 https://github.com/illumos/illumos-gate/commit/31c46cf Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3545	2015-07-01 11:09:15 -07:00
Matthew Ahrens	804e050457	Illumos 5175 - implement dmu_read_uio_dbuf() to improve cached read performance 5175 implement dmu_read_uio_dbuf() to improve cached read performance Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/5175 https://github.com/illumos/illumos-gate/commit/f8554bb Porting notes: This patch doesn't include the changes for the COMSTAR (Common Multiprotocol SCSI Target) - since it's not available for ZoL. http://thegreyblog.blogspot.co.at/2010/02/setting-up-solaris-comstar-and.html Ported by: kernelOfTruth <kerneloftruth@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3392	2015-06-29 14:33:23 -07:00
Matthew Ahrens	c52fca13a0	Illumos 5368 - ARC should cache more metadata 5368 ARC should cache more metadata Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5368 https://github.com/illumos/illumos-gate/commit/3a5286a Porting Notes: The vast majority of this patch was already merged in the context of the `06358ea` changes. This is just a small hunk which was missed. Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-25 08:58:17 -07:00
George Wilson	669dedb33f	Illumos 5163 - arc should reap range_seg_cache 5163 arc should reap range_seg_cache Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5163 https://github.com/illumos/illumos-gate/commit/83803b5 Porting Notes: Added umem_cache_reap_now() wrapper to suppress unused variable warning for user space build in arc_kmem_reap_now(). Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-25 08:58:16 -07:00
Brian Behlendorf	aa9af22cdf	Update all default taskq settings Over the years the default values for the taskqs used on Linux have differed slightly from illumos. In the vast majority of cases this was done to avoid creating an obnoxious number of idle threads which would pollute the process listing. With the addition of support for dynamic taskqs all multi-threaded queues should be created as dynamic taskqs. This allows us to get the best of both worlds. * The illumos default values for the I/O pipeline can be restored. These values are known to work well for most workloads. The only exception is the zio write interrupt taskq which is changed to ZTI_P(12, 8). At least under Linux more threads have been shown to improve performance, see commit `7e55f4e`. * Reduces the number of idle threads on the system when it's not under heavy load. The maximum number of threads will only be created when they are required. * Remove the vdev_file_taskq and rely on the system_taskq instead which is now dynamic and may have up to 64 threads. Again this brings us back in line with upstream. * Tasks dispatched with taskq_dispatch_ent() are allowed to use dynamic taskqs. The Linux taskq implementation supports this. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #3507	2015-06-25 08:58:16 -07:00
Andriy Gapon	ef56b0780c	Account for ashift when gathering buffers to be written to l2arc device If we don't account for that, then we might end up overwriting disk area of buffers that have not been evicted yet, because l2arc_evict operates in terms of disk addresses. The discrepancy between the write size calculation and the actual increment to l2ad_hand was introduced in commit `3a17a7a9`. The change that introduced l2ad_hand alignment was almost correct as the write size was accumulated as a sum of rounded buffer sizes. See commit illumos/illumos-gate@e14bb32. Also, we now consistently use asize / a_sz for the allocated size and psize / p_sz for the physical size. The latter accounts for a possible size reduction because of the compression, whereas the former accounts for a possible subsequent size expansion because of the alignment requirements. The code still assumes that either underlying storage subsystems or hardware is able to do read-modify-write when an L2ARC buffer size is not a multiple of a disk's block size. This is true for 4KB sector disks that provide 512B sector emulation, but may not be true in general. In other words, we currently do not have any code to make sure that an L2ARC buffer, whether compressed or not, which is used for physical I/O has a suitable size. Note that currently the cache device utilization is calculated based on the physical size, not the allocated size. The same applies to l2_asize kstat. That is wrong, but this commit does not fix that. The accounting problem was introduced partially in commit `3a17a7a9` and partially in 3038a2b (accounting became consistent but in favour of the wrong size). Porting Notes: Reworked to be C90 compatible and the 'write_psize' variable was removed because it is now unused. 
References: https://reviews.csiden.org/r/229/ https://reviews.freebsd.org/D2764 Ported-by: kernelOfTruth <kerneloftruth@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3400 Closes #3433 Closes #3451	2015-06-25 08:57:16 -07:00
Prakash Surya	d962d5dad9	Illumos 5701 - zpool list reports incorrect "alloc" value for cache devices 5701 zpool list reports incorrect "alloc" value for cache devices Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5701 https://github.com/illumos/illumos-gate/commit/a52fc31 Porting Notes: arc_space_return(HDR_L2ONLY_SIZE, ARC_SPACE_L2HDRS); correctly placed at arc_hdr_l2hdr_destroy(arc_buf_hdr_t *hdr). Ported by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-25 08:51:44 -07:00
Richard Yao	72540ea314	zfsdev_getminor() should check for invalid file handles Unit testing at ClusterHQ found that passing an invalid file handle to zfs_ioc_hold results in a NULL pointer dereference on a system without assertions: IP: [<ffffffffa0218aa0>] zfsdev_getminor+0x10/0x20 [zfs] Call Trace: [<ffffffffa021b4b0>] zfs_onexit_fd_hold+0x20/0x40 [zfs] [<ffffffffa0214043>] zfs_ioc_hold+0x93/0xd0 [zfs] [<ffffffffa0215890>] zfsdev_ioctl+0x200/0x500 [zfs] An assertion would have caught this had they been enabled, but this is something that the kernel module should handle without failing. We resolve this by searching the linked list to ensure that the file handle's private_data points to a valid zfsdev_state_t. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Andriy Gapon <avg@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3506	2015-06-22 17:02:13 -07:00
Etienne Dechamps	99b14de421	Make metaslab_aliquot a module parameter. This seems generally useful. metaslab_aliquot is the ZFS allocation granularity, which is roughly equivalent to what is called the stripe size in traditional RAID arrays. It seems relevant to performance tuning. Signed-off-by: Etienne Dechamps <etienne@edechamps.fr> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-22 14:19:38 -07:00
Etienne Dechamps	e8fe6684a5	Document metaslab_aliquot. Signed-off-by: Etienne Dechamps <etienne@edechamps.fr> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-22 14:19:31 -07:00
Etienne Dechamps	bb3250d07e	Allocate disk space fairly in the presence of vdevs of unequal size. The metaslab allocator device selection algorithm contains a bias mechanism whose goal is to achieve roughly equal disk space usage across all top-level vdevs. It seems that the initial rationale for this code was to allow newly added (empty) vdevs to "come up to speed" faster in an attempt to make the pool quickly converge to a steady state where all vdevs are equally utilized. While the code seems to work reasonably well for this use case, there is another scenario in which this algorithm fails miserably: the case where top-level vdevs don't have the same sizes (capacities). ZFS allows this, and it is a good feature to have, so that users who simply want to build a pool with the disks they happen to have lying around can do so even if the disks have heterogeneous sizes. Here's a script that simulates a pool with two vdevs, with one 4X larger than the other: dd if=/dev/zero of=/tmp/d1 bs=1 count=1 seek=134217728 dd if=/dev/zero of=/tmp/d2 bs=1 count=1 seek=536870912 zpool create testspace /tmp/d1 /tmp/d2 dd if=/dev/zero of=/testspace/foobar bs=1M count=256 zpool iostat -v testspace Before this commit, the script would output the following: capacity pool alloc free ---------- ----- ----- testspace 252M 375M /tmp/d1 104M 18.5M /tmp/d2 148M 356M ---------- ----- ----- This demonstrates that the current code handles this situation very poorly: d1 shows 85% usage despite the pool itself being only 40% full. d1 is quite saturated at this point, and is slowing down the entire pool due to saturation, fragmentation and the like. In contrast, here's the result with the code in this commit: capacity pool alloc free ---------- ----- ----- testspace 252M 375M /tmp/d1 56.7M 66.3M /tmp/d2 195M 309M ---------- ----- ----- This looks much better. d1 is 46% used, which is close to the overall pool utilization (40%). 
The code still doesn't result in perfectly balanced allocation, probably because of the way mg_bias is applied which does not guarantee perfect accuracy, but this is still much better than before. Signed-off-by: Etienne Dechamps <etienne@edechamps.fr> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3389	2015-06-22 14:18:29 -07:00
Brian Behlendorf	218b4e0a76	Add zfs_sb_prune_aliases() function For kernels which do not implement a per-superblock shrinker, those older than Linux 3.1, the shrink_dcache_parent() function was used to attempt to reclaim dentries. This was found not to be entirely reliable and could lead to performance issues on older kernels running meta-data heavy workloads. To address this issue a zfs_sb_prune_aliases() function has been added to implement this functionality. It relies on traversing the list of znodes for a filesystem and adding them to a private list with a reference held. The private list can then be safely walked outside the z_znodes_lock to prune dentries and drop the last reference so the inode can be freed. This provides the same synchronous behavior as the per-filesystem shrinker and has the advantage of depending on only long standing interfaces. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #3501	2015-06-22 10:22:49 -07:00
Brian Behlendorf	4c6a700910	Increase the number of iput taskq threads The number of threads in the iput taskq has been increased to raise the rate at which iputs can be handled. This has been observed to improve metadata reclaim regardless of the zfs_sb_prune() implementation in use. The taskq has also been renamed to z_iput for consistency with the rest of the I/O pipeline taskqs which are all named z_*. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com>	2015-06-22 10:22:10 -07:00
Matus Kral	57ae840077	Linux 4.1 compat: use read_iter() / write_iter() Linux 3.15 commit torvalds/linux@293bc98 introduced two new methods. The ->read_iter() and ->write_iter() methods were designed to replace the ->aio_read() and ->aio_write() interfaces. Both interfaces were preserved for several kernel releases in order to migrate all existing consumers to the new interfaces. But as of Linux 4.1 the legacy interface has been retired and the ZFS code must be updated to use the new interfaces. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3352	2015-06-18 12:06:59 -07:00
Tim Chase	90947b2357	3.12 compat, NUMA-aware per-superblock shrinker Kernels >= 3.12 have a NUMA-aware superblock shrinker which is used in ZoL by zfs_sb_prune(). This patch calls the shrinker for each on-line NUMA node in order that memory be freed for each one. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3495	2015-06-17 10:43:13 -07:00
Brian Behlendorf	8e70975f90	Wait interruptibly in prefetch thread The Linux kernel watchdog will automatically dump a backtrace for any process which sleeps for over 120s in an uninterruptible state. The solution is for the prefetch thread to sleep in an interruptible state. The way the existing code was written this is safe because when woken it will always reevaluate its conditional. As a general rule it is preferable to sleep in an interruptible state when possible. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3450 Closes #3402	2015-06-16 16:18:11 -07:00
Brian Behlendorf	b64ccd6c52	Rename cv_wait_interruptible() to cv_wait_sig() This is the counterpart to zfsonlinux/spl@2345368 which replaces the cv_wait_interruptible() function with cv_wait_sig(). There is no functional change; the patch merely brings the function names into sync to maximize portability. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3450 Issue #3402	2015-06-11 10:50:47 -07:00
Tim Chase	121b3cae74	Increase arc_c_min to allow safe operation of arc_adapt() ZoL had lowered the minimum ARC size to 4MiB to better accommodate tiny systems such as the Raspberry Pi, however, as of the addition of large block support, the arc_adapt() function depends on arc_c being >= 32MiB (2 * SPA_MAXBLOCKSIZE). This patch raises the minimum ARC size to 32MiB and adds a VERIFY test to arc_adapt() for future-proofing. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-11 10:27:25 -07:00
Brian Behlendorf	f604673836	Make arc_prune() asynchronous As described in the comment above arc_adapt_thread() it is critical that the arc_adapt_thread() function never sleep while holding a hash lock. This behavior was possible in the Linux implementation because the arc_prune() logic was implemented to be synchronous. Under illumos the analogous dnlc_reduce_cache() function is asynchronous. To address this the arc_do_user_prune() function has been reworked into two new functions as follows: * arc_prune_async() is an asynchronous implementation which dispatches the prune callback to be run by the system taskq. This makes it suitable to use in the context of the arc_adapt_thread(). * arc_prune() is a synchronous implementation which depends on the arc_prune_async() implementation but blocks until the outstanding callbacks complete. This is used in arc_kmem_reap_now() where it is safe, and expected, that memory will be freed. This patch additionally adds the zfs_arc_meta_strategy module option which allows the meta reclaim strategy to be configured. It defaults to a balanced strategy which has been proved to work well under Linux but the illumos meta-only strategy can be enabled. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-11 10:27:25 -07:00
Brian Behlendorf	c5528b9ba6	Use taskq_wait_outstanding() function Replace taskq_wait() with taskq_wait_outstanding(). This way callers will only block until previously submitted tasks have been completed. This was the previous behavior of taskq_wait() prior to the introduction of taskq_wait_outstanding() so this isn't really a functional change for these callers. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-11 10:27:25 -07:00
Prakash Surya	ca0bf58d65	Illumos 5497 - lock contention on arcs_mtx Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Approved by: Dan McDonald <danmcd@omniti.com> Porting notes and other significant code changes: The illumos 5368 patch (ARC should cache more metadata), which was never picked up by ZoL, is mostly reverted by this patch. Since ZoL relies on the kernel asynchronously calling the shrinker to actually reap memory, the shrinker wakes up arc_reclaim_waiters_cv every time it runs. The arc_adapt_thread() function no longer calls arc_do_user_evicts() since the newly-added arc_user_evicts_thread() calls it periodically. Notable conflicting ZoL commits which conflicted with this patch or whose effects are either duplicated or un-done by this patch: `302f753` - Integrate ARC more tightly with Linux `39e055c` - Adjust arc_p based on "bytes" in arc_shrink `f521ce1` - Allow "arc_p" to drop to zero or grow to "arc_c" `77765b5` - Remove "arc_meta_used" from arc_adjust calculation `94520ca` - Prune metadata from ghost lists in arc_adjust_meta Trace support for multilist_insert() and multilist_remove() has been added and produces the following output: fio-12498 [077] .... 112936.448324: zfs_multilist__insert: ml { offset 240 numsublists 80 sublistidx 63 } fio-12498 [077] .... 112936.448347: zfs_multilist__remove: ml { offset 240 numsublists 80 sublistidx 29 } The following arcstats have been removed: recycle_miss - Used by arcstat.py and arc_summary.py, both of which have been updated appropriately. l2_writes_hdr_miss The following arcstats have been added: evict_not_enough - Number of times arc_evict_state() was unable to evict enough buffers to reach its target amount. evict_l2_skip - Number of times arc_evict_hdr() skipped eviction because it was being written to the l2arc. l2_writes_lock_retry - Replaces l2_writes_hdr_miss. 
Number of times l2arc_write_done() failed to acquire hash_lock (and retries). arc_meta_min - Shows the value of the zfs_arc_meta_min module parameter (see below). The "index" column of the "dbuf" kstat has been removed since it doesn't have a direct analog in the new multilist scheme. Additional multilist-related stats could be added in the future but would likely require extensions to the multilist API. The following module parameters have been added: zfs_arc_evict_batch_limit - Number of ARC headers to free per sub-list before moving on to the next sub-list. zfs_arc_meta_min - Enforce a floor on the amount of metadata in the ARC. zfs_arc_num_sublists_per_state - Number of multilist sub-lists per ARC state. zfs_arc_overflow_shift - Controls amount by which the ARC must exceed the target size to be considered "overflowing". Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-11 10:27:25 -07:00
Chris Williamson	b9541d6b7d	Illumos 5408 - managing ZFS cache devices requires lots of RAM 5408 managing ZFS cache devices requires lots of RAM Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Don Brady <dev.fs.zfs@gmail.com> Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Approved by: Garrett D'Amore <garrett@damore.org> Porting notes: Due to the restructuring of the ARC-related structures, this patch conflicts with at least the following existing ZoL commits: `6e1d7276c9` Fix inaccurate arcstat_l2_hdr_size calculations The ARC_SPACE_HDRS constant no longer exists and has been somewhat equivalently replaced by HDR_L2ONLY_SIZE. `e0b0ca983d` Add visibility in to cached dbufs The new layering of l{1,2}arc_buf_hdr_t within the arc_buf_hdr struct requires additional structure member names to be used when referencing the inner items. Also, the presence of L1 or L2 inner member is indicated by flags using the new HDR_HAS_L{1,2}HDR macros. Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-11 10:27:25 -07:00
George Wilson	2a4324141f	Illumos 5369 - arc flags should be an enum 5369 arc flags should be an enum 5370 consistent arc_buf_hdr_t naming scheme Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Approved by: Richard Lowe <richlowe@richlowe.net> Porting notes: ZoL has moved some ARC definitions into arc_impl.h. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported by: Tim Chase <tim@chase2k.com>	2015-06-11 10:27:25 -07:00
Tim Chase	ad4af89561	Partially revert "Add ddt, ddt_entry, and l2arc_hdr caches" This reverts only the l2arc_hdr part of commit `ecf3d9b8e6` in preparation for the illumos 5497 "lock contention on arcs_mtx" patch which does the same thing but uses the newer two-level ARC structure following the Illumos 5408 "managing ZFS cache devices requires lots of RAM" patch. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-11 10:27:25 -07:00
Tim Chase	97639d0a52	Revert "Allow arc_evict_ghost() to only evict meta data" Illumos 5497 "lock contention on arcs_mtx" reworks eviction and obviates the need for this. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-11 10:27:25 -07:00
Tim Chase	f6b3b1f5d6	Revert "fix l2arc compression buffers leak" This reverts commit `037763e44e` in preparation for the illumos 5497 "lock contention on arcs_mtx" patch which includes a fix for this very problem. ZoL had picked up a subset of the illumos 5497 patch to deal with the l2arc compression buffer leak. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-11 10:27:24 -07:00
Tim Chase	7807028ccd	Revert "arc_evict, arc_evict_ghost: reduce stack usage using kmem_zalloc" This reverts commit `16fcdea363` in preparation for the illumos 5497 "lock contention on arcs_mtx" patch which eliminates "marker" within the ARC code. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-11 10:27:24 -07:00
Brian Behlendorf	44de2f02d6	Remove unused variable in vdev_add_child() Commit `c3520e7` restructured vdev_add_child() in such a way that the spa variable was unused during non-debug builds. This is consistent with the upstream illumos code but because ZoL, unlike illumos, is built with all compiler warnings enabled this causes a legitimate warning. Revert this hunk of the patch to keep the build clean. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3432	2015-06-11 10:22:38 -07:00
Matthew Ahrens	c3520e7f1f	Illumos 5818 - zfs {ref}compressratio is incorrect with 4k sector size 5818 zfs {ref}compressratio is incorrect with 4k sector size Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Approved by: Albert Lee <trisk@omniti.com> References: https://www.illumos.org/issues/5818 https://github.com/illumos/illumos-gate/commit/81cd5c5 Ported-by: Don Brady <don.brady@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3432	2015-06-10 16:24:01 -07:00
Arne Jansen	9c43027b3f	Illumos 5269 - zpool import slow 5269 zpool import slow Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5269 https://github.com/illumos/illumos-gate/commit/12380e1e Ported-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3396	2015-06-09 13:48:02 -07:00
Ned Bass	5f8e1e8505	dmu_objset_userquota_get_ids uses dn_bonus unsafely The function dmu_objset_userquota_get_ids() checks and uses dn->dn_bonus outside of dn_struct_rwlock. If the dnode is being freed then the bonus dbuf may be in the process of getting evicted. In this case there is a race that may cause dmu_objset_userquota_get_ids() to access the dbuf after it has been destroyed. To prevent this, ensure that when we are using the bonus dbuf we are either holding a reference on it or have taken dn_struct_rwlock. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3443	2015-06-05 12:41:17 -07:00
Ned Bass	d617648c7f	dbuf_try_add_ref minor bug fixes - Don't check db->db_blkid, but use the blkid argument instead. Checking db->db_blkid may be unsafe since we don't yet have a hold on the dbuf so its validity is unknown. - Call mutex_exit() on found_db, not db, since it's not certain that they point to the same dbuf, and the mutex was taken on found_db. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3443	2015-06-05 12:40:38 -07:00
Matthew Ahrens	e5fd1dd682	Illumos 5243 - zdb -b could be much faster 5243 zdb -b could be much faster Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5243 https://github.com/illumos/illumos-gate/commit/f7950bf Ported-by: Don Brady <don.brady@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3414	2015-05-15 11:14:54 -07:00
Jan Sanislo	79065ed5a4	Return -ESTALE to force lookup for missing NFS file handles There seems to be an annoying problem using NFSv4 to access ZFS file systems under certain circumstances. It's easily reproduced: nfs_client1: mount server:/export /mnt nfs_client1: cd /mnt nfs_client1: echo foo >junk nfs_client1: cat junk foo Now on a different NFSv4 client: nfs_client2: mount server:/export /mnt nfs_client2: cd /mnt nfs_client2: vi junk # Make some changes to /mnt/junk and save # This changes the inode associated with /mnt/junk Now back to the original client: nfs_client1: cat junk cat: junk: No such file or directory Admittedly NFSv4 is not advertised as a cluster file system that maintains a completely coherent view of data across multiple nodes. But it does have some mechanisms built in that try to deal with situations like the above. Namely, it employs specialized file handle lookup routines that return ESTALE when a file handle contains a nonexistent inode value. The ESTALE return triggers a full file path lookup from the client to determine if the file has actually gone away or if the cached file handle is no longer valid. ZFS behavior can be brought into line with other file systems (e.g., ext4) by applying the following patch: Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3224	2015-05-14 11:16:52 -07:00
Antonio Russo	7290cd3c4e	Relax restriction on zfs_ioc_next_obj() iteration Per the documentation for dnode_next_offset in dnode.c, the "txg" parameter specifies a lower bound on which transaction the dnode can be found in. We are interested in all dnodes that are removed between the first and last transaction in the snapshot. It doesn't need to be created in that snapshot to correspond to a removed file. In fact, the behavior of zfs diff in the test case exactly matches this: the transaction that created the data that was deleted in snapshot "2" was produced before, in snapshot "1", definitely predating the first transaction in snapshot "2". Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@onlight.com> Closes #2081	2015-05-14 11:16:08 -07:00
Brian Behlendorf	fd0fd6467b	Remove unused 'dsl_pool_t dp' variable When ASSERTs are compiled out by using the --disable-debug configure option, the local variable 'dsl_pool_t dp' will be unused and generate a compiler warning. Since this variable is only used once in the ASSERT, replace it with 'ds->ds_dir->dd_pool'. This has the additional advantage of potentially saving a few bytes on the stack depending on how gcc decides to compile the function. This issue was not noticed immediately because the automated builders use --enable-debug to make the testing as rigorous as possible. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlap <cdunlap@llnl.gov> Closes #3410	2015-05-14 11:09:47 -07:00

1125 Commits