mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2026-04-17 08:54:52 +03:00

Author	SHA1	Message	Date
Brian Behlendorf	7e8bddd019	Update arc_memory_throttle() to check pageout This brings the behavior of arc_memory_throttle() back in sync with illumos. The updated memory throttling policy roughly goes like this: * Never throttle if more than 10% of memory is free. This threshold is configurable with the zfs_arc_lotsfree_percent module option. * Minimize any throttling of kswapd even when free memory is below the set threshold. Allow it to write out pages as quickly as possible to help alleviate the memory pressure. * Delay all other threads when free memory is below the set threshold in order to avoid compounding the memory pressure. Buffers will be evicted from the ARC to reduce the issue. The Linux specific zfs_arc_memory_throttle_disable module option has been removed in favor of the existing zfs_arc_lotsfree_percent tuning. Setting zfs_arc_lotsfree_percent=0 will have the same effect as zfs_arc_memory_throttle_disable and it was therefore redundant. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3637	2015-07-30 11:52:12 -07:00
Brian Behlendorf	11f552fa90	Update arc_available_memory() to check freemem While Linux doesn't provide detailed information about the state of the VM it does provide us total free pages. This information should be incorporated into the arc_available_memory() calculation rather than solely relying on a signal from direct reclaim. Conceptually this brings arc_available_memory() back in sync with illumos. It is also desirable that the target amount of free memory be tunable on a system. While the default values are expected to work well for most workloads there may be cases where custom values are needed. The zfs_arc_sys_free module option was added for this purpose. zfs_arc_sys_free - The target number of bytes the ARC should leave as free memory on the system. This value can be checked in /proc/spl/kstat/zfs/arcstats and setting this module option will override the default value. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3637	2015-07-30 11:50:22 -07:00
Brian Behlendorf	6339c1b9dc	Bound zvol_threads module option The zvol_threads module option should be bounded to a reasonable range. The taskq must have at least 1 thread and shouldn't have more than 1,024. The default value of 32 is reasonable. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3614	2015-07-29 07:42:11 -07:00
Chunwei Chen	21a96fb635	Fix "BUG: Bad page state" caused by writeback flag Commit `d958324` fixed the deadlock between page lock and range lock by unlocking the page lock before acquiring the range lock. However, this created a new issue #3075. The problem is that we can't set the write back bit before releasing the page lock, so other processes will be unaware that the page is under active write back. They may therefore truncate the page, invalidate the page, or not honor the sync semantics. To work around this problem we re-dirty the page before dropping the page lock. While this doesn't prevent the page from being truncated it does ensure it won't be invalidated. Then the range lock and the page lock are reacquired in the correct deadlock-free order. Once both locks are safely held the page state can be rechecked. If all is well and the page is in the expected state the dirty bit can be removed, the write back bit set, and the page removed from the skip count. If not the page will be handled as appropriate. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3075	2015-07-29 07:38:15 -07:00
Brian Behlendorf	1229323d5f	Align thread priority with Linux defaults Under Linux filesystem threads responsible for handling I/O are normally created with the maximum priority. Non-I/O filesystem processes run with the default priority. ZFS should adopt the same priority scheme under Linux to maintain good performance and so that it will compete fairly when other Linux filesystems are active. The priorities have been updated to the following: $ ps -eLo rtprio,cls,pid,pri,nice,cmd \| egrep 'z_\|spl_\|zvol\|arc\|dbu\|meta' - TS 10743 19 -20 [spl_kmem_cache] - TS 10744 19 -20 [spl_system_task] - TS 10745 19 -20 [spl_dynamic_tas] - TS 10764 19 0 [dbu_evict] - TS 10765 19 0 [arc_prune] - TS 10766 19 0 [arc_reclaim] - TS 10767 19 0 [arc_user_evicts] - TS 10768 19 0 [l2arc_feed] - TS 10769 39 0 [z_unmount] - TS 10770 39 -20 [zvol] - TS 11011 39 -20 [z_null_iss] - TS 11012 39 -20 [z_null_int] - TS 11013 39 -20 [z_rd_iss] - TS 11014 39 -20 [z_rd_int_0] - TS 11022 38 -19 [z_wr_iss] - TS 11023 39 -20 [z_wr_iss_h] - TS 11024 39 -20 [z_wr_int_0] - TS 11032 39 -20 [z_wr_int_h] - TS 11033 39 -20 [z_fr_iss_0] - TS 11041 39 -20 [z_fr_int] - TS 11042 39 -20 [z_cl_iss] - TS 11043 39 -20 [z_cl_int] - TS 11044 39 -20 [z_ioctl_iss] - TS 11045 39 -20 [z_ioctl_int] - TS 11046 39 -20 [metaslab_group_] - TS 11050 19 0 [z_iput] - TS 11121 38 -19 [z_wr_iss] Note that under Linux the meaning of a process's priority is inverted with respect to illumos. High values on Linux indicate a _low_ priority while high values on illumos indicate a _high_ priority. In order to preserve the logical meaning of the minclsyspri and maxclsyspri macros when they are used by the illumos wrapper functions their values have been inverted. This way when changes are merged from upstream illumos we won't need to remember to invert the macro, which could otherwise lead to confusion. This patch depends on https://github.com/zfsonlinux/spl/pull/466.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #3607	2015-07-28 13:36:47 -07:00
Brian Behlendorf	c97d30691c	Check for NULL in dmu_free_long_range_impl() A NULL should never be passed as the dnode_t pointer to the function dmu_free_long_range_impl(). Regardless, because we have a reported occurrence of this let's add some error handling to catch this. Better to report a reasonable error to the caller than panic the system. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3445	2015-07-28 13:30:53 -07:00
Brian Behlendorf	96c080cb9c	Minor style cleanup Address minor differences in style between upstream and ZoL. This patch contains no functional differences and is solely designed to minimize the delta from upstream. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3533	2015-07-23 09:42:54 -07:00
Brian Behlendorf	3056818343	Remove double counting HDR_L2ONLY_SIZE Commit `d962d5d` didn't quite properly resolve the HDR_L2ONLY_SIZE accounting. Accounting is now performed only in the constructor and destructor which is a nice simplification. It should have been removed from the create and destroy functions. This brings us back in sync with upstream. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3533	2015-07-23 09:42:44 -07:00
Brian Behlendorf	8c8af9d807	Add hdr_recl() reclaim callback Originally removed because it wasn't required under Linux. However, there may still be some utility in signaling the arc reclaim thread under Linux via reclaim. This should already have happened by other means but it's harmless and removes another point of divergence with upstream. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3533	2015-07-23 09:42:40 -07:00
Brian Behlendorf	728d6ae91e	Reinstate zfs_arc_p_min_shift Commit `f521ce1` removed the minimum value for "arc_p" allowing it to drop to zero or grow to "arc_c". This was done to improve a specific workload which constantly dirties new "metadata" but also frequently touches a "small" amount of mfu data (e.g. mkdir's). This change may still be desirable but it needs to be re-investigated in the context of the recent ARC changes from upstream. Therefore this code is being restored to facilitate benchmarking. By setting "zfs_arc_p_min_shift=64" we can easily compare the performance. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3533	2015-07-23 09:42:32 -07:00
Prakash Surya	36da08ef9b	Illumos 5817 - change type of arcs_size from uint64_t to refcount_t 5817 change type of arcs_size from uint64_t to refcount_t Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/5817 https://github.com/illumos/illumos-gate/commit/2fd872a Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3533	2015-07-23 09:42:28 -07:00
Prakash Surya	500445c046	Illumos 5445 - Add more visibility via arcstats 5445 Add more visibility via arcstats; specifically arc_state_t stats and differentiate between "data" and "metadata" Reviewed by: Basil Crow <basil.crow@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Bayard Bell <bayard.bell@nexenta.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/5445 https://github.com/illumos/illumos-gate/commit/4076b1b Porting Notes: This patch is an improved version of `cc7f677` which was previously merged in ZoL. This patch incorporates the additional improvements which were made upstream. Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3533	2015-07-23 09:42:06 -07:00
Matthew Ahrens	ca67b33aba	Illumos 5376 - arc_kmem_reap_now() should not result in clearing arc_no_grow 5376 arc_kmem_reap_now() should not result in clearing arc_no_grow Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Reviewed by: Richard Elling <richard.elling@richardelling.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5376 https://github.com/illumos/illumos-gate/commit/2ec99e3 Porting Notes: The good news is that many of the recent changes made upstream to the ARC tackled issues previously observed by ZoL with similar solutions. The bad news is those solutions weren't identical to the ones we applied. This patch is designed to split the difference and apply as much of the upstream work as possible. * The arc_available_memory() function was removed previously in ZoL but due to the upstream changes it makes sense to add it back. This function has been customized for Linux so that it can be used to determine a low memory state. This provides the same basic functionality as the illumos version allowing us to minimize changes through the rest of the code base. The exact mechanism used to detect a low memory state remains unchanged so this change isn't as significant as it might first appear. * This patch includes the long standing fix for arc_shrink() which was originally proposed in #2167. Since there were related changes to this function it made sense to include that work. * The arc_init() function has been re-factored. As before it sets sane default values for the ARC but then calls arc_tuning_update() to apply user specific tuning made via module options. The arc_tuning_update() function is then called periodically by the arc_reclaim_thread() to apply changes to the tunings made during normal operation. Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3616 Closes #2167	2015-07-23 09:41:28 -07:00
Brian Behlendorf	53b1d9794e	Add logic to try and recover an inode with an invalid mode When an inode is detected with invalid mode bits the safe thing to do is panic the system. This indicates a problem with the contents of a dnode and it should never be possible. This is the default behavior. Unfortunately, due to flaws in the system attribute (SA) implementation (on all platforms) it was possible that ZFS could create a damaged dnode. This was a rare issue which only impacted dnodes which used a spill block. Normally only symlinks and files with ACLs would require a spill block. However, if the dataset had the xattr=sa property set and extended attributes were used this problem could occur. As of the 0.6.4 tag the root cause of this issue has been fixed. For pools which are exhibiting this damage the 'zfs_recover=1' module option may be set. This will cause ZFS to interpret the dnode with invalid mode bits as a normal file. This may allow the files to be accessed for recovery purposes. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3548	2015-07-17 15:33:35 -07:00
Turbo Fredriksson	47a4a6fd5f	Support parallel build trees (VPATH builds) Build products from an out of tree build should be written relative to the build directory. Sources should be referred to by their locations in the source directory. This is accomplished by adding the 'src' and 'obj' variables for the module Makefile.am, using relative paths to reference source files, and by setting VPATH when source files are not co-located with the Makefile. This enables the following: $ mkdir build $ cd build $ ../configure \ --with-spl=$HOME/src/git/spl/ \ --with-spl-obj=$HOME/src/git/spl/build $ make -s This change also has the advantage of resolving the following warning which is generated by modern versions of automake. Makefile.am:00: warning: source file 'xxx' is in a subdirectory, Makefile.am:00: but option 'subdir-objects' is disabled Signed-off-by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1082	2015-07-17 13:42:51 -07:00
Brian Behlendorf	2a53e2dacc	Update inode under range lock After a successful write the inode must be updated under the range lock. If it is updated after dropping the lock there exists a race where the znode and inode will disagree about the file size. This could result in a narrow window of time where read(2) is able to access data beyond what fstat(2) reports as the file size. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #3601	2015-07-17 09:18:22 -07:00
Brian Behlendorf	bd29109f1a	Linux 4.2 compat: follow_link() / put_link() As of Linux 4.2 the kernel has completely retired the nameidata structure. Among the few remaining consumers of this interface were the follow_link() and put_link() callbacks. This patch adds the required checks to configure to detect the interface change and updates the functions accordingly. Migrating to the simple_follow_link() interface was considered but was decided against ironically due to the increased complexity. It also should be noted that the kernel follow_link() and put_link() interfaces changed several times after 4.1 but before 4.2. This means there is a narrow range of kernel commits that never appear in an official tag of the Linux kernel and against which ZoL will not build. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Issue #3596	2015-07-17 09:18:16 -07:00
Brian Behlendorf	7eb333fbdd	Linux 4.2 compat: remove bio->bi_cnt access Linux 4.2 commit torvalds/linux@dac5621 renamed bio->bi_cnt to bio->__bi_cnt. Because this value is only used once in a block of debug code it is simplest just to remove the PANIC. To my knowledge this debugging has never been hit or proved useful so this is no great loss. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #3596	2015-07-17 09:16:08 -07:00
Matthew Ahrens	905edb405d	Illumos 5347 - idle pool may run itself out of space 5347 idle pool may run itself out of space Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Reviewed by: Richard Elling <richard.elling@richardelling.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://github.com/illumos/illumos-gate/commit/231aab8 https://github.com/illumos/illumos-gate/commit/4a92375 3642 https://www.illumos.org/issues/5347 https://github.com/zfsonlinux/zfs/commit/89b1cd6 (partial commit & fix) https://github.com/zfsonlinux/zfs/commit/fbeddd6 Illumos 4390 https://github.com/zfsonlinux/zfs/commit/2696dfa Illumos 3642, 3643 Porting notes: This is completing the partial fix from FreeBSD Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3586	2015-07-14 10:35:21 -07:00
Alexander Eremin	1cddb8c9ff	Illumos 5610 - zfs clone from different source and target pools produces coredump 5610 zfs clone from different source and target pools produces coredump Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://github.com/illumos/illumos-gate/commit/03b1c29 https://www.illumos.org/issues/5610 https://www.illumos.org/issues/5824 https://github.com/zfsonlinux/zfs/issues/2911 https://github.com/zfsonlinux/zfs/commit/9063f65 Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3584	2015-07-14 10:27:46 -07:00
Richard Yao	0de7c552b6	Failure of userland copy should return EFAULT Many key internal functions pass system return codes that are safe to return to userland. In the case of ddi_copyin(9F), an error passes -1 and the documentation states very clearly that drivers should pass EFAULT to userland when this happens. http://illumos.org/man/9F/ddi_copyin This does not happen in the ZFS source code. I believe it should be changed to pass EFAULT. I caught this when writing man pages for the libzfs_core API. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3575	2015-07-14 10:20:35 -07:00
Boris Protopopov	b39c22b73c	Translate sync zio to sync bio Translate zio requests with ZIO_PRIORITY_SYNC_READ and ZIO_PRIORITY_SYNC_WRITE into synchronous bio requests by setting READ_SYNC and WRITE_SYNC flags. Specifically, WRITE_SYNC flag turns out to have a pronounced effect when writing to an SSD-based SLOG. When WRITE_SYNC is not set (WRITE is set instead), the block trace for a SLOG device looks as follows: ... 130,96 0 3 0.008968390 0 C W 830464 + 136 [0] 130,96 0 4 0.011999161 0 C W 830720 + 136 [0] 130,96 0 5 0.023955549 0 C W 831744 + 136 [0] 130,96 0 6 0.024337663 19775 A W 832000 + 136 <- (130,97) 829952 130,96 0 7 0.024338823 19775 Q W 832000 + 136 [z_wr_iss/6] 130,96 0 8 0.024340523 19775 G W 832000 + 136 [z_wr_iss/6] 130,96 0 9 0.024343187 19775 P N [z_wr_iss/6] 130,96 0 10 0.024344120 19775 I W 832000 + 136 [z_wr_iss/6] 130,96 0 11 0.026784405 0 UT N [swapper] 1 130,96 0 12 0.026805339 202 U N [kblockd/0] 1 130,96 0 13 0.026807199 202 D W 832000 + 136 [kblockd/0] 130,96 0 14 0.026966948 0 C W 832000 + 136 [0] 130,96 3 1 0.000449358 19788 A W 829952 + 136 <- (130,97) 827904 130,96 3 2 0.000450951 19788 Q W 829952 + 136 [z_wr_iss/19] 130,96 3 3 0.000453212 19788 G W 829952 + 136 [z_wr_iss/19] 130,96 3 4 0.000455956 19788 P N [z_wr_iss/19] 130,96 3 5 0.000457076 19788 I W 829952 + 136 [z_wr_iss/19] 130,96 3 6 0.002786349 0 UT N [swapper] 1 ... Here the 130,97 is the partition created on the log device when adding it to the pool, whereas the base device is 130,96. As one can see, the writes to the SLOG are not marked synchronous (the S is missing next to W), and the queue unplugs occur based on the timer (UT event) resulting in slightly over 2 msec latency of writes. This results in sub-par performance of single stream synchronous writes (limited by latency of the SLOG). When the WRITE_SYNC is set, a similar trace looks as follows: ...
130,96 4 1 0.000000000 70714 A WS 4280576 + 136 <- (130,97) 4278528 130,96 4 2 0.000000832 70714 Q WS 4280576 + 136 [(null)] 130,96 4 3 0.000002109 70714 G WS 4280576 + 136 [(null)] 130,96 4 4 0.000003394 70714 P N [(null)] 130,96 4 5 0.000003846 70714 I WS 4280576 + 136 [(null)] 130,96 4 6 0.000004854 70714 D WS 4280576 + 136 [(null)] 130,96 5 1 0.000354487 70713 A WS 4280832 + 136 <- (130,97) 4278784 130,96 5 2 0.000355072 70713 Q WS 4280832 + 136 [(null)] 130,96 5 3 0.000356383 70713 G WS 4280832 + 136 [(null)] 130,96 5 4 0.000357635 70713 P N [(null)] 130,96 5 5 0.000358088 70713 I WS 4280832 + 136 [(null)] 130,96 5 6 0.000359191 70713 D WS 4280832 + 136 [(null)] 130,96 0 76 0.000159539 0 C WS 4280576 + 136 [0] 130,96 16 85 0.000742108 70718 A WS 4281088 + 136 <- (130,97) 4279040 130,96 16 86 0.000743197 70718 Q WS 4281088 + 136 [z_wr_iss/15] 130,96 16 87 0.000744450 70718 G WS 4281088 + 136 [z_wr_iss/15] 130,96 16 88 0.000745817 70718 P N [z_wr_iss/15] 130,96 16 89 0.000746705 70718 I WS 4281088 + 136 [z_wr_iss/15] 130,96 16 90 0.000747848 70718 D WS 4281088 + 136 [z_wr_iss/15] 130,96 0 77 0.000604063 0 C WS 4280832 + 136 [0] 130,96 0 78 0.000899858 0 C WS 4281088 + 136 [0] As one can see, all the writes are synchronous (WS), and I/O completions (e.g. from issue I to completion C) take 160-250 usec, or about 10x faster. Since WRITE_SYNC or READ_SYNC flags are among several factors that are considered when processing bio requests, it seems prudent to mark all the zio requests of synchronous priority with the READ/WRITE_SYNC flags to make them eligible for consideration as such by the Linux block I/O layer. Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3529	2015-07-13 14:28:50 -07:00
Brian Behlendorf	2b7b78fa5d	Fix switch-bool warning As of gcc version 5.1.1 a new warning has been added to detect the use of a boolean in a switch statement (-Wswitch-bool). Resolve the warning by explicitly casting the value to an integer type. zfs-0.6.4/module/zfs/zvol.c: In function 'zvol_request': error: switch condition has boolean value [-Werror=switch-bool] Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-07-13 13:03:01 -07:00
Justin T. Gibbs	99197f034e	Illumos 5661 - ZFS: "compression = on" should use lz4 if feature is enabled 5661 ZFS: "compression = on" should use lz4 if feature is enabled Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Reviewed by: Xin LI <delphij@freebsd.org> Approved by: Robert Mustacchi <rm@joyent.com> References: https://github.com/illumos/illumos-gate/commit/db1741f https://www.illumos.org/issues/5661 Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3571	2015-07-10 12:11:45 -07:00
Josef 'Jeff' Sipek	411bf201f5	Illumos 4745 - fix AVL code misspellings 4745 fix AVL code misspellings Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Approved by: Robert Mustacchi <rm@joyent.com> References: https://github.com/illumos/illumos-gate/commit/6907ca4 https://www.illumos.org/issues/4745 Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3565	2015-07-10 11:58:37 -07:00
Tim Chase	1cd777340b	Prevent reclaim in metaslab preload threads Reclaim during metaslab preloading can cause deadlocks involving znode z_lock and ARC buffer header ht_lock. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3532.	2015-07-06 09:36:13 -07:00
Alexander Motin	e16b3fcc61	Illumos 5008 - lock contention (rrw_exit) while running a read only load 5008 lock contention (rrw_exit) while running a read only load Reviewed by: Matthew Ahrens <matthew.ahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Richard Yao <ryao@gentoo.org> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Approved by: Garrett D'Amore <garrett@damore.org> Porting notes: This patch ported perfectly cleanly to ZoL. During testing 100% cached small-block reads, extreme contention was noticed on rrl->rr_lock from rrw_exit() due to the frequent entering and leaving ZPL. Illumos picked up this patch from FreeBSD and it also helps under Linux. On a 1-minute 4K cached read test with 10 fio processes pinned to a single socket on a 4-socket (10 threads per socket) NUMA system, contentions on rrl->rr_lock were reduced from 508799 to 43085. Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3555	2015-07-06 09:34:13 -07:00
Matthew Ahrens	4bda3bd0e7	Illumos 5911 - ZFS "hangs" while deleting file 5911 ZFS "hangs" while deleting file Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Reviewed by: Alek Pinchuk <alek@nexenta.com> Reviewed by: Simon Klinkert <simon.klinkert@gmail.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/5911 https://github.com/illumos/illumos-gate/commit/46e1baa Porting notes: Resolved the "ISO C90 forbids mixed declarations and code" warning in the dnode_free_range() function. Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3554	2015-07-06 09:31:42 -07:00
Arne Jansen	5e8cd5d17f	Illumos 5981 - Deadlock in dmu_objset_find_dp 5981 Deadlock in dmu_objset_find_dp Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/5981 https://github.com/illumos/illumos-gate/commit/1d3f896 Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3553	2015-07-06 09:31:35 -07:00
Andriy Gapon	71e2fe41be	Illumos 5946, 5945 5946 zfs_ioc_space_snaps must check that firstsnap and lastsnap refer to snapshots 5945 zfs_ioc_send_space must ensure that fromsnap refers to a snapshot Reviewed by: Steven Hartland <killing@multiplay.co.uk> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Gordon Ross <gordon.ross@nexenta.com> References: https://www.illumos.org/issues/5946 https://www.illumos.org/issues/5945 https://github.com/illumos/illumos-gate/commit/24218be Ported-by: Andriy Gapon <avg@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3552	2015-07-06 09:31:30 -07:00
Andriy Gapon	b6640117f0	Illumos 5870 - dmu_recv_end_check() leaks origin_head hold if error happens in drc_force branch 5870 dmu_recv_end_check() leaks origin_head hold if error happens in drc_force branch Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5870 https://github.com/illumos/illumos-gate/commit/beddaa9 Ported-by: Andriy Gapon <avg@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3551	2015-07-06 09:22:18 -07:00
Andriy Gapon	fec417097b	Illumos 5909 - ensure that shared snap names don't become too long after promotion 5909 ensure that shared snap names don't become too long after promotion Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5909 https://github.com/illumos/illumos-gate/commit/cb5842f Ported-by: Andriy Gapon <avg@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3550	2015-07-06 09:21:30 -07:00
Andriy Gapon	cf50a2b08f	Illumos 5912 - full stream can not be force-received into a dataset if it has a snapshot 5912 full stream can not be force-received into a dataset if it has a snapshot Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5912 https://github.com/illumos/illumos-gate/commit/5bae108 Ported-by: Andriy Gapon <avg@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3549	2015-07-06 09:20:18 -07:00
Alek Pinchuk	a7b10a9319	Illumos 6033 - arc_adjust() should search MFU lists 6033 arc_adjust() should search MFU lists for oldest buffer when adjusting MFU size Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: Xin Li <delphij@delphij.net> Reviewed by: Prakash Surya <me@prakashsurya.com> Approved by: Matthew Ahrens <mahrens@delphix.com> References: https://www.illumos.org/issues/6033 https://github.com/illumos/illumos-gate/commit/31c46cf Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3545	2015-07-01 11:09:15 -07:00
Matthew Ahrens	804e050457	Illumos 5175 - implement dmu_read_uio_dbuf() to improve cached read performance 5175 implement dmu_read_uio_dbuf() to improve cached read performance Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/5175 https://github.com/illumos/illumos-gate/commit/f8554bb Porting notes: This patch doesn't include the changes for the COMSTAR (Common Multiprotocol SCSI Target) - since it's not available for ZoL. http://thegreyblog.blogspot.co.at/2010/02/setting-up-solaris-comstar-and.html Ported by: kernelOfTruth <kerneloftruth@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3392	2015-06-29 14:33:23 -07:00
Matthew Ahrens	c52fca13a0	Illumos 5368 - ARC should cache more metadata 5368 ARC should cache more metadata Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5368 https://github.com/illumos/illumos-gate/commit/3a5286a Porting Notes: The vast majority of this patch was already merged in the context of the `06358ea` changes. This is just a small hunk which was missed. Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-25 08:58:17 -07:00
George Wilson	669dedb33f	Illumos 5163 - arc should reap range_seg_cache 5163 arc should reap range_seg_cache Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5163 https://github.com/illumos/illumos-gate/commit/83803b5 Porting Notes: Added umem_cache_reap_now() wrapper to suppress unused variable warning for user space build in arc_kmem_reap_now(). Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-25 08:58:16 -07:00
Brian Behlendorf	aa9af22cdf	Update all default taskq settings Over the years the default values for the taskqs used on Linux have differed slightly from illumos. In the vast majority of cases this was done to avoid creating an obnoxious number of idle threads which would pollute the process listing. With the addition of support for dynamic taskqs all multi-threaded queues should be created as dynamic taskqs. This allows us to get the best of both worlds. * The illumos default values for the I/O pipeline can be restored. These values are known to work well for most workloads. The only exception is the zio write interrupt taskq which is changed to ZTI_P(12, 8). At least under Linux more threads have been shown to improve performance, see commit `7e55f4e`. * Reduces the number of idle threads on the system when it's not under heavy load. The maximum number of threads will only be created when they are required. * Remove the vdev_file_taskq and rely on the system_taskq instead which is now dynamic and may have up to 64 threads. Again this brings us back in line with upstream. * Tasks dispatched with taskq_dispatch_ent() are allowed to use dynamic taskqs. The Linux taskq implementation supports this. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #3507	2015-06-25 08:58:16 -07:00
Andriy Gapon	ef56b0780c	Account for ashift when gathering buffers to be written to l2arc device If we don't account for that, then we might end up overwriting disk area of buffers that have not been evicted yet, because l2arc_evict operates in terms of disk addresses. The discrepancy between the write size calculation and the actual increment to l2ad_hand was introduced in commit `3a17a7a9`. The change that introduced l2ad_hand alignment was almost correct as the write size was accumulated as a sum of rounded buffer sizes. See commit illumos/illumos-gate@e14bb32. Also, we now consistently use asize / a_sz for the allocated size and psize / p_sz for the physical size. The latter accounts for a possible size reduction because of the compression, whereas the former accounts for a possible subsequent size expansion because of the alignment requirements. The code still assumes that either underlying storage subsystems or hardware is able to do read-modify-write when an L2ARC buffer size is not a multiple of a disk's block size. This is true for 4KB sector disks that provide 512B sector emulation, but may not be true in general. In other words, we currently do not have any code to make sure that an L2ARC buffer, whether compressed or not, which is used for physical I/O has a suitable size. Note that currently the cache device utilization is calculated based on the physical size, not the allocated size. The same applies to l2_asize kstat. That is wrong, but this commit does not fix that. The accounting problem was introduced partially in commit `3a17a7a9` and partially in 3038a2b (accounting became consistent but in favour of the wrong size). Porting Notes: Reworked to be C90 compatible and the 'write_psize' variable was removed because it is now unused. 
References: https://reviews.csiden.org/r/229/ https://reviews.freebsd.org/D2764 Ported-by: kernelOfTruth <kerneloftruth@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3400 Closes #3433 Closes #3451	2015-06-25 08:57:16 -07:00
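The distinction between physical size (psize) and allocated size (asize) in commit ef56b0780c can be illustrated with a small sketch. It is not the ZFS code: the `roundup` helper, the `ashift` value, and the buffer sizes are all hypothetical, chosen only to show why a total tallied in physical sizes can understate how much device space aligned buffers occupy.

```python
def roundup(size: int, align: int) -> int:
    """Round size up to the next multiple of align (align must be a power of two)."""
    return (size + align - 1) & ~(align - 1)


# Hypothetical device with 4 KiB sectors (ashift = 12).
ashift = 12
sector = 1 << ashift

# Hypothetical physical sizes of four buffers after compression, in bytes.
psizes = [512, 1536, 4096, 5000]

# Allocated size: each buffer is padded up to the device alignment.
asizes = [roundup(p, sector) for p in psizes]

write_psize = sum(psizes)  # total of physical sizes
write_asize = sum(asizes)  # total of aligned sizes, i.e. space actually consumed

print(write_psize, write_asize)
```

With these made-up numbers the aligned total is almost twice the physical total, so any bookkeeping that advances a device offset by aligned sizes but budgets the write in physical sizes would drift apart.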
Prakash Surya	d962d5dad9	Illumos 5701 - zpool list reports incorrect "alloc" value for cache devices 5701 zpool list reports incorrect "alloc" value for cache devices Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5701 https://github.com/illumos/illumos-gate/commit/a52fc31 Porting Notes: arc_space_return(HDR_L2ONLY_SIZE, ARC_SPACE_L2HDRS); correctly placed at arc_hdr_l2hdr_destroy(arc_buf_hdr_t *hdr). Ported by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-25 08:51:44 -07:00
Richard Yao	72540ea314	zfsdev_getminor() should check for invalid file handles Unit testing at ClusterHQ found that passing an invalid file handle to zfs_ioc_hold results in a NULL pointer dereference on a system without assertions: IP: [<ffffffffa0218aa0>] zfsdev_getminor+0x10/0x20 [zfs] Call Trace: [<ffffffffa021b4b0>] zfs_onexit_fd_hold+0x20/0x40 [zfs] [<ffffffffa0214043>] zfs_ioc_hold+0x93/0xd0 [zfs] [<ffffffffa0215890>] zfsdev_ioctl+0x200/0x500 [zfs] An assertion would have caught this had they been enabled, but this is something that the kernel module should handle without failing. We resolve this by searching the linked list to ensure that the file handle's private_data points to a valid zfsdev_state_t. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Andriy Gapon <avg@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3506	2015-06-22 17:02:13 -07:00
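The fix described in commit 72540ea314 validates a handle by checking that it is a member of a known list before using it. A minimal sketch of that idea follows; the `State` class, the `lookup_minor` function, and the `None` return for an unknown handle are hypothetical stand-ins, not the kernel module's actual types or return values.

```python
class State:
    """Hypothetical stand-in for a per-open state record."""

    def __init__(self, minor: int):
        self.minor = minor


def lookup_minor(known_states, private_data):
    """Return the minor number only if private_data is one of the known states."""
    for state in known_states:
        if state is private_data:  # identity check, like comparing pointers
            return state.minor
    return None  # unknown handle: report failure instead of dereferencing it


known = [State(1), State(2)]
print(lookup_minor(known, known[1]))   # valid handle
print(lookup_minor(known, object()))   # handle that was never registered
```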
Etienne Dechamps	99b14de421	Make metaslab_aliquot a module parameter. This seems generally useful. metaslab_aliquot is the ZFS allocation granularity, which is roughly equivalent to what is called the stripe size in traditional RAID arrays. It seems relevant to performance tuning. Signed-off-by: Etienne Dechamps <etienne@edechamps.fr> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-22 14:19:38 -07:00
Etienne Dechamps	e8fe6684a5	Document metaslab_aliquot. Signed-off-by: Etienne Dechamps <etienne@edechamps.fr> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-22 14:19:31 -07:00
Etienne Dechamps	bb3250d07e	Allocate disk space fairly in the presence of vdevs of unequal size. The metaslab allocator device selection algorithm contains a bias mechanism whose goal is to achieve roughly equal disk space usage across all top-level vdevs. It seems that the initial rationale for this code was to allow newly added (empty) vdevs to "come up to speed" faster in an attempt to make the pool quickly converge to a steady state where all vdevs are equally utilized. While the code seems to work reasonably well for this use case, there is another scenario in which this algorithm fails miserably: the case where top-level vdevs don't have the same sizes (capacities). ZFS allows this, and it is a good feature to have, so that users who simply want to build a pool with the disks they happen to have lying around can do so even if the disks have heterogeneous sizes. Here's a script that simulates a pool with two vdevs, with one 4X larger than the other: dd if=/dev/zero of=/tmp/d1 bs=1 count=1 seek=134217728 dd if=/dev/zero of=/tmp/d2 bs=1 count=1 seek=536870912 zpool create testspace /tmp/d1 /tmp/d2 dd if=/dev/zero of=/testspace/foobar bs=1M count=256 zpool iostat -v testspace Before this commit, the script would output the following: capacity pool alloc free ---------- ----- ----- testspace 252M 375M /tmp/d1 104M 18.5M /tmp/d2 148M 356M ---------- ----- ----- This demonstrates that the current code handles this situation very poorly: d1 shows 85% usage despite the pool itself being only 40% full. d1 is quite saturated at this point, and is slowing down the entire pool due to saturation, fragmentation and the like. In contrast, here's the result with the code in this commit: capacity pool alloc free ---------- ----- ----- testspace 252M 375M /tmp/d1 56.7M 66.3M /tmp/d2 195M 309M ---------- ----- ----- This looks much better. d1 is 46% used, which is close to the overall pool utilization (40%). 
The code still doesn't result in perfectly balanced allocation, probably because of the way mg_bias is applied which does not guarantee perfect accuracy, but this is still much better than before. Signed-off-by: Etienne Dechamps <etienne@edechamps.fr> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3389	2015-06-22 14:18:29 -07:00
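The utilization percentages quoted in commit bb3250d07e follow directly from the alloc and free columns of the two `zpool iostat` listings. The short sketch below redoes that arithmetic; the `used_pct` helper is a hypothetical name, and the figures are the ones from the commit message, in MiB.

```python
def used_pct(alloc: float, free: float) -> int:
    """Percentage of capacity in use, rounded to the nearest whole percent."""
    return round(100 * alloc / (alloc + free))


# (alloc, free) pairs copied from the commit message.
before = {"testspace": (252, 375), "/tmp/d1": (104, 18.5), "/tmp/d2": (148, 356)}
after = {"testspace": (252, 375), "/tmp/d1": (56.7, 66.3), "/tmp/d2": (195, 309)}

for label, table in (("before", before), ("after", after)):
    for name, (alloc, free) in table.items():
        print(label, name, used_pct(alloc, free))
```

The smaller vdev drops from about 85% to about 46% used while the pool as a whole stays at about 40%, which matches the figures in the message.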
Brian Behlendorf	218b4e0a76	Add zfs_sb_prune_aliases() function For kernels which do not implement a per-superblock shrinker, those older than Linux 3.1, the shrink_dcache_parent() function was used to attempt to reclaim dentries. This was found not to be entirely reliable and could lead to performance issues on older kernels running meta-data heavy workloads. To address this issue a zfs_sb_prune_aliases() function has been added to implement this functionality. It relies on traversing the list of znodes for a filesystem and adding them to a private list with a reference held. The private list can then be safely walked outside the z_znodes_lock to prune dentries and drop the last reference so the inode can be freed. This provides the same synchronous behavior as the per-filesystem shrinker and has the advantage of depending on only long standing interfaces. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #3501	2015-06-22 10:22:49 -07:00
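The approach described in commit 218b4e0a76 (take references under the list lock, then do the expensive work on a private list after dropping the lock) is a general pattern. The sketch below shows it in a user-space form only; the `Node` class, its `refs` and `pruned` fields, and `prune_all` are hypothetical and do not mirror the kernel data structures.

```python
import threading


class Node:
    """Hypothetical stand-in for an object tracked on a shared list."""

    def __init__(self, name: str):
        self.name = name
        self.refs = 1        # reference held by the shared list
        self.pruned = False


def prune_all(shared_nodes, lock: threading.Lock) -> int:
    """Prune every node, holding the lock only while building a private list."""
    private = []
    with lock:
        for node in shared_nodes:
            node.refs += 1   # extra reference keeps the node alive after unlock
            private.append(node)

    # The lock is released here; slow work happens on the private list.
    for node in private:
        node.pruned = True   # stand-in for the real pruning step
        node.refs -= 1       # drop the reference taken above
    return len(private)


nodes = [Node("a"), Node("b"), Node("c")]
count = prune_all(nodes, threading.Lock())
print(count)
```

Keeping the locked section to a simple list walk means other threads that need the lock are not blocked for the duration of the pruning work.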
Brian Behlendorf	4c6a700910	Increase the number of iput taskq threads The number of threads in the iput taskq has been increased to raise the rate at which iputs can be handled. This has been observed to improve the meta data reclaim regardless of the zfs_sb_prune() implementation in use. The taskq has also been renamed z_iput for consistency with the rest of the I/O pipeline taskqs which are all named z_*. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com>	2015-06-22 10:22:10 -07:00
Matus Kral	57ae840077	Linux 4.1 compat: use read_iter() / write_iter() Linux 3.15 commit torvalds/linux@293bc98 introduced two new methods. The ->read_iter() and ->write_iter() methods were designed to replace the ->aio_read() and ->aio_write() interfaces. Both interfaces were preserved for several kernel releases in order to migrate all existing consumers to the new interfaces. But as of Linux 4.1 the legacy interface has been retired and the ZFS code must be updated to use the new interfaces. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3352	2015-06-18 12:06:59 -07:00
Tim Chase	90947b2357	3.12 compat, NUMA-aware per-superblock shrinker Kernels >= 3.12 have a NUMA-aware superblock shrinker which is used in ZoL by zfs_sb_prune(). This patch calls the shrinker for each on-line NUMA node in order that memory be freed for each one. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3495	2015-06-17 10:43:13 -07:00
Brian Behlendorf	8e70975f90	Wait interruptibly in prefetch thread The Linux kernel watchdog will automatically dump a backtrace for any process which sleeps for over 120s in an uninterruptible state. The solution is for the prefetch thread to sleep in an interruptible state. The way the existing code was written this is safe because when woken it will always reevaluate its conditional. As a general rule it is preferable to sleep in an interruptible state when possible. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3450 Closes #3402	2015-06-16 16:18:11 -07:00
Brian Behlendorf	b64ccd6c52	Rename cv_wait_interruptible() to cv_wait_sig() This is the counterpart to zfsonlinux/spl@2345368 which replaces the cv_wait_interruptible() function with cv_wait_sig(). There is no functional change; the patch merely brings the function names into sync to maximize portability. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3450 Issue #3402	2015-06-11 10:50:47 -07:00

969 Commits