mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2024-11-17 18:11:00 +03:00

Author	SHA1	Message	Date
Gaurav Kumar	cf2731e65b	arc_meta_limit should be updated when arc_max is changed. When arc_max is increased, arc_meta_limit will not be updated to 3/4 of the new arc_c_max value. This was done originally to preserve any existing maximum value. This turned out to be counter intuitive to users and this fix changes that behavior. If zfs_arc_meta_limit is non-default, it will be picked up later in the ARC tuning function. Signed-off-by: Gaurav Kumar <gaurav.kumar@nutanix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4893	2016-08-02 13:43:36 -07:00
Tim Chase	25458cbef9	Limit the amount of dnode metadata in the ARC Metadata-intensive workloads can cause the ARC to become permanently filled with dnode_t objects as they're pinned by the VFS layer. Subsequent data-intensive workloads may only benefit from about 25% of the potential ARC (arc_c_max - arc_meta_limit). In order to help track metadata usage more precisely, the other_size metadata arcstat has replaced with dbuf_size, dnode_size and bonus_size. The new zfs_arc_dnode_limit tunable, which defaults to 10% of zfs_arc_meta_limit, defines the minimum number of bytes which is desirable to be consumed by dnodes. Attempts to evict non-metadata will trigger async prune tasks if the space used by dnodes exceeds this limit. The new zfs_arc_dnode_reduce_percent tunable specifies the amount by which the excess dnode space is attempted to be pruned as a percentage of the amount by which zfs_arc_dnode_limit is being exceeded. By default, it tries to unpin 10% of the dnodes. The problem of dnode metadata pinning was observed with the following testing procedure (in this example, zfs_arc_max is set to 4GiB): - Create a large number of small files until arc_meta_used exceeds arc_meta_limit (3GiB with default tuning) and arc_prune starts increasing. - Create a 3GiB file with dd. Observe arc_mata_used. It will still be around 3GiB. - Repeatedly read the 3GiB file and observe arc_meta_limit as before. It will continue to stay around 3GiB. With this modification, space for the 3GiB file is gradually made available as subsequent demands on the ARC are made. The previous behavior can be restored by setting zfs_arc_dnode_limit to the same value as the zfs_arc_meta_limit. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4345 Issue #4512 Issue #4773 Closes #4858	2016-07-25 15:26:38 -07:00
Tim Chase	8887c7d778	Prevent null dereferences when accessing dbuf kstat In arc_buf_info(), the arc_buf_t may have no header. If not, don't try to fetch the arc buffer stats and instead just zero them. The null dereferences were observed while accessing the dbuf kstat with awk on a system in which millions of small files were being created in order to overflow the system's metadata limit. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Closes #4837	2016-07-14 16:25:03 -07:00
Paul Dagnelie	bc77ba73fe	OpenZFS 6513 - partially filled holes lose birth time Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Boris Protopopov <bprotopopov@hotmail.com> Approved by: Richard Lowe <richlowe@richlowe.net>a Ported by: Boris Protopopov <bprotopopov@actifio.com> Signed-off-by: Boris Protopopov <bprotopopov@actifio.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6513 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/8df0bcf0 If a ZFS object contains a hole at level one, and then a data block is created at level 0 underneath that l1 block, l0 holes will be created. However, these l0 holes do not have the birth time property set; as a result, incremental sends will not send those holes. Fix is to modify the dbuf_read code to fill in birth time data.	2016-06-21 10:55:13 -07:00
Chunwei Chen	4442f60d8e	Fix arc_prune_task use-after-free arc_prune_task uses a refcount to protect arc_prune_t, but it doesn't prevent the underlying zsb from disappearing if there's a concurrent umount. We fix this by force the caller of arc_remove_prune_callback to wait for arc_prune_taskq to finish. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4687 Closes #4690	2016-05-25 14:11:53 -07:00
Chunwei Chen	a9bb2b6827	Use cv_timedwait_sig_hires in arc_reclaim_thread The was originally using interruptible cv_timedwait_sig, but was changed to uninterruptible cv_timedwait_hires in `ae6d0c6`. Use _sig_hires instead to allow interruptible sleep. Signed-off-by: Chunwei Chen <david.chen@osnexus.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4633 Closes #4634	2016-05-12 14:56:47 -07:00
David Quigley	ae6d0c601e	OpenZFS 6672 - arc_reclaim_thread() should use gethrtime() 6672 arc_reclaim_thread() should use gethrtime() instead of ddi_get_lbolt() Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Reviewed by: Robert Mustacchi <rm@joyent.com> Approved by: Dan McDonald <danmcd@omniti.com> Ported-by: David Quigley <dpquigl@davequigley.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/6672 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/571be5c Closes #4600	2016-05-06 09:35:52 -07:00
Brian Behlendorf	8a09d5fd46	Add l2arc_max_block_size tunable Set a limit for the largest compressed block which can be written to an L2ARC device. By default this limit is set to 16M so there is no change in behavior. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #4323	2016-02-25 09:44:00 -08:00
Tim Chase	0a1f8cd999	Set arc_c_min properly in userland builds Since it's set to arc_c_max / 2, it must be set after arc_c_max is set. Also added protection against it falling below 2 * maxblocksize in userland builds. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4268	2016-01-25 10:25:10 -08:00
Tim Chase	1b8951b319	Prevent arc_c collapse Adjusting arc_c directly is racy because it can happen in the context of multiple threads. It should always be >= 2 * maxblocksize. Set it to a known valid value rather than adjusting it directly. In addition refactor arc_shrink() to a simpler structure, protect against underflow in the calculation of the new arc_c value. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Reverts: `935434ef` Closes: #3904 Closes: #4161	2016-01-25 10:25:04 -08:00
Matthew Ahrens	7f60329a26	Illumos 5987 - zfs prefetch code needs work 5987 zfs prefetch code needs work Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Approved by: Gordon Ross <gordon.ross@nexenta.com> References: https://www.illumos.org/issues/5987 zfs prefetch code needs work illumos/illumos-gate@cf6106c 5987 zfs prefetch code needs work Porting notes: - [module/zfs/dbuf.c] - `5f6d0b6` Handle block pointers with a corrupt logical size - [module/zfs/dmu_zfetch.c] - `c65aa5b` Fix gcc missing parenthesis warnings - `428870f` Update core ZFS code from build 121 to build 141. - `79c76d5` Change KM_PUSHPAGE -> KM_SLEEP - `b8d06fc` Switch KM_SLEEP to KM_PUSHPAGE - Account for ISO C90 - mixed declarations and code - warnings - Module parameters (new/changed): - Replaced zfetch_block_cap with zfetch_max_distance (Max bytes to prefetch per stream (default 8MB; 8 * 1024 * 1024)) - Preserved zfs_prefetch_disable as 'int' for consistency with existing Linux module options. - [include/sys/trace_arc.h] - Added new tracepoints - DEFINE_ARC_BUF_HDR_EVENT(zfs_arc__sync__wait__for__async); - DEFINE_ARC_BUF_HDR_EVENT(zfs_arc__demand__hit__predictive__prefetch); - [man/man5/zfs-module-parameters.5] - Updated man page Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-12 09:02:33 -08:00
Brian Behlendorf	ab5cbbd107	Illumos 6293 - ztest failure: error == 28 (0xc == 0x1c) in ztest_tx_assign() 6293 ztest failure: error == 28 (0xc == 0x1c) in ztest_tx_assign() Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/6293 https://github.com/illumos/illumos-gate/commit/8fe00bf Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2016-01-11 14:10:31 -08:00
Brian Behlendorf	1cdb86cba2	Handle block pointers with a corrupt logical size Commit `5f6d0b6` was originally added to gracefully handle block pointers with a damaged logical size. However, it incorrectly assumed that all passed arc_done_func_t could handle a NULL arc_buf_t. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4069 Closes #4080	2015-12-15 16:11:44 -08:00
tuxoko	43518d92fd	Fix zfs_dirty_data_max overflow on 32-bit On 32 bit, the calculation of zfs_dirty_data_max from phymem will overflow, causing it to be smaller than zfs_dirty_data_sync, and will cause txg being delayed while no one write to disk. The end result is horrendous write speed. On 4G ram 32-bit VM, before this patch, simple dd results in ~7MB/s. Now it can reach speed on par with 64-bit VM. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3973	2015-11-19 16:02:47 -08:00
tuxoko	d0c614ecf9	Fix null pointer in arc_kmem_reap_now on 32-bit On 32 bit system, zio_buf_cache is limit to 1M. Larger than that is all NULL. So we need to avoid reaping them. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3973	2015-11-19 16:01:47 -08:00
AndCycle	256fa983f4	Obey arc_meta_limit default size when changing arc_max When decreasing the maximum ARC size preserve the 3/4 default ratio for the arc_meta_limit. Otherwise, the arc_meta_limit may be set the same as arc_max. Signed-off-by: AndCycle <andcycle@andcycle.idv.tw> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #4001	2015-11-13 15:45:22 -08:00
Brian Behlendorf	935434ef01	Fix 'arc_c < arc_c_min' panic Strictly enforce keeping 'arc_c >= arc_c_min'. The ASSERTs are left in place to catch this in a debug build but logic has been added to gracefully handle in a production build. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3904	2015-10-13 09:23:35 -07:00
Brian Behlendorf	ef5b2e1048	Avoid blocking in arc_reclaim_thread() As described in the comment above arc_reclaim_thread() it's critical that the reclaim thread be careful about blocking. Just like it must never wait on a hash lock, it must never wait on a task which can in turn wait on the CV in arc_get_data_buf(). This will deadlock, see issue #3822 for full backtraces showing the problem. To resolve this issue arc_kmem_reap_now() has been updated to use the asynchronous arc prune function. This means that arc_prune_async() may now be called while there are still outstanding arc_prune_tasks. However, this isn't a problem because arc_prune_async() already keeps a reference count preventing multiple outstanding tasks per registered consumer. Functionally, this behavior is the same as the counterpart illumos function dnlc_reduce_cache(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Issue #3808 Issue #3834 Issue #3822	2015-09-25 12:45:47 -07:00
Arne Jansen	4e0f33ffe0	Illumos 6214 - zpools going south 6214 zpools going south Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> References: https://www.illumos.org/issues/6214 http://cr.illumos.org/~webrev/sensille/6214_zpools_going_south/ Porting Notes: Reintroduce b_compress to the l2arc_buf_hdr_t. In commit `b9541d6` the compression flags were moved to the generic b_flags in the arc_buf_hdr_t. This is a problem because l2arc_compress_buf() may manipulate the compression flags and this can only be done safely under the hash lock which is not held. See Illumos 6214 for a detailed analysis of the race. HDR_GET_COMPRESS() macro was removed from arc_buf_info(). Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3757	2015-09-11 11:14:38 -07:00
Brian Behlendorf	7e8bddd019	Update arc_memory_throttle() to check pageout This brings the behavior of arc_memory_throttle() back in sync with illumos. The updated memory throttling policy roughly goes like this: * Never throttle if more than 10% of memory is free. This threshold is configurable with the zfs_arc_lotsfree_percent module option. * Minimize any throttling of kswapd even when free memory is below the set threshold. Allow it to write out pages as quickly as possible to help alleviate the memory pressure. * Delay all other threads when free memory is below the set threshold in order to avoid compounding the memory pressure. Buffers will be evicted from the ARC to reduce the issue. The Linux specific zfs_arc_memory_throttle_disable module option has been removed in favor of the existing zfs_arc_lotsfree_percent tuning. Setting zfs_arc_lotsfree_percent=0 will have the same effect as zfs_arc_memory_throttle_disable and it was therefore redundant. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3637	2015-07-30 11:52:12 -07:00
Brian Behlendorf	11f552fa90	Update arc_available_memory() to check freemem While Linux doesn't provide detailed information about the state of the VM it does provide us total free pages. This information should be incorporated in to the arc_available_memory() calculation rather than solely relying on a signal from direct reclaim. Conceptually this brings arc_available_memory() back in sync with illumos. It is also desirable that the target amount of free memory be tunable on a system. While the default values are expected to work well for most workloads there may be cases where custom values are needed. The zfs_arc_sys_free module option was added for this purpose. zfs_arc_sys_free - The target number of bytes the ARC should leave as free memory on the system. This value can checked in /proc/spl/kstat/zfs/arcstats and setting this module option will override the default value. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3637	2015-07-30 11:50:22 -07:00
Brian Behlendorf	1229323d5f	Align thread priority with Linux defaults Under Linux filesystem threads responsible for handling I/O are normally created with the maximum priority. Non-I/O filesystem processes run with the default priority. ZFS should adopt the same priority scheme under Linux to maintain good performance and so that it will complete fairly when other Linux filesystems are active. The priorities have been updated to the following: $ ps -eLo rtprio,cls,pid,pri,nice,cmd \| egrep 'z_\|spl_\|zvol\|arc\|dbu\|meta' - TS 10743 19 -20 [spl_kmem_cache] - TS 10744 19 -20 [spl_system_task] - TS 10745 19 -20 [spl_dynamic_tas] - TS 10764 19 0 [dbu_evict] - TS 10765 19 0 [arc_prune] - TS 10766 19 0 [arc_reclaim] - TS 10767 19 0 [arc_user_evicts] - TS 10768 19 0 [l2arc_feed] - TS 10769 39 0 [z_unmount] - TS 10770 39 -20 [zvol] - TS 11011 39 -20 [z_null_iss] - TS 11012 39 -20 [z_null_int] - TS 11013 39 -20 [z_rd_iss] - TS 11014 39 -20 [z_rd_int_0] - TS 11022 38 -19 [z_wr_iss] - TS 11023 39 -20 [z_wr_iss_h] - TS 11024 39 -20 [z_wr_int_0] - TS 11032 39 -20 [z_wr_int_h] - TS 11033 39 -20 [z_fr_iss_0] - TS 11041 39 -20 [z_fr_int] - TS 11042 39 -20 [z_cl_iss] - TS 11043 39 -20 [z_cl_int] - TS 11044 39 -20 [z_ioctl_iss] - TS 11045 39 -20 [z_ioctl_int] - TS 11046 39 -20 [metaslab_group_] - TS 11050 19 0 [z_iput] - TS 11121 38 -19 [z_wr_iss] Note that under Linux the meaning of a processes priority is inverted with respect to illumos. High values on Linux indicate a _low_ priority while high value on illumos indicate a _high_ priority. In order to preserve the logical meaning of the minclsyspri and maxclsyspri macros when they are used by the illumos wrapper functions their values have been inverted. This way when changes are merged from upstream illumos we won't need to remember to invert the macro. It could also lead to confusion. This patch depends on https://github.com/zfsonlinux/spl/pull/466. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #3607	2015-07-28 13:36:47 -07:00
Brian Behlendorf	96c080cb9c	Minor style cleanup Address minor differences in style between upstream and ZoL. This patch contains no functional differences and is solely designed to minimize the delta from upstream. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3533	2015-07-23 09:42:54 -07:00
Brian Behlendorf	3056818343	Remove double counting HDR_L2ONLY_SIZE Commit `d962d5d` didn't quite properly resolve the HDR_L2ONLY_SIZE accounting. Accounting is now performed only in the constructor and destructor which is a nice simplification. It should have been removed the from create and destroy functions. This brings up back in sync with upstream. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3533	2015-07-23 09:42:44 -07:00
Brian Behlendorf	8c8af9d807	Add hdr_recl() reclaim callback Originally removed because it wasn't required under Linux. However, there may still be some utility in signaling the arc reclaim thread under Linux via reclaim. This should already have happened by other means but it's not harmless and reduces another point of divergence with upstream. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3533	2015-07-23 09:42:40 -07:00
Brian Behlendorf	728d6ae91e	Reinstate zfs_arc_p_min_shift Commit `f521ce1` removed the minimum value for "arc_p" allowing it to drop to zero or grow to "arc_c". This was done to improve specific workload which constantly dirties new "metadata" but also frequently touches a "small" amount of mfu data (e.g. mkdir's). This change may still be desirable but it needs to be re-investigated. in the context of the recent ARC changes from upstream. Therefore this code is being restored to facilitate benchmarking. By setting "zfs_arc_p_min_shift=64" we easily compare the performance. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3533	2015-07-23 09:42:32 -07:00
Prakash Surya	36da08ef9b	Illumos 5817 - change type of arcs_size from uint64_t to refcount_t 5817 change type of arcs_size from uint64_t to refcount_t Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/5817 https://github.com/illumos/illumos-gate/commit/2fd872a Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3533	2015-07-23 09:42:28 -07:00
Prakash Surya	500445c046	Illumos 5445 - Add more visibility via arcstats 5445 Add more visibility via arcstats; specifically arc_state_t stats and differentiate between "data" and "metadata" Reviewed by: Basil Crow <basil.crow@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Bayard Bell <bayard.bell@nexenta.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/5445 https://github.com/illumos/illumos-gate/commit/4076b1b Porting Notes: This patch is an improved version of `cc7f677` which was previously merged in ZoL. This patch incorporates the additional improvements which were made upstream. Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3533	2015-07-23 09:42:06 -07:00
Matthew Ahrens	ca67b33aba	Illumos 5376 - arc_kmem_reap_now() should not result in clearing arc_no_grow 5376 arc_kmem_reap_now() should not result in clearing arc_no_grow Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Reviewed by: Richard Elling <richard.elling@richardelling.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5376 https://github.com/illumos/illumos-gate/commit/2ec99e3 Porting Notes: The good news is that many of the recent changes made upstream to the ARC tackled issues previously observed by ZoL with similar solutions. The bad news is those solution weren't identical to the ones we applied. This patch is designed to split the difference and apply as much of the upstream work as possible. * The arc_available_memory() function was removed previous in ZoL but due to the upstream changes it makes sense to add it back. This function has been customized for Linux so that it can be used to determine a low memory. This provides the same basic functionality as the illumos version allowing us to minimize changes through the rest of the code base. The exact mechanism used to detect a low memory state remains unchanged so this change isn't a significant as it might first appear. * This patch includes the long standing fix for arc_shrink() which was originally proposed in #2167. Since there were related changes to this function it made sense to include that work. * The arc_init() function has been re-factored. As before it sets sane default values for the ARC but then calls arc_tuning_update() to apply user specific tuning made via module options. The arc_tuning_update() function is then called periodically by the arc_reclaim_thread() to apply changes to the tunings made during normal operation. Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3616 Closes #2167	2015-07-23 09:41:28 -07:00
Alek Pinchuk	a7b10a9319	Illumos 6033 - arc_adjust() should search MFU lists 6033 arc_adjust() should search MFU lists for oldest buffer when adjusting MFU size Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: Xin Li <delphij@delphij.net> Reviewed by: Prakash Surya <me@prakashsurya.com> Approved by: Matthew Ahrens <mahrens@delphix.com> References: https://www.illumos.org/issues/6033 https://github.com/illumos/illumos-gate/commit/31c46cf Ported-by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3545	2015-07-01 11:09:15 -07:00
Matthew Ahrens	c52fca13a0	Illumos 5368 - ARC should cache more metadata 5368 ARC should cache more metadata Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5368 https://github.com/illumos/illumos-gate/commit/3a5286a Porting Notes: The vast majority of this patch was already merged in the context of the `06358ea` changes. This is just a small hunk which was missed. Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-25 08:58:17 -07:00
George Wilson	669dedb33f	Illumos 5163 - arc should reap range_seg_cache 5163 arc should reap range_seg_cache Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5163 https://github.com/illumos/illumos-gate/commit/83803b5 Porting Notes: Added umem_cache_reap_now() wrapped to suppress unused variable warning for user space build in arc_kmem_reap_now(). Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-25 08:58:16 -07:00
Brian Behlendorf	aa9af22cdf	Update all default taskq settings Over the years the default values for the taskqs used on Linux have differed slightly from illumos. In the vast majority of cases this was done to avoid creating an obnoxious number of idle threads which would pollute the process listing. With the addition of support for dynamic taskqs all multi-threaded queues should be created as dynamic taskqs. This allows us to get the best of both worlds. * The illumos default values for the I/O pipeline can be restored. These values are known to work well for most workloads. The only exception is the zio write interrupt taskq which is changed to ZTI_P(12, 8). At least under Linux more threads has been shown to improve performance, see commit `7e55f4e`. * Reduces the number of idle threads on the system when it's not under heavy load. The maximum number of threads will only be created when they are required. * Remove the vdev_file_taskq and rely on the system_taskq instead which is now dynamic and may have up to 64-threads. Again this brings us back inline with upstream. * Tasks dispatched with taskq_dispatch_ent() are allowed to use dynamic taskqs. The Linux taskq implementation supports this. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #3507	2015-06-25 08:58:16 -07:00
Andriy Gapon	ef56b0780c	Account for ashift when gathering buffers to be written to l2arc device If we don't account for that, then we might end up overwriting disk area of buffers that have not been evicted yet, because l2arc_evict operates in terms of disk addresses. The discrepancy between the write size calculation and the actual increment to l2ad_hand was introduced in commit `3a17a7a9`. The change that introduced l2ad_hand alignment was almost correct as the write size was accumulated as a sum of rounded buffer sizes. See commit illumos/illumos-gate@e14bb32. Also, we now consistently use asize / a_sz for the allocated size and psize / p_sz for the physical size. The latter accounts for a possible size reduction because of the compression, whereas the former accounts for a possible subsequent size expansion because of the alignment requirements. The code still assumes that either underlying storage subsystems or hardware is able to do read-modify-write when an L2ARC buffer size is not a multiple of a disk's block size. This is true for 4KB sector disks that provide 512B sector emulation, but may not be true in general. In other words, we currently do not have any code to make sure that an L2ARC buffer, whether compressed or not, which is used for physical I/O has a suitable size. Note that currently the cache device utilization is calculated based on the physical size, not the allocated size. The same applies to l2_asize kstat. That is wrong, but this commit does not fix that. The accounting problem was introduced partially in commit `3a17a7a9` and partially in 3038a2b (accounting became consistent but in favour of the wrong size). Porting Notes: Reworked to be C90 compatible and the 'write_psize' variable was removed because it is now unused. References: https://reviews.csiden.org/r/229/ https://reviews.freebsd.org/D2764 Ported-by: kernelOfTruth <kerneloftruth@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3400 Closes #3433 Closes #3451	2015-06-25 08:57:16 -07:00
Prakash Surya	d962d5dad9	Illumos 5701 - zpool list reports incorrect "alloc" value for cache devices 5701 zpool list reports incorrect "alloc" value for cache devices Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5701 https://github.com/illumos/illumos-gate/commit/a52fc31 Porting Notes: arc_space_return(HDR_L2ONLY_SIZE, ARC_SPACE_L2HDRS); correctly placed at arc_hdr_l2hdr_destroy(arc_buf_hdr_t *hdr). Ported by: kernelOfTruth kerneloftruth@gmail.com Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-25 08:51:44 -07:00
Brian Behlendorf	b64ccd6c52	Rename cv_wait_interruptible() to cv_wait_sig() This is the counterpart to zfsonlinux/spl@2345368 which replaces the cv_wait_interruptible() function with cv_wait_sig(). There is no functional change to patch merely brings the function names in to sync to maximize portability. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #3450 Issue #3402	2015-06-11 10:50:47 -07:00
Tim Chase	121b3cae74	Increase arc_c_min to allow safe operation of arc_adapt() ZoL had lowered the minimum ARC size to 4MiB to better accommodate tiny systems such as the raspberry pi, however, as of addition of large block support, the arc_adapt() function depends on arc_c being >= 32MiB (2 * SPA_MAXBLOCKSIZE). This patch raises the minimum ARC size to 32MiB and adds a VERIFY test to arc_adapt() for future-proofing. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-11 10:27:25 -07:00
Brian Behlendorf	f604673836	Make arc_prune() asynchronous As described in the comment above arc_adapt_thread() it is critical that the arc_adapt_thread() function never sleep while holding a hash lock. This behavior was possible in the Linux implementation because the arc_prune() logic was implemented to be synchronous. Under illumos the analogous dnlc_reduce_cache() function is asynchronous. To address this the arc_do_user_prune() function is has been reworked in to two new functions as follows: * arc_prune_async() is an asynchronous implementation which dispatches the prune callback to be run by the system taskq. This makes it suitable to use in the context of the arc_adapt_thread(). * arc_prune() is a synchronous implementation which depends on the arc_prune_async() implementation but blocks until the outstanding callbacks complete. This is used in arc_kmem_reap_now() where it is safe, and expected, that memory will be freed. This patch additionally adds the zfs_arc_meta_strategy module option while allows the meta reclaim strategy to be configured. It defaults to a balanced strategy which has been proved to work well under Linux but the illumos meta-only strategy can be enabled. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-11 10:27:25 -07:00
Prakash Surya	ca0bf58d65	Illumos 5497 - lock contention on arcs_mtx Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Approved by: Dan McDonald <danmcd@omniti.com> Porting notes and other significant code changes: The illumos 5368 patch (ARC should cache more metadata), which was never picked up by ZoL, is mostly reverted by this patch. Since ZoL relies on the kernel asynchronously calling the shrinker to actually reap memory, the shrinker wakes up arc_reclaim_waiters_cv every time it runs. The arc_adapt_thread() function no longer calls arc_do_user_evicts() since the newly-added arc_user_evicts_thread() calls it periodically. Notable conflicting ZoL commits which conflicted with this patch or whose effects are either duplicated or un-done by this patch: `302f753` - Integrate ARC more tightly with Linux `39e055c` - Adjust arc_p based on "bytes" in arc_shrink `f521ce1` - Allow "arc_p" to drop to zero or grow to "arc_c" `77765b5` - Remove "arc_meta_used" from arc_adjust calculation `94520ca` - Prune metadata from ghost lists in arc_adjust_meta Trace support for multilist_insert() and multilist_remove() has been added and produces the following output: fio-12498 [077] .... 112936.448324: zfs_multilist__insert: ml { offset 240 numsublists 80 sublistidx 63 } fio-12498 [077] .... 112936.448347: zfs_multilist__remove: ml { offset 240 numsublists 80 sublistidx 29 } The following arcstats have been removed: recycle_miss - Used by arcstat.py and arc_summary.py, both of which have been updated appropriately. l2_writes_hdr_miss The following arcstats have been added: evict_not_enough - Number of times arc_evict_state() was unable to evict enough buffers to reach its target amount. evict_l2_skip - Number of times arc_evict_hdr() skipped eviction because it was being written to the l2arc. l2_writes_lock_retry - Replaces l2_writes_hdr_miss. Number of times l2arc_write_done() failed to acquire hash_lock (and re-tries). arc_meta_min - Shows the value of the zfs_arc_meta_min module parameter (see below). The "index" column of the "dbuf" kstat has been removed since it doesn't have a direct analog in the new multilist scheme. Additional multilist- related stats could be added in the future but would likely require extensions to the mulilist API. The following module parameters have been added: zfs_arc_evict_batch_limit - Number of ARC headers to free per sub-list before moving on to the next sub-list. zfs_arc_meta_min - Enforce a floor on the amount of metadata in the ARC. zfs_arc_num_sublists_per_state - Number of multilist sub-lists per ARC state. zfs_arc_overflow_shift - Controls amount by which the ARC must exceed the target size to be considered "overflowing". Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov	2015-06-11 10:27:25 -07:00
Chris Williamson	b9541d6b7d	Illumos 5408 - managing ZFS cache devices requires lots of RAM 5408 managing ZFS cache devices requires lots of RAM Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Don Brady <dev.fs.zfs@gmail.com> Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Approved by: Garrett D'Amore <garrett@damore.org> Porting notes: Due to the restructuring of the ARC-related structures, this patch conflicts with at least the following existing ZoL commits: `6e1d7276c9` Fix inaccurate arcstat_l2_hdr_size calculations The ARC_SPACE_HDRS constant no longer exists and has been somewhat equivalently replaced by HDR_L2ONLY_SIZE. `e0b0ca983d` Add visibility in to cached dbufs The new layering of l{1,2}arc_buf_hdr_t within the arc_buf_hdr struct requires additional structure member names to be used when referencing the inner items. Also, the presence of L1 or L2 inner member is indicated by flags using the new HDR_HAS_L{1,2}HDR macros. Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-11 10:27:25 -07:00
George Wilson	2a4324141f	Illumos 5369 - arc flags should be an enum 5369 arc flags should be an enum 5370 consistent arc_buf_hdr_t naming scheme Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Approved by: Richard Lowe <richlowe@richlowe.net> Porting notes: ZoL has moved some ARC definitions into arc_impl.h. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Ported by: Tim Chase <tim@chase2k.com>	2015-06-11 10:27:25 -07:00
Tim Chase	ad4af89561	Partially revert "Add ddt, ddt_entry, and l2arc_hdr caches" This reverts only the l2arc_hdr part of commit `ecf3d9b8e6` in preparation for the illumos 5497 "lock contention on arcs_mtx" patch which does the same thing but uses the newer two-level ARC structure following the Illumos 5408 "managing ZFS cache devices requires lots of RAM" patch. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-11 10:27:25 -07:00
Tim Chase	97639d0a52	Revert "Allow arc_evict_ghost() to only evict meta data" Illumos 5497 "lock contention on arcs_mtx" reworks eviction and obviates the need for this. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-11 10:27:25 -07:00
Tim Chase	f6b3b1f5d6	Revert "fix l2arc compression buffers leak" This reverts commit `037763e44e` in preparation for the illumos 5497 "lock contention on arcs_mtx" patch which includes a fix for this very problem. ZoL had picked up a subset of the illumos 5497 patch to deal with the l2arc compression buffer leak. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-11 10:27:24 -07:00
Tim Chase	7807028ccd	Revert "arc_evict, arc_evict_ghost: reduce stack usage using kmem_zalloc" This reverts commit `16fcdea363` in preparation for the illumos 5497 "lock contention on arcs_mtx" patch which eliminates "marker" within the ARC code. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2015-06-11 10:27:24 -07:00
Matthew Ahrens	f1512ee61e	Illumos 5027 - zfs large block support 5027 zfs large block support Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/5027 https://github.com/illumos/illumos-gate/commit/b515258 Porting Notes: * Included in this patch is a tiny ISP2() cleanup in zio_init() from Illumos 5255. * Unlike the upstream Illumos commit this patch does not impose an arbitrary 128K block size limit on volumes. Volumes, like filesystems, are limited by the zfs_max_recordsize=1M module option. * By default the maximum record size is limited to 1M by the module option zfs_max_recordsize. This value may be safely increased up to 16M which is the largest block size supported by the on-disk format. At the moment, 1M blocks clearly offer a significant performance improvement but the benefits of going beyond this for the majority of workloads are less clear. * The illumos version of this patch increased DMU_MAX_ACCESS to 32M. This was determined not to be large enough when using 16M blocks because the zfs_make_xattrdir() function will fail (EFBIG) when assigning a TX. This was immediately observed under Linux because all newly created files must have a security xattr created and that was failing. Therefore, we've set DMU_MAX_ACCESS to 64M. * On 32-bit platforms a hard limit of 1M is set for blocks due to the limited virtual address space. We should be able to relax this one the ABD patches are merged. Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #354	2015-05-11 12:23:16 -07:00
Chris Dunlop	16fcdea363	arc_evict, arc_evict_ghost: reduce stack usage using kmem_zalloc With debugging enabled and depending on your kernel config, the size of arc_buf_hdr_t can blow out the stack of arc_evict() and arc_evict_ghost() to greater than 1024 bytes. Let's avoid this. Signed-off-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3377	2015-05-08 14:14:35 -07:00
Tim Chase	40d06e3c78	Mark all ZPL and ioctl functions as PF_FSTRANS Prevent deadlocks by disabling direct reclaim during all ZPL and ioctl calls as well as the l2arc and adapt ARC threads. This obviates the need for MUTEX_FSTRANS so its previous uses and definition have been eliminated. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3225	2015-04-03 11:38:59 -07:00
Brian Behlendorf	bc88866657	Fix arc_adjust_meta() behavior The goal of this function is to evict enough meta data buffers from the ARC in order to enforce the arc_meta_limit. Achieving this is slightly more complicated than it appears because it is common for data buffers to have holds on meta data buffers. In addition, dnode meta data buffers will be held by the dnodes in the block preventing them from being freed. This means we can't simply traverse the ARC and expect to always find enough unheld meta data buffer to release. Therefore, this function has been updated to make alternating passes over the ARC releasing data buffers and then newly unheld meta data buffers. This ensures forward progress is maintained and arc_meta_used will decrease. Normally this is sufficient, but if required the ARC will call the registered prune callbacks causing dentry and inodes to be dropped from the VFS cache. This will make dnode meta data buffers available for reclaim. The number of total restarts in limited by zfs_arc_meta_adjust_restarts to prevent spinning in the rare case where all meta data is pinned. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pavel Snajdr <snajpa@snajpa.net> Issue #3160	2015-03-20 10:35:20 -07:00
Brian Behlendorf	2cbb06b561	Restructure per-filesystem reclaim Originally when the ARC prune callback was introduced the idea was to register a single callback for the ZPL. The ARC could invoke this call back if it needed the ZPL to drop dentries, inodes, or other cache objects which might be pinning buffers in the ARC. The ZPL would iterate over all ZFS super blocks and perform the reclaim. For the most part this design has worked well but due to limitations in 2.6.35 and earlier kernels there were some problems. This patch is designed to address those issues. 1) iterate_supers_type() is not provided by all kernels which makes it impossible to safely iterate over all zpl_fs_type filesystems in a single callback. The most straight forward and portable way to resolve this is to register a callback per-filesystem during mount. The arc_*_prune_callback() functions have always supported multiple callbacks so this is functionally a very small change. 2) Commit `050d22b` removed the non-portable shrink_dcache_memory() and shrink_icache_memory() functions and didn't replace them with equivalent functionality. This meant that for Linux 3.1 and older kernels the ARC had no mechanism to drop dentries and inodes from the caches if needed. This patch adds that missing functionality by calling shrink_dcache_parent() to release dentries which may be pinning inodes. This will result in all unused cache entries being dropped which is a bit heavy handed but it's the only interface available for old kernels. 3) A zpl_drop_inode() callback is registered for kernels older than 2.6.35 which do not support the .evict_inode callback. This ensures that when the last reference on an inode is dropped it is immediately removed from the cache. If this isn't done than inode can end up on the global unused LRU with no mechanism available to ZFS to drop them. Since the ARC buffers are not dropped the hottest inodes can still be recreated without performing disk IO. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Pavel Snajdr <snajpa@snajpa.net> Issue #3160	2015-03-20 10:35:20 -07:00

1 2 3 4

164 Commits