mirror_zfs

mirror of https://git.proxmox.com/git/mirror_zfs.git synced 2024-12-27 03:19:35 +03:00

Author	SHA1	Message	Date
Brian Behlendorf	e26ade5101	Fix zvol+btrfs hang When using a zvol to back a btrfs filesystem the btrfs mount would hang. This was due to the bio completion callback used in btrfs assuming that lower level drivers would never modify the bio->bi_io_vec after they were submitted via submit_bio(). If they are modified btrfs will miscalculate which pages need to be unlocked resulting in a hang. It's worth mentioning that other file systems such as ext[234] and xfs work fine because they do not make the same assumption in the bio completion callback. The most straightforward way to fix the issue is to present the semantics expected by btrfs. This is done by cloning the bios attached to each request and then using the clones' bvecs to perform the required accounting. The clones are freed after each read/write and the original unmodified bios are linked back in to the request. Signed-off-by: Chris Wedgwood <cw@f00f.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #469	2012-11-09 12:24:51 -08:00
Etienne Dechamps	920dd524fb	Add FASTWRITE algorithm for synchronous writes. Currently, ZIL blocks are spread over vdevs using hint block pointers managed by the ZIL commit code and passed to metaslab_alloc(). Spreading log blocks across vdevs is important for performance: indeed, using multiple disks in parallel decreases the ZIL commit latency, which is the main performance metric for synchronous writes. However, the current implementation suffers from the following issues: 1) It would be best if the ZIL module was not aware of such low-level details. They should be handled by the ZIO and metaslab modules; 2) Because the hint block pointer is managed per log, simultaneous commits from multiple logs might use the same vdevs at the same time, which is inefficient; 3) Because dmu_write() does not honor the block pointer hint, indirect writes are not spread. The naive solution of rotating the metaslab rotor each time a block is allocated for the ZIL or dmu_sync() doesn't work in practice because the first ZIL block to be written is actually allocated during the previous commit. Consequently, when metaslab_alloc() decides the vdev for this block, it will do so while a bunch of other allocations are happening at the same time (from dmu_sync() and other ZILs). This means the vdev for this block is chosen more or less at random. When the next commit happens, there is a high chance (especially when the number of blocks per commit is slightly less than the number of the disks) that one disk will have to write two blocks (with a potential seek) while other disks are sitting idle, which defeats spreading and increases the commit latency. This commit introduces a new concept in the metaslab allocator: fastwrites. Basically, each top-level vdev maintains a counter indicating the number of synchronous writes (from dmu_sync() and the ZIL) which have been allocated but not yet completed. 
When the metaslab is called with the FASTWRITE flag, it will choose the vdev with the fewest pending synchronous writes. If there are multiple vdevs with the same value, the first matching vdev (starting from the rotor) is used. Once metaslab_alloc() has decided which vdev the block is allocated to, it updates the fastwrite counter for this vdev. The rationale goes like this: when an allocation is done with FASTWRITE, it "reserves" the vdev until the data is written. Until then, all future allocations will naturally avoid this vdev, even after a full rotation of the rotor. As a result, pending synchronous writes at a given point in time will be nicely spread over all vdevs. This contrasts with the previous algorithm, which is based on the implicit assumption that blocks are written instantaneously after they're allocated. metaslab_fastwrite_mark() and metaslab_fastwrite_unmark() are used to manually increase or decrease fastwrite counters, respectively. They should be used with caution, as there is no per-BP tracking of fastwrite information, so leaks and "double-unmarks" are possible. There is, however, an assert in the vdev teardown code which will fire if the fastwrite counters are not zero when the pool is exported or the vdev removed. Note that as stated above, marking is also done implicitly by metaslab_alloc(). ZIO also got a new FASTWRITE flag; when it is used, ZIO will pass it to the metaslab when allocating (assuming ZIO does the allocation, which is only true in the case of dmu_sync). This flag will also trigger an unmark when zio_done() fires. A side-effect of the new algorithm is that when a ZIL stops being used, its last block can stay in the pending state (allocated but not yet written) for a long time, polluting the fastwrite counters. To avoid that, I've implemented a somewhat crude but working solution which unmarks these pending blocks in zil_sync(), thus guaranteeing that lingering fastwrites will get pruned at each sync event. 
The best performance improvements are observed with pools using a large number of top-level vdevs and heavy synchronous write workflows (especially indirect writes and concurrent writes from multiple ZILs). Real-life testing shows a 200% to 300% performance increase with indirect writes and various commit sizes. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1013	2012-10-17 08:56:41 -07:00
Richard Yao	b8d06fca08	Switch KM_SLEEP to KM_PUSHPAGE Differences between how paging is done on Solaris and Linux can cause deadlocks if KM_SLEEP is used in any of the following contexts. * The txg_sync thread * The zvol write/discard threads * The zpl_putpage() VFS callback This is because KM_SLEEP will allow for direct reclaim which may result in the VM calling back in to the filesystem or block layer to write out pages. If a lock is held over this operation the potential exists to deadlock the system. To ensure forward progress all memory allocations in these contexts must use KM_PUSHPAGE which disables performing any I/O to accomplish the memory allocation. Previously, this behavior was achieved by setting PF_MEMALLOC on the thread. However, that resulted in unexpected side effects such as the exhaustion of pages in ZONE_DMA. This approach touches more of the zfs code, but it is more consistent with the right way to handle these cases under Linux. This patch lays the groundwork for being able to safely revert the following commits which used PF_MEMALLOC: `21ade34` Disable direct reclaim for z_wr_* threads `cfc9a5c` Fix zpl_writepage() deadlock `eec8164` Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool)) Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #726	2012-08-27 12:01:37 -07:00
Brian Behlendorf	afec56b43f	Add zfs_mdcomp_disable module option Expose the zfs_mdcomp_disable variable as a module option. This can be used to disable compression of zfs metadata which is enabled by default. This shouldn't need to be tuned for most workloads; however, there may be very specific instances where it makes sense to trade disk capacity for extra cpu cycles. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-04-27 16:28:02 -07:00
Brian Behlendorf	570827e129	Add 'dmu_tx' kstats entry Keep counters for the various reasons that a thread may end up in txg_wait_open() waiting on a new txg. This can be useful when attempting to determine why a particular workload is underperforming. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-02-27 08:59:10 -08:00
Alex Zhuravlev	a473d90cee	Export symbols for zero-copy Export additional symbols to make use of the DMU's zero-copy API. This allows external modules to move data in to and out of the ARC without incurring the cost of a memory copy. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-02-17 12:43:02 -08:00
Brian Behlendorf	b10c77f70a	Export symbols for zero-copy Exported the required symbols to make use of the DMU's zero-copy API. This allows external modules to move data in to and out of the ARC without incurring the cost of a memory copy. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2012-02-10 11:56:55 -08:00
Brian Behlendorf	4db77a74a6	Suppress large kmem_alloc() warning The following warning was observed under normal operation. It's not fatal but it's something to be addressed long term. Flag the offending allocation with KM_NODEBUG to suppress the warning and flag the call site. SPL: Showing stack for process 21761 Pid: 21761, comm: iozone Tainted: P ---------------- 2.6.32-71.14.1.el6.x86_64 #1 Call Trace: [<ffffffffa05465a7>] spl_debug_dumpstack+0x27/0x40 [spl] [<ffffffffa054a84d>] kmem_alloc_debug+0x11d/0x130 [spl] [<ffffffffa05de166>] dmu_buf_hold_array_by_dnode+0xa6/0x4e0 [zfs] [<ffffffffa05de825>] dmu_buf_hold_array+0x65/0x90 [zfs] [<ffffffffa05de891>] dmu_read_uio+0x41/0xd0 [zfs] [<ffffffffa0654827>] zfs_read+0x147/0x470 [zfs] [<ffffffffa06644a2>] zpl_read_common+0x52/0x70 [zfs] [<ffffffffa0664503>] zpl_read+0x43/0x70 [zfs] [<ffffffff8116d905>] vfs_read+0xb5/0x1a0 [<ffffffff8116da41>] sys_read+0x51/0x90 [<ffffffff81013172>] system_call_fastpath+0x16/0x1b	2011-02-10 09:27:22 -08:00
Brian Behlendorf	6149f4c45f	Remove dmu_write_pages() support For the moment we do not use dmu_write_pages() to write pages directly in to a dmu object. It may be required at some point in the future, but for now it is simplest and cleanest to drop it. It can be easily re-added if/when needed.	2011-02-10 09:27:21 -08:00
Brian Behlendorf	872e8d2697	Add initial rw_uio functions to the dmu These functions were dropped originally because I felt they would need to be rewritten anyway to avoid using uios. However, this patch re-adds them with the idea they can just be reworked and the uio bits dropped.	2011-02-04 16:14:34 -08:00
Brian Behlendorf	c28b227942	Add linux kernel module support Set up linux kernel module support. This includes: - zfs context for kernel/user - kernel module build system integration - kernel module macros - kernel module symbol export - kernel module options Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 13:41:58 -07:00
Brian Behlendorf	60101509ee	Add linux kernel disk support Native Linux vdev disk interfaces Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 13:41:57 -07:00
Brian Behlendorf	59e6e7ca85	Fix kstat xuio Move xuio stat structures from a header to the dmu.c source as is done with all the other kstat interfaces. This information is local to dmu.c, which registers the xuio kstat, and should stay that way. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 08:38:45 -07:00
Brian Behlendorf	d4ed667343	Fix gcc uninitialized variable warnings Gcc -Wall warn: 'uninitialized variable' Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-31 08:38:43 -07:00
Brian Behlendorf	d6320ddb78	Fix gcc c90 compliance warnings Fix non-c90 compliant code, for the most part these changes simply deal with where a particular variable is declared. Under c90 it must always be done at the very start of a block. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2010-08-27 15:28:32 -07:00
Brian Behlendorf	572e285762	Update to onnv_147 This is the last official OpenSolaris tag before the public development tree was closed.	2010-08-26 14:24:34 -07:00
Brian Behlendorf	428870ff73	Update core ZFS code from build 121 to build 141.	2010-05-28 13:45:14 -07:00
Brian Behlendorf	45d1cae3b8	Rebase master to b121	2009-08-18 11:43:27 -07:00
Brian Behlendorf	9babb37438	Rebase master to b117	2009-07-02 15:44:48 -07:00
Brian Behlendorf	172bb4bd5e	Move the world out of /zfs/ and separate out module build tree	2008-12-11 11:08:09 -08:00

20 Commits
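
Note on commit 920dd524fb (FASTWRITE): the selection rule described in that message (pick the top-level vdev with the fewest pending synchronous writes, break ties in favor of the first match starting from the rotor, then mark the chosen vdev) can be sketched in a few lines of C. This is only an illustrative sketch, not the code from the commit: the type sketch_vdev_t and the functions sketch_fastwrite_alloc() and sketch_fastwrite_done() are hypothetical names invented here, and the real allocator works on metaslab groups and handles many more conditions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a top-level vdev; only the counter matters here. */
typedef struct sketch_vdev {
	uint64_t pending_fastwrites;	/* sync writes allocated, not yet done */
} sketch_vdev_t;

/*
 * Choose the vdev with the fewest pending fastwrites. The scan starts at
 * the rotor and only switches on a strictly smaller count, so ties go to
 * the first matching vdev at or after the rotor. The chosen vdev is
 * marked (its counter is incremented) before returning its index.
 */
static size_t
sketch_fastwrite_alloc(sketch_vdev_t *vdevs, size_t nvdevs, size_t rotor)
{
	size_t best, i;

	assert(nvdevs > 0);
	best = rotor % nvdevs;
	for (i = 1; i < nvdevs; i++) {
		size_t cand = (rotor + i) % nvdevs;

		if (vdevs[cand].pending_fastwrites <
		    vdevs[best].pending_fastwrites)
			best = cand;
	}
	vdevs[best].pending_fastwrites++;
	return (best);
}

/* Unmark a vdev once its write completes (assumed to mirror zio_done()). */
static void
sketch_fastwrite_done(sketch_vdev_t *vdevs, size_t idx)
{
	assert(vdevs[idx].pending_fastwrites > 0);
	vdevs[idx].pending_fastwrites--;
}
```

With three idle vdevs and the rotor fixed at index 1, successive calls to sketch_fastwrite_alloc() would pick vdevs 1, 2, and 0 before returning to 1, which is the "reserve until written" spreading the commit message describes.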