The read bandwidth of an N-way mirror can by increased by 50%,
and the IOPs by 10%, by more carefully selecting the preferred
leaf vdev.
The existing algorthm selects a perferred leaf vdev based on
offset of the zio request modulo the number of members in the
mirror. It assumes the drives are of equal performance and
that spreading the requests randomly over both drives will be
sufficient to saturate them. In practice this results in the
leaf vdevs being under utilized.
Utilization can be improved by preferentially selecting the leaf
vdev with the least pending IO. This prevents leaf vdevs from
being starved and compensates for performance differences between
disks in the mirror. Faster vdevs will be sent more work and
the mirror performance will not be limitted by the slowest drive.
In the common case where all the pending queues are full and there
is no single least busy leaf vdev a batching stratagy is employed.
Of the N least busy vdevs one is selected with equal probability
to be the preferred vdev for T microseconds. Compared to randomly
selecting a vdev to break the tie batching the requests greatly
improves the odds of merging the requests in the Linux elevator.
The testing results show a significant performance improvement
for all four workloads tested. The workloads were generated
using the fio benchmark and are as follows.
1) 1MB sequential reads from 16 threads to 16 files (MB/s).
2) 4KB sequential reads from 16 threads to 16 files (MB/s).
3) 1MB random reads from 16 threads to 16 files (IOP/s).
4) 4KB random reads from 16 threads to 16 files (IOP/s).
| Pristine | With 1461 |
| Sequential Random | Sequential Random |
| 1MB 4KB 1MB 4KB | 1MB 4KB 1MB 4KB |
| MB/s MB/s IO/s IO/s | MB/s MB/s IO/s IO/s |
---------------+-----------------------+------------------------+
2 Striped | 226 243 11 304 | 222 255 11 299 |
2 2-Way Mirror | 302 324 16 534 | 433 448 23 571 |
2 3-Way Mirror | 429 458 24 714 | 648 648 41 808 |
2 4-Way Mirror | 562 601 36 849 | 816 828 82 926 |
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1461
Differences between how paging is done on Solaris and Linux can cause
deadlocks if KM_SLEEP is used in any the following contexts.
* The txg_sync thread
* The zvol write/discard threads
* The zpl_putpage() VFS callback
This is because KM_SLEEP will allow for direct reclaim which may result
in the VM calling back in to the filesystem or block layer to write out
pages. If a lock is held over this operation the potential exists to
deadlock the system. To ensure forward progress all memory allocations
in these contexts must us KM_PUSHPAGE which disables performing any I/O
to accomplish the memory allocation.
Previously, this behavior was acheived by setting PF_MEMALLOC on the
thread. However, that resulted in unexpected side effects such as the
exhaustion of pages in ZONE_DMA. This approach touchs more of the zfs
code, but it is more consistent with the right way to handle these cases
under Linux.
This is patch lays the ground work for being able to safely revert the
following commits which used PF_MEMALLOC:
21ade34 Disable direct reclaim for z_wr_* threads
cfc9a5c Fix zpl_writepage() deadlock
eec8164 Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool))
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
Fix non-c90 compliant code, for the most part these changes
simply deal with where a particular variable is declared.
Under c90 it must alway be done at the very start of a block.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>