Fletcher4 implementation using avx512f instruction set

The algorithm runs 8 parallel sums, consuming 8 uint32_t elements per
loop iteration. The size alignment of the main fletcher4 methods is adjusted
accordingly. The new implementation is called 'avx512f'.
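
As a rough illustration of the parallel-sum idea, the following portable C sketch keeps 8 independent lanes of Fletcher4 accumulators and folds them back into the four scalar sums at the end. It is not the code from this commit: the names `fletcher4_sums_t`, `fletcher4_scalar_sketch`, and `fletcher4_lanes8_sketch` are hypothetical, plain arrays stand in for 512-bit registers, and the buffer length is assumed to be a multiple of 8 words. The recombination weights follow from the position of each word in the stream (word `i` of a group sits `i` slots later than lane 0).

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 8	/* number of 64-bit lanes in one 512-bit register */

/* Hypothetical result type: the four running Fletcher4 sums. */
typedef struct { uint64_t a, b, c, d; } fletcher4_sums_t;

/* Reference scalar Fletcher4 over nwords 32-bit words. */
static fletcher4_sums_t
fletcher4_scalar_sketch(const uint32_t *w, size_t nwords)
{
	fletcher4_sums_t s = { 0, 0, 0, 0 };

	for (size_t k = 0; k < nwords; k++) {
		s.a += w[k];
		s.b += s.a;
		s.c += s.b;
		s.d += s.c;
	}
	return (s);
}

/*
 * 8-lane variant: each loop iteration consumes 8 words, one per lane.
 * Assumes nwords is a multiple of LANES; trailing words are ignored.
 */
static fletcher4_sums_t
fletcher4_lanes8_sketch(const uint32_t *w, size_t nwords)
{
	uint64_t a[LANES] = { 0 }, b[LANES] = { 0 };
	uint64_t c[LANES] = { 0 }, d[LANES] = { 0 };
	fletcher4_sums_t s = { 0, 0, 0, 0 };

	for (size_t k = 0; k + LANES <= nwords; k += LANES) {
		/* With AVX-512F this inner loop is a handful of vector adds. */
		for (size_t i = 0; i < LANES; i++) {
			a[i] += w[k + i];
			b[i] += a[i];
			c[i] += b[i];
			d[i] += c[i];
		}
	}

	/*
	 * Fold the lanes into scalar sums. Arithmetic is modulo 2^64, so
	 * the unsigned subtractions wrap exactly as the scalar sums do.
	 * For i < 2 (or i < 3) the binomial-style factors evaluate to 0.
	 */
	for (uint64_t i = 0; i < LANES; i++) {
		s.a += a[i];
		s.b += 8 * b[i] - i * a[i];
		s.c += 64 * c[i] - (28 + 8 * i) * b[i] +
		    (i * (i - 1) / 2) * a[i];
		s.d += 512 * d[i] - (448 + 64 * i) * c[i] +
		    (56 + 24 * i + 4 * i * i) * b[i] -
		    (i * (i - 1) * (i - 2) / 6) * a[i];
	}
	return (s);
}
```

Under these assumptions both functions return identical sums for any buffer whose length is a multiple of 8 words, which is why the lane count dictates the size alignment mentioned above.
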

Note: the byteswap method can be implemented more efficiently once avx512bw
hardware becomes available. Currently, it is about 2x slower than the native
method.

The table shows results of the full (native) fletcher4 calculation for different buffer sizes:

fletcher4   4KB     16KB    64KB    128KB   256KB   1MB     16MB
--------------------------------------------------------------------
[scalar]    1213    1228    1231    1231    1225    1200    1160
[sse2]      2374    2442    2459    2456    2462    2250    2220
[avx2]      4288    4753    4871    4893    4900    4050    3882
[avx512f]   5975    8445    9196    9221    9262    6307    5620

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4952
Gvozden Neskovic
2016-07-06 13:42:04 +02:00
committed by Brian Behlendorf
parent 32ffaa3de5
commit 70b258fc96
6 changed files with 182 additions and 10 deletions
+8 -8
@@ -883,14 +883,14 @@ Default value: \fB67,108,864\fR.
 Select a fletcher 4 implementation.
 .sp
 Supported selectors are: \fBfastest\fR, \fBscalar\fR, \fBsse2\fR, \fBssse3\fR,
-and \fBavx2\fR. All of the selectors except \fBfastest\fR and \fBscalar\fR
-require instruction set extensions to be available and will only appear if ZFS
-detects that they are present at runtime. If multiple implementations of
-fletcher 4 are available, the \fBfastest\fR will be chosen using a micro
-benchmark. Selecting \fBscalar\fR results in the original CPU based calculation
-being used. Selecting any option other than \fBfastest\fR and \fBscalar\fR
-results in vector instructions from the respective CPU instruction set being
-used.
+\fBavx2\fR, and \fBavx512f\fR.
+All of the selectors except \fBfastest\fR and \fBscalar\fR require instruction
+set extensions to be available and will only appear if ZFS detects that they are
+present at runtime. If multiple implementations of fletcher 4 are available,
+the \fBfastest\fR will be chosen using a micro benchmark. Selecting \fBscalar\fR
+results in the original, CPU based calculation, being used. Selecting any option
+other than \fBfastest\fR and \fBscalar\fR results in vector instructions from
+the respective CPU instruction set being used.
 .sp
 Default value: \fBfastest\fR.
 .RE