Fletcher4 implementation using avx512f instruction set

The algorithm runs 8 parallel sums, consuming 8x uint32_t elements per loop
iteration. Size alignment of the main fletcher4 methods is adjusted
accordingly. The new implementation is called 'avx512f'.

Note: the byteswap method can be implemented more efficiently when avx512bw
hardware becomes available. Currently, it is ~2x slower than the native
method.

The table shows results of the full (native) fletcher4 calculation for
different buffer sizes:

fletcher4       4KB     16KB    64KB    128KB   256KB   1MB     16MB
--------------------------------------------------------------------
[scalar]        1213    1228    1231    1231    1225    1200    1160
[sse2]          2374    2442    2459    2456    2462    2250    2220
[avx2]          4288    4753    4871    4893    4900    4050    3882
[avx512f]       5975    8445    9196    9221    9262    6307    5620

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4952
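
For readers unfamiliar with the "8 parallel sums" idea, the following is a
minimal portable C sketch, not the code from this commit. It assumes the
usual Fletcher-4 recurrence (a += f; b += a; c += b; d += c over 32-bit
words with 64-bit accumulators). The function names fletcher4_scalar and
fletcher4_lanes8 are hypothetical, and the recombination coefficients were
derived here from that recurrence rather than copied from the ZFS sources.
The inner loop over j stands in for a single 512-bit vector operation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reference scalar Fletcher-4: four running 64-bit sums over 32-bit words. */
static void
fletcher4_scalar(const uint32_t *buf, size_t nwords, uint64_t out[4])
{
	uint64_t a = 0, b = 0, c = 0, d = 0;

	for (size_t i = 0; i < nwords; i++) {
		a += buf[i];
		b += a;
		c += b;
		d += c;
	}
	out[0] = a; out[1] = b; out[2] = c; out[3] = d;
}

/*
 * Hypothetical 8-lane variant: lane j consumes words j, j+8, j+16, ...
 * Each lane keeps its own (a, b, c, d); the lanes are recombined at the
 * end. All arithmetic wraps modulo 2^64, as in the scalar version.
 */
static void
fletcher4_lanes8(const uint32_t *buf, size_t nwords, uint64_t out[4])
{
	/* tri[j] = j*(j-1)/2 and tet[j] = j*(j-1)*(j-2)/6 */
	static const uint64_t tri[8] = { 0, 0, 1, 3, 6, 10, 15, 21 };
	static const uint64_t tet[8] = { 0, 0, 0, 1, 4, 10, 20, 35 };
	uint64_t a[8] = { 0 }, b[8] = { 0 }, c[8] = { 0 }, d[8] = { 0 };
	uint64_t A = 0, B = 0, C = 0, D = 0;

	/* The lane layout requires the size to be a multiple of 8 words. */
	assert(nwords % 8 == 0);

	for (size_t i = 0; i < nwords; i += 8) {
		/* One iteration of this inner loop models one vector step. */
		for (size_t j = 0; j < 8; j++) {
			a[j] += buf[i + j];
			b[j] += a[j];
			c[j] += b[j];
			d[j] += c[j];
		}
	}

	/* Recombine the per-lane sums into the scalar-equivalent result. */
	for (uint64_t j = 0; j < 8; j++) {
		A += a[j];
		B += 8 * b[j] - j * a[j];
		C += 64 * c[j] - (28 + 8 * j) * b[j] + tri[j] * a[j];
		D += 512 * d[j] - (448 + 64 * j) * c[j] +
		    (56 + 24 * j + 4 * j * j) * b[j] - tet[j] * a[j];
	}
	out[0] = A; out[1] = B; out[2] = C; out[3] = D;
}
```

The multiple-of-8 requirement in the sketch mirrors the size alignment
adjustment mentioned in the commit message; how the real code handles
alignment is not shown here.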
2026-05-24 11:18:52 +03:00 · 2016-07-06 13:42:04 +02:00
parent 32ffaa3de5
commit 70b258fc96
6 changed files with 182 additions and 10 deletions
@@ -883,14 +883,14 @@ Default value: \fB67,108,864\fR.
 Select a fletcher 4 implementation.
 .sp
 Supported selectors are: \fBfastest\fR, \fBscalar\fR, \fBsse2\fR, \fBssse3\fR,
-and \fBavx2\fR. All of the selectors except \fBfastest\fR and \fBscalar\fR
-require instruction set extensions to be available and will only appear if ZFS
-detects that they are present at runtime. If multiple implementations of
-fletcher 4 are available, the \fBfastest\fR will be chosen using a micro
-benchmark. Selecting \fBscalar\fR results in the original CPU based calculation
-being used. Selecting any option other than \fBfastest\fR and \fBscalar\fR
-results in vector instructions from the respective CPU instruction set being
-used.
+\fBavx2\fR, and \fBavx512f\fR.
+All of the selectors except \fBfastest\fR and \fBscalar\fR require instruction
+set extensions to be available and will only appear if ZFS detects that they are
+present at runtime. If multiple implementations of fletcher 4 are available,
+the \fBfastest\fR will be chosen using a micro benchmark. Selecting \fBscalar\fR
+results in the original, CPU based calculation, being used. Selecting any option
+other than \fBfastest\fR and \fBscalar\fR results in vector instructions from
+the respective CPU instruction set being used.
 .sp
 Default value: \fBfastest\fR.
 .RE