Rework of fletcher_4 module

- Benchmark memory block is increased to 128kiB to reflect real block sizes more accurately. Measurements include all three stages needed for checksum generation, i.e. `init()/compute()/fini()`. The inner loop is repeated multiple times to offset overhead of time function.

- Fastest implementation selects native and byteswap methods independently in benchmark. To support this new function pointers `init_byteswap()/fini_byteswap()` are introduced.

- Implementation mutex lock is replaced by atomic variable.

- To save time, benchmark is not executed in userspace. Instead, highest supported implementation is used for fastest. Default userspace selector is still 'cycle'.

- `fletcher_4_native/byteswap()` methods use incremental methods to finish calculation if data size is not multiple of vector stride (currently 64B).

- Added `fletcher_4_native_varsize()` special purpose method for use when buffer size is not known in advance. The method does not enforce 4B alignment on buffer size, and will ignore last (size % 4) bytes of the data buffer.

- Benchmark `kstat` is changed to match the one of vdev_raidz. It now shows throughput for all supported implementations (in B/s), native and byteswap, as well as the code [fastest] is running.

Example of `fletcher_4_bench` running on `Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz`:

    implementation   native         byteswap
    scalar           4768120823     3426105750
    sse2             7947841777     4318964249
    ssse3            7951922722     6112191941
    avx2             13269714358    11043200912
    fastest          avx2           avx2

Example of `fletcher_4_bench` running on `Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz`:

    implementation   native         byteswap
    scalar           1291115967     1031555336
    sse2             2539571138     1280970926
    ssse3            2537778746     1080016762
    avx2             4950749767     1078493449
    avx512f          9581379998     4010029046
    fastest          avx512f        avx512f

Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4952
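The stride handling and the `varsize` behaviour described above can be illustrated with a minimal, self-contained sketch. It is not the code from this commit: every `sketch_*` name, the `sketch_cksum_t` type, and the `SKETCH_STRIDE` constant are hypothetical stand-ins, and a plain scalar loop takes the place of the SIMD routines. It only assumes the usual Fletcher-4 recurrence over 32-bit words (`a += w; b += a; c += b; d += c`).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: vector stride taken from the commit message (currently 64B). */
#define	SKETCH_STRIDE	64

/* Hypothetical stand-in for zio_cksum_t: four 64-bit running sums. */
typedef struct {
	uint64_t word[4];
} sketch_cksum_t;

/*
 * Incremental scalar pass: continues from the sums already stored in *zc.
 * Only whole 32-bit words are consumed.
 */
static void
sketch_fletcher_4_incremental(const void *buf, size_t size, sketch_cksum_t *zc)
{
	const uint8_t *p = buf;
	uint64_t a = zc->word[0], b = zc->word[1];
	uint64_t c = zc->word[2], d = zc->word[3];

	for (size_t off = 0; off + 4 <= size; off += 4) {
		uint32_t w;

		/* Native-endian load that is safe for unaligned buffers. */
		memcpy(&w, p + off, sizeof (w));
		a += w;
		b += a;
		c += b;
		d += c;
	}

	zc->word[0] = a;
	zc->word[1] = b;
	zc->word[2] = c;
	zc->word[3] = d;
}

/*
 * Native checksum: the part that is a multiple of the stride would go to
 * the selected vector routine (the scalar loop stands in for it here), and
 * the remainder is finished incrementally.
 */
static void
sketch_fletcher_4_native(const void *buf, size_t size, sketch_cksum_t *zc)
{
	size_t bulk = size - (size % SKETCH_STRIDE);

	memset(zc, 0, sizeof (*zc));
	sketch_fletcher_4_incremental(buf, bulk, zc);
	if (bulk < size) {
		sketch_fletcher_4_incremental((const uint8_t *)buf + bulk,
		    size - bulk, zc);
	}
}

/* Variable-size variant: the last (size % 4) bytes are ignored. */
static void
sketch_fletcher_4_varsize(const void *buf, size_t size, sketch_cksum_t *zc)
{
	sketch_fletcher_4_native(buf, size - (size % 4), zc);
}
```

Under these assumptions, a 14-byte buffer whose first 12 bytes hold the native-endian words 1, 2, 3 yields the sums 6, 10, 15, 21 from `sketch_fletcher_4_varsize()`, regardless of the two trailing bytes.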
2026-05-24 03:08:51 +03:00 · 2016-07-12 17:50:54 +02:00
parent 70b258fc96
commit fc897b24b2
7 changed files with 385 additions and 191 deletions
@@ -92,7 +92,7 @@ fletcher_4_avx2_fini(zio_cksum_t *zcp)
 }

 static void
-fletcher_4_avx2(const void *buf, uint64_t size, zio_cksum_t *unused)
+fletcher_4_avx2_native(const void *buf, uint64_t size, zio_cksum_t *unused)
 {
 	const uint64_t *ip = buf;
 	const uint64_t *ipend = (uint64_t *)((uint8_t *)ip + size);
@@ -137,9 +137,11 @@ static boolean_t fletcher_4_avx2_valid(void)
 }

 const fletcher_4_ops_t fletcher_4_avx2_ops = {
-	.init = fletcher_4_avx2_init,
-	.fini = fletcher_4_avx2_fini,
-	.compute = fletcher_4_avx2,
+	.init_native = fletcher_4_avx2_init,
+	.fini_native = fletcher_4_avx2_fini,
+	.compute_native = fletcher_4_avx2_native,
+	.init_byteswap = fletcher_4_avx2_init,
+	.fini_byteswap = fletcher_4_avx2_fini,
 	.compute_byteswap = fletcher_4_avx2_byteswap,
 	.valid = fletcher_4_avx2_valid,
 	.name = "avx2"