With the latest AMD processors adding support for AVX-512, the benefits of vectorization are bigger than ever. The processor can now issue 8 double-precision floating-point operations in parallel, per thread. This includes multiply/add, and even square root!
The program uses the processor feature detection function that the startup code supplied with Microsoft Visual Studio runs automatically. To see the AVX level supported, run the command line help:
sv help
Look for this line:
groupsize= limit double precision vectorization level [8]
The 8 shows that 8 double-precision operations will be issued in parallel. Here are the possible values:
8: AVX-512
4: AVX2
2: AVX
The command line option groupsize= lets you force a vectorization level different from the one detected, for example groupsize=4 on an AVX-512 machine.
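As a rough illustration of how such an option could be applied, here is a minimal sketch. It is not the program's actual code: the function name effectiveGroupSize is hypothetical, and it assumes that the only accepted values are 2, 4 and 8 and that the option acts as an upper limit on the detected level (as the word "limit" in the help text suggests).

```cpp
#include <string>

// Hypothetical helper, not taken from the program's source.
// detected: group size found by feature detection (2, 4 or 8).
// arg:      one command line argument, e.g. "groupsize=4".
// Returns the group size to use. Assumption: the option can only lower
// the level, never raise it above what the processor supports.
inline int effectiveGroupSize(int detected, const std::string& arg) {
    const std::string prefix = "groupsize=";
    if (arg.compare(0, prefix.size(), prefix) != 0) {
        return detected;                      // not a groupsize option
    }
    int requested = 0;
    try {
        requested = std::stoi(arg.substr(prefix.size()));
    } catch (...) {
        return detected;                      // missing or non-numeric value
    }
    if (requested != 2 && requested != 4 && requested != 8) {
        return detected;                      // assumed: only 2, 4, 8 are valid
    }
    return requested < detected ? requested : detected;
}
```

With this sketch, groupsize=4 on an AVX-512 machine selects 4, while groupsize=8 on a machine that only reports 4 is ignored and 4 stays in effect.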
Here is an example of the generated code. The function updateIIR2stage8wAVX512 updates 8 two-stage biquad filters in parallel. Thanks to the extra 16 registers available with AVX-512, the compiler can keep coefficients in registers (zmm18 to zmm20 in the listing) instead of reloading them for each stage updated:
updateIIR2stage8wAVX512:
0000000140005E50: vmovupd zmm2,zmmword ptr [rcx+2C0h]
0000000140005E57: vmovupd zmm18,zmmword ptr [rcx+80h]
0000000140005E5E: vmovupd zmm19,zmmword ptr [rcx+0C0h]
0000000140005E65: vmovupd zmm20,zmmword ptr [rcx+100h]
0000000140005E6C: vmovupd zmm1,zmmword ptr [rcx+240h]
0000000140005E73: vmovupd zmm3,zmmword ptr [rdx]
0000000140005E79: vmulpd zmm17,zmm3,zmmword ptr [rcx]
0000000140005E7F: vfmadd231pd zmm17,zmm2,zmmword ptr [rcx+40h]
0000000140005E86: vfmadd231pd zmm17,zmm18,zmmword ptr [rcx+300h]
0000000140005E8D: vfmadd231pd zmm17,zmm19,zmm1
0000000140005E93: vfmadd231pd zmm17,zmm20,zmmword ptr [rcx+280h]
0000000140005E9A: vmovupd zmmword ptr [rcx+280h],zmm1
0000000140005EA1: vmovupd zmmword ptr [rcx+240h],zmm17
0000000140005EA8: vmovupd zmmword ptr [rcx+300h],zmm2
0000000140005EAF: vmovupd zmmword ptr [rcx+2C0h],zmm3
0000000140005EB6: vmovupd zmm4,zmmword ptr [rcx+1C0h]
0000000140005EBD: vmulpd zmm3,zmm17,zmmword ptr [rcx]
0000000140005EC3: vfmadd231pd zmm3,zmm4,zmmword ptr [rcx+40h]
0000000140005ECA: vfmadd231pd zmm3,zmm18,zmmword ptr [rcx+200h]
0000000140005ED1: vmovupd zmm2,zmmword ptr [rcx+140h]
0000000140005ED8: vfmadd231pd zmm3,zmm2,zmm19
0000000140005EDE: vfmadd231pd zmm3,zmm20,zmmword ptr [rcx+180h]
0000000140005EE5: vmovupd zmmword ptr [rcx+180h],zmm2
0000000140005EEC: vmovupd zmmword ptr [rcx+140h],zmm3
0000000140005EF3: vmovupd zmmword ptr [rcx+200h],zmm4
0000000140005EFA: vmovupd zmmword ptr [rcx+1C0h],zmm17
0000000140005F01: vmovupd zmm0,zmm3
0000000140005F07: ret
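For readers who prefer source code to disassembly, the listing above can be read roughly as the following sketch. This is a reconstruction from the instruction sequence, not the program's actual source: the type and member names (Lanes, BiquadStage, Biquad2x8, updateStage, updateIIR2stage8) are hypothetical, the memory layout is not reproduced, and it assumes a Direct Form I biquad whose feedback coefficients a1 and a2 are stored with their sign already folded in, since the listing only ever adds products. Both stages share the same five coefficients, as they do in the listing.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kLanes = 8;            // 8 doubles fill one 512-bit register
using Lanes = std::array<double, kLanes>;

// Per-stage history, one value per filter (lane).
struct BiquadStage {
    Lanes y1{}, y2{}, x1{}, x2{};            // previous outputs and inputs
};

// Hypothetical state block: shared coefficients plus two stages.
struct Biquad2x8 {
    Lanes b0{}, b1{}, b2{}, a1{}, a2{};      // a1, a2 assumed sign-folded
    BiquadStage stage[2];
};

// One Direct Form I update of a single stage across all 8 filters.
inline Lanes updateStage(const Biquad2x8& f, BiquadStage& s, const Lanes& x) {
    Lanes y{};
    for (std::size_t i = 0; i < kLanes; ++i) {
        y[i] = f.b0[i] * x[i]
             + f.b1[i] * s.x1[i] + f.b2[i] * s.x2[i]
             + f.a1[i] * s.y1[i] + f.a2[i] * s.y2[i];
        s.x2[i] = s.x1[i];  s.x1[i] = x[i];  // shift input history
        s.y2[i] = s.y1[i];  s.y1[i] = y[i];  // shift output history
    }
    return y;
}

// Feed the input through the first stage, then its output through the second.
inline Lanes updateIIR2stage8(Biquad2x8& f, const Lanes& x) {
    const Lanes mid = updateStage(f, f.stage[0], x);
    return updateStage(f, f.stage[1], mid);
}
```

Each loop body corresponds to one vmulpd followed by four vfmadd231pd instructions and four stores in the listing. Whether a given compiler turns these loops into the same instructions depends on its version and options.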