The introduction of Intel® Advanced Vector Extensions (Intel® AVX) in 2011 marked the start of significantly improved vector processing in each generation of the company's processors, which has made them even more popular platforms for signal and image processing. More recently, Intel AVX 2.0 arrived with the Haswell microarchitecture and delivers further performance boosts, including:
- Fused multiply-add (FMA) instructions that double peak floating-point throughput to 307 GFLOPS (billion floating-point operations per second) at 2.4 GHz in a quad-core 4th generation Intel® Core™ processor
- Extension of most integer instructions to 256 bits for two-times higher peak integer throughput
- Doubling fixed point arithmetic throughput
- New vector gather, shift, and cross-lane permute functions that enable more vectorization and more efficient loads and stores, resulting in fixed- and floating-point algorithm improvements
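The 307 GFLOPS figure in the first bullet can be reproduced with simple arithmetic. The sketch below assumes single-precision data (eight 32-bit floats per 256-bit register) and two FMA units per core, which is how Haswell is commonly described; the function name and its parameters are illustrative only.

```python
def peak_gflops(cores, ghz, fma_units_per_core, lanes, flops_per_fma=2):
    """Theoretical peak in GFLOPS: each FMA lane does a multiply and an add per cycle."""
    return cores * ghz * fma_units_per_core * lanes * flops_per_fma

# Assumed: 4 cores, 2.4 GHz, 2 FMA units per core, 8 single-precision lanes.
single = peak_gflops(cores=4, ghz=2.4, fma_units_per_core=2, lanes=8)
# Double precision halves the lane count (four 64-bit values per 256-bit register).
double = peak_gflops(cores=4, ghz=2.4, fma_units_per_core=2, lanes=4)

print(f"single precision: {single:.1f} GFLOPS")
print(f"double precision: {double:.1f} GFLOPS")
```

This is a theoretical ceiling; real code reaches it only when both FMA units are kept busy on every cycle.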
Haswell microarchitecture improvements also contribute to greater performance in signal and image processing applications. These changes include:
- Having the memory pipeline perform two loads and a store operation on each cycle
- Doubling L1 cache bandwidth to 96 bytes per cycle (64 byte read plus 32 byte write)
- Doubling L2 cache bandwidth to 64 bytes per cycle
These upgrades, plus the internal Last Level Cache, 320 GB per second Ring Bus, and DDR3 dual-channel memory (peak memory bandwidth = 25 GB/sec at 1,600 MHz) make sure the processor is constantly "fed" to maximize performance.
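The bandwidth figures quoted above follow from a few multiplications. The sketch below assumes 32-byte (256-bit) loads and stores, a 64-bit (8-byte) data path per DDR3 channel, and 1,600 million transfers per second; these widths are assumptions used for illustration rather than figures stated in this article.

```python
VECTOR_BYTES = 32  # assumed: one 256-bit AVX register

# L1: two loads plus one store per cycle, each assumed to be 32 bytes wide.
l1_read = 2 * VECTOR_BYTES
l1_write = 1 * VECTOR_BYTES
l1_total = l1_read + l1_write

# DDR3: transfers per second x bytes per transfer x channels.
transfers_per_sec = 1600e6  # assumed DDR3-1600
bytes_per_transfer = 8      # assumed 64-bit channel
channels = 2
ddr3_gb_per_sec = transfers_per_sec * bytes_per_transfer * channels / 1e9

print(l1_total, "bytes per cycle from L1")
print(ddr3_gb_per_sec, "GB/sec peak from memory")
```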
Intel AVX 2.0 is most beneficial for applications that are CPU-bound and that spend significant time in vectorizable loops with:
- Iteration count ≥ vector width (i.e. ≥ 8 integers, 8 floats, or 4 doubles)
- Integer arithmetic and bit manipulation (e.g. video and image processing)
- Floating point operations that make use of FMAs (e.g. linear algebra)
- Non-contiguous memory access (i.e. those that can use the new gather and permute instructions)
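To make the first and last criteria concrete, the sketch below models in plain Python how a loop splits into full-width vector chunks plus a scalar remainder, and what a gather does. It is a scalar model of the semantics only, not vectorized code, and `split_iterations` and `gather` are hypothetical helper names.

```python
def split_iterations(n, width=8):
    """Return (full vector iterations, leftover scalar iterations)."""
    return divmod(n, width)

def gather(base, indices):
    """Scalar model of a vector gather: one load per index, results packed together."""
    return [base[i] for i in indices]

# 1,003 floats with 8 lanes per vector.
vec_iters, remainder = split_iterations(1003, width=8)

# Non-contiguous access: pick every third element of a table.
table = [x * 10 for x in range(32)]
picked = gather(table, [0, 3, 6, 9, 12, 15, 18, 21])

print(vec_iters, remainder, picked)
```

A loop with fewer iterations than the vector width never fills a single register, which is why the iteration count matters.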
Benchmarking Intel AVX 2.0 processing performance
N. A. Software (NAS) develops and licenses radar algorithms and low-level DSP libraries, including VSIPL (Vector, Signal, and Image Processing Library). VSIPL supports multithreading and is typically used on large multicore and shared memory systems, allowing scalable performance for large problems.
The company has introduced an optimized Intel AVX 2.0 VSIPL designed for complex vector multiply operations, sine/cosine, and split complex FFTs. The library is standalone code with no dependence on third-party software, so it can be readily recompiled for any operating system while leveraging the benefits of the Intel AVX 2.0 instruction set.
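For readers unfamiliar with the term, "split complex" data keeps real and imaginary parts in separate arrays rather than interleaving them. The sketch below is a plain-Python model of a split complex vector multiply, not NAS code; the function name is hypothetical, and the comments only indicate where an FMA instruction could apply.

```python
def split_complex_multiply(a_re, a_im, b_re, b_im):
    """Element-wise (a_re + i*a_im) * (b_re + i*b_im) on split arrays."""
    out_re, out_im = [], []
    for ar, ai, br, bi in zip(a_re, a_im, b_re, b_im):
        # Each result is a multiply followed by a multiply-add or
        # multiply-subtract, the pattern an FMA instruction can fuse.
        out_re.append(ar * br - ai * bi)
        out_im.append(ar * bi + ai * br)
    return out_re, out_im

prod_re, prod_im = split_complex_multiply(
    [1.0, 2.0], [2.0, 0.0], [3.0, 0.5], [4.0, -1.0]
)

# Cross-check against Python's built-in complex type.
expected = [complex(1, 2) * complex(3, 4), complex(2, 0) * complex(0.5, -1)]
assert [complex(r, i) for r, i in zip(prod_re, prod_im)] == expected
```

Because each array holds only one kind of value, eight real parts and eight imaginary parts can be loaded straight into vector registers without any shuffling.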
NAS recently used VSIPL to benchmark Intel AVX 2.0. The results showed that Intel AVX 2.0 can run functions up to twice as fast as the initial version of Intel AVX, or faster (see below).
The following figure provides greater detail of the first item in the previous table, the 1D FFT using split complex data:
David Murray, NAS technical director, notes "the significant speedups when using the new Intel AVX 2.0 instruction set. These speedups come from the FMA instructions." In a separate, wider-ranging DSP study, he looked at the average speedup of 774 DSP operations between Intel AVX and Intel AVX 2.0. "You see a large increase in performance for operations using integer data or short integer data because the Intel AVX 2.0 instruction set contains a wider range of eight-way SIMD [Single Instruction Multiple Data, i.e. the same operation on multiple data sets] vector operations. There is also a large increase in performance with float operations because the Intel AVX 2.0 instruction set contains eight-way SIMD fused multiply-add instructions. While some of the performance speedups with double precision data are due to our algorithm improvements, the integer and float speedups are down to the Intel AVX 2.0 instruction set," Murray says.
NAS also ran an Intel AVX 2.0 benchmark with the company's SARMTI (Synthetic Aperture Radar and Moving Target Indication) advanced radar processing algorithm. SARMTI extracts high-resolution data on slow- and fast-moving objects directly from a synthetic aperture radar image, eliminating the need for a separate moving target (Doppler) radar. Here again, Intel AVX 2.0 showed notable speed improvements of 1.26 to 1.52 times over Intel AVX. In a similar benchmarking study, Murray reports, "Our SARMTI application speeds up by between 1.33 and 1.42 on Haswell when compared to Ivy Bridge."
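A speedup factor can be easier to judge as a reduction in run time. The short sketch below converts the quoted factors, assuming "speedup" means old run time divided by new run time; the function name is hypothetical.

```python
def runtime_reduction(speedup):
    """Fraction of the original run time saved at a given speedup factor."""
    return 1.0 - 1.0 / speedup

for s in (1.26, 1.52):
    print(f"{s:.2f}x faster -> {runtime_reduction(s):.1%} less run time")
```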
"We can supply Intel AVX 2.0 versions of all our products," Murray concludes. "We have made this investment because the solution, both hardware and software, is significantly faster. However, we still supply Intel AVX and other legacy technology where we have customer requirements. Customers can be slow in changing their hardware selection, and programs can be tied to a hardware decision for many years to come."
Roving Reporter (Intel Contractor), Intel Intelligent Systems Alliance
Follow me on Twitter: @rickdemeis