NA Software Ltd* (NASL) recently performed a series of DSP benchmarks on Intel® Architecture (IA) processors. Two results of this study caught my attention. First, NASL found IA processors offer excellent performance—in fact, even low-end Intel® Atom processors did very well on the benchmarks. Second, NASL found that IA processors have class-leading power efficiency. Both of these findings are hugely important—high performance and low power are the main requirements in most DSP systems. The industry seems to be taking notice of these results; vendors like Curtiss Wright and GE Intelligent Platforms are releasing Intel-based boards specifically targeting DSP. We’ll take a look at some of those boards in a moment, but first let’s examine the DSP benchmarks.
In general, DSP performance is driven by three main factors: clock speed, parallelism, and memory system performance. The impact of clock speed is obvious: all other things being equal, higher clock speeds yield higher performance. Memory system performance matters because DSP applications are data-intensive; both the size of the on-chip cache and the efficiency of the external memory system play important roles here.
The NASL benchmarks nicely illustrate these points. Figures 1a and 1b show the results of the NASL vector multiply benchmark. Performance scales with clock speed for small vectors, but performance is throttled as the data set approaches the size of the cache. Only the Intel® Core™2 Duo SL9400, which can fit even the largest vectors in its 6MB cache, does not take a performance hit. The vector multiply benchmark also illustrates the impact of memory system efficiency. Note that the Freescale* MPC 8641D outperforms the Intel Atom processors for small vector sizes, but for large vectors, the Intel Atom processors pull ahead. Clearly, the Intel Atom memory system is more efficient than that of the Freescale MPC 8641D. You can read more about this and a host of other DSP benchmarks in the NASL paper Intel® Atom™ Processor Performance for DSP Applications.
Figure 1a. Complex vector multiply v1(i) := v2(i)*v3(i) for the Freescale MPC 8641D, Intel Atom Z530, Intel Atom N270, and Intel Core 2 Duo SL9400. Times in microseconds; times in italics indicate that the data consumes a significant portion of the cache or is too large to fit in it.
Figure 1b. Complex vector multiply v1(i) := v2(i)*v3(i); MFLOPS = 6 * N / (time for one vector multiply in microseconds).
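The benchmark kernel itself is tiny. Here is a minimal C sketch of the complex vector multiply (the function and variable names are my own, not NASL's). Each complex multiply costs four multiplies and two adds, which is where the 6 * N in the Figure 1b MFLOPS formula comes from:

```c
#include <complex.h>
#include <stddef.h>

/* Complex vector multiply: v1[i] = v2[i] * v3[i].
 * Each element costs 4 multiplies + 2 adds = 6 FLOPs,
 * matching the MFLOPS = 6 * N / time formula in Figure 1b. */
static void cvmul(float complex *v1, const float complex *v2,
                  const float complex *v3, size_t n)
{
    for (size_t i = 0; i < n; i++)
        v1[i] = v2[i] * v3[i];
}
```

A vectorizing compiler can map this loop onto SIMD instructions, and the working set (three arrays of N complex floats) is what determines whether the data fits in cache.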
The importance of parallelism is more nuanced. DSP applications typically involve a lot of data parallelism, and there are several ways to exploit it. Single Instruction Multiple Data (SIMD) is one key approach. Figure 2 illustrates how SIMD works on IA processors. Suppose four single-precision (32-bit) floating-point numbers need to be multiplied by a second value. Rather than performing four sequential multiplies, the processor loads all four numbers into a single 128-bit SIMD register and multiplies all four of them in a single processor clock cycle.
Figure 2. A typical SIMD operation performs four 32-bit floating-point operations simultaneously.
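The Figure 2 operation can be written directly with SSE intrinsics. This is an illustrative sketch assuming an x86 compiler with SSE support; the function name is mine:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Multiply four packed single-precision floats by one scalar,
 * as in Figure 2: one SIMD multiply covers all four lanes. */
static void scale4(float out[4], const float in[4], float s)
{
    __m128 v = _mm_loadu_ps(in);          /* load 4 floats into a 128-bit register */
    __m128 k = _mm_set1_ps(s);            /* broadcast the scalar into all 4 lanes */
    _mm_storeu_ps(out, _mm_mul_ps(v, k)); /* one mulps performs all 4 multiplies */
}
```

In practice you rarely need to write intrinsics by hand; optimizing compilers and vendor math libraries generate these instructions for loops like the vector multiply above.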
Intel® Architecture processors support SIMD through the MMX and SSE instruction set extensions. The new Nehalem-based Intel® Core and Intel® Xeon families support the latest SSE4 instruction set, while the Intel Atom family supports the older SSSE3 version. (The main difference between the two is that SSE4 adds instructions for video encoding and various packed-integer operations.) It’s worth noting that starting with the Intel® Core™2 generation, all Intel Core and Intel Xeon processors can execute two SSE operations per core per cycle. That means a single core can perform eight 32-bit floating-point operations per cycle. (The Intel Atom can also issue two SSE instructions per cycle, but some of its SIMD execution units are only 64 or 32 bits wide. This limits the Intel Atom’s peak throughput to roughly six floating-point operations per cycle per core.)
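To see what those issue rates mean, here is a back-of-envelope peak-throughput calculation. This is a rough sketch that assumes sustained dual SSE issue and ignores memory bottlenecks; real sustained throughput is lower:

```c
/* Peak single-precision FLOPS for an SSE-class core:
 * cores x clock (GHz) x 2 SSE ops/cycle x 4 32-bit lanes per op. */
static double peak_gflops(int cores, double ghz)
{
    return cores * ghz * 2.0 * 4.0;
}
```

For example, four Core i7 cores at 2.53 GHz work out to about 81 peak GFLOPS, which lines up with the rating of the dual-processor Curtiss Wright board discussed later in this post.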
Threading is another important technique for parallel processing. DSP algorithms can leverage threading through:
- Pipelined execution, where an algorithm is divided into stages and each stage is assigned to a different thread
- Concurrent execution, where the data is divided up and each thread performs the entire algorithm on one piece of the data
- A combination of pipelined and concurrent execution
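The concurrent-execution approach can be sketched with POSIX threads. The thread count and names here are illustrative, not taken from the NASL study:

```c
#include <pthread.h>
#include <stddef.h>

/* Concurrent execution: split a vector multiply across threads,
 * each thread running the whole algorithm on one slice of the data. */
#define NTHREADS 4

struct slice { const float *a, *b; float *out; size_t lo, hi; };

static void *worker(void *arg)
{
    struct slice *s = arg;
    for (size_t i = s->lo; i < s->hi; i++)
        s->out[i] = s->a[i] * s->b[i];
    return NULL;
}

static void vmul_threaded(float *out, const float *a, const float *b, size_t n)
{
    pthread_t tid[NTHREADS];
    struct slice s[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        /* Give each thread a contiguous slice of the vectors. */
        s[t] = (struct slice){ a, b, out, n * t / NTHREADS, n * (t + 1) / NTHREADS };
        pthread_create(&tid[t], NULL, worker, &s[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
}
```

Pipelined execution looks similar, except each thread runs a different stage of the algorithm and the slices flow from one thread to the next.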
Threading can produce large performance gains on a multi-core processor; in some cases, performance scales linearly with the number of cores. Current-generation IA processors also support threading through Intel® Hyper-Threading Technology (Intel® HT), which runs two threads on each core. Combine multiple cores with Intel HT, and you get the potential for truly massive performance gains.
Another set of NASL benchmarks illustrates the power of threading. NASL ran a SARMTI radar algorithm on a quad-Xeon system with a total of 24 cores. (This system uses an older Xeon that does not support Intel HT.) As shown in Figures 3a and 3b, performance scaled fairly linearly up to eight threads. Increasing the thread count beyond eight produced somewhat smaller, although still substantial, gains. The reason for the drop-off is not clear; with some tuning it might be possible to sustain linear scaling beyond eight threads. To read more about this benchmark, check out the white paper Optimizing Digital Signal and Image Processing on Intel® Architecture Processors.
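One plausible (though unconfirmed) explanation for that kind of drop-off is a residual serial fraction in the workload. Amdahl's law gives the ceiling; for example, a 5% serial fraction caps 24 threads at roughly an 11x speedup:

```c
/* Amdahl's law: with serial fraction s, the best possible speedup
 * on n threads is 1 / (s + (1 - s)/n). Even a small serial fraction
 * flattens the scaling curve as thread counts grow. */
static double amdahl(double s, int n)
{
    return 1.0 / (s + (1.0 - s) / n);
}
```

Whether this is what happened with SARMTI would require profiling; the NASL paper does not pin down the cause.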
Figure 3a. Total SARMTI performance increase, 1 to 24 cores and threads (times in seconds).
Figure 3b. Total SARMTI speedup, 1 to 24 threads (1T-24T).
So far we’ve focused on performance, but as I noted earlier, power is also a critical consideration. NASL found that IA processors do well on this metric, as shown in Figure 4. The Intel Atom processors exhibit leading power efficiency across the range of vector sizes, while the Intel Core 2 Duo offers competitive power numbers for vector lengths of 16K and above. An important point to keep in mind is that these benchmark results are somewhat dated. The Intel Core i7 is 20% more power efficient than the Intel Core 2 Duo benchmarked here. Similarly, the Intel Atom N450 is 20% more efficient than the Intel Atom N270.
Figure 4. FFT performance comparison (MFLOPS/Watt).
So how can you get your hands on all this DSP goodness? Easy. Many vendors offer boards and modules suitable for DSP. For example, some of the benchmarks mentioned in this blog were performed on the GE Intelligent Platforms VR11 VME64 board. This board features an Intel Core 2 Duo running at up to 2.16 GHz, two Gigabit Ethernet ports, USB 2.0, an integrated HDD or Flash drive, and many other features. GE also has a number of Intel-based VPX boards, including the VPXcel6 SBC622. This single-board computer features an Intel Core i7 at up to 2.53 GHz and a highly ruggedized design that can withstand extreme temperatures, shock, and vibration.
Curtiss Wright has taken things a step further and introduced multiprocessor Intel Core i7 boards designed specifically for DSP. Its first product, the CHAMP-AV5 6U VME64x board, uses two 2.53 GHz dual-core Intel Core i7 processors to deliver performance rated up to 81 GFLOPS. The company is also preparing an OpenVPX Ready (VITA 65) variant of the board known as CHAMP-AV7.
There’s much more I’d like to say on this topic, but it will have to wait for my next blog. Check in next week for a look at the DSP upgrades coming soon in Intel® Advanced Vector Extensions (Intel® AVX), FPGA co-processing solutions, and more.
In the meantime, I’m interested in hearing from you. If you have used Intel-based boards for DSP, how well did they perform? How well did the board features meet your needs?
GE Intelligent Platforms is an Associate member of the Intel® Embedded Alliance. Curtiss Wright Controls is an Affiliate member of the Alliance.
Roving Reporter (Intel Contractor)
Intel® Embedded Alliance
Embedded Innovator magazine