In my last blog, I looked at DSP benchmarks for Intel® Architecture processors. While these benchmarks are a good place to start, they only tell part of the story. DSP board vendor Curtiss Wright recently surveyed its customers and found that memory performance and inter-processor bandwidth were even more important than raw FLOPS ratings. In this blog we’ll look at some benchmarks that show how both types of bandwidth have seen major improvements in the latest Intel Architecture processors. We’ll also look at DSP co-processing solutions using FPGAs as well as the next-generation DSP features coming to the Intel Architecture in 2011.


But first, let’s briefly revisit our look at DSP FLOPS.  In my last blog I showed how threading a DSP application leads to performance gains.  However, we only looked at running one thread per core, as the processor did not support Intel® Hyper-Threading Technology (Intel® HT).  That leaves an obvious question: How much performance gain can you expect from Intel HT in a DSP workload?


Curtiss Wright took a look at this question using a single quad-core Intel® Xeon® processor-based platform. The company executed complex floating-point multiplication using one to eight threads and obtained the results depicted in Figure 1. As expected, the performance scales linearly from one to four threads, as each additional thread employs previously unused cores. The eight-thread case activates Intel® HT for a roughly 25% performance bump. This is a smaller gain than we saw from adding more cores, but the fact that we can get this much extra performance without any extra cores is rather impressive.



Figure 1. Performance scaling for a quad-core Intel Xeon.  The eight-thread case demonstrates the extra performance available with Intel HT. 
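To put those numbers side by side, here is a small back-of-the-envelope sketch. It is not Curtiss Wright's benchmark; it simply encodes the two observations above (linear scaling up to four threads, then a roughly 25% bump at eight threads) as an idealized model. The function name and the decision to model only the one-to-four and eight-thread cases are my own assumptions.

```python
CORES = 4        # quad-core platform described above
HT_GAIN = 0.25   # "roughly 25%" bump reported for the eight-thread case


def relative_throughput(threads):
    """Idealized throughput relative to one thread (hypothetical model)."""
    if 1 <= threads <= CORES:
        # One thread per core: the text reports linear scaling.
        return float(threads)
    if threads == 2 * CORES:
        # Two threads per core: Intel HT adds HT_GAIN on top of all cores.
        return CORES * (1.0 + HT_GAIN)
    raise ValueError("only 1..CORES and 2*CORES threads are modeled")


if __name__ == "__main__":
    for t in (1, 2, 3, 4, 8):
        print(t, "threads ->", relative_throughput(t), "x")
```

In this simplified model the eight-thread case comes out at 5.0x a single thread, so Intel HT on the quad-core part is worth about one additional core.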


Now let’s turn our attention back to memory bandwidth and inter-processor bandwidth.   Trenton Technology recently ran a series of benchmarks that look at these metrics.  The company tested its Trenton JXT6966 board, which features two quad-core Intel® Xeon® C5500 Series processors, against the older Intel® Xeon® E5440-based  Trenton MCXT.  The results show the benefits of the upgrades in the Intel Xeon C5500 Series processors, which include integrated memory controllers and PCI Express* Gen 2.0 links.


First up are the memory bandwidth benchmark results in Figure 2. The integrated memory controller gives the Intel Xeon C5500 Series processors a tremendous advantage here. The results illustrate an overall memory bandwidth performance increase of approximately 440% compared with the Intel Xeon E5440 architecture.



Figure 2.  The integrated memory controller gives the Intel Xeon C5500 Series processors a tremendous bandwidth advantage. 
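If you have never looked inside a memory bandwidth test, the basic idea is simple: move a buffer that is much larger than the caches and divide the bytes moved by the elapsed time. The sketch below is a hypothetical illustration of that idea, not the tool Trenton used, and the function name and default sizes are my own choices. Absolute numbers from an interpreted language will not be comparable to a tuned native benchmark.

```python
import time


def copy_bandwidth_mb_per_s(size_bytes=32 * 1024 * 1024, repeats=5):
    """Estimate copy bandwidth in MB/s from the fastest of several bulk copies."""
    src = bytearray(size_bytes)
    dst = bytearray(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        dst[:] = src  # one bulk copy: reads size_bytes and writes size_bytes
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    best = max(best, 1e-9)        # guard against a zero timer reading
    bytes_moved = 2 * size_bytes  # count both the read and the write traffic
    return bytes_moved / best / 1e6


if __name__ == "__main__":
    print("copy bandwidth: %.0f MB/s" % copy_bandwidth_mb_per_s())
```

Taking the best of several runs is a common way to reduce noise from other activity on the system.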


Figure 3 shows the memory latency benchmark results. These results show how integrating the memory controller into the CPU dramatically reduces latency and associated memory delays. The results illustrate a 35% reduction in memory latency and a 57% reduction in memory delays with the Intel Xeon EC5549 and Intel 3420 PCH (Ibex Peak) chipset combination compared to the Intel Xeon E5440 with the Intel 5000P MCH and the Intel ESB2 ICH.



Figure 3. Memory latency benchmark results show the benefits of the integrated controller. 
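Latency tests are usually built differently from bandwidth tests. A common technique is pointer chasing: each load depends on the result of the previous one, so the hardware cannot overlap the accesses. Here is a minimal sketch of that technique. It is my own illustration rather than Trenton's benchmark, all names are hypothetical, and in Python the interpreter overhead dominates the timing, so treat it as a picture of the access pattern rather than a measurement tool.

```python
import random
import time


def single_cycle_permutation(n, seed=0):
    """Sattolo's algorithm: a random permutation forming one n-element cycle."""
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, which is what forces a single cycle
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, steps):
    """Follow the chain for `steps` dependent loads, starting at index 0.

    Returns the final index and the average time per step in nanoseconds.
    """
    idx = 0
    start = time.perf_counter()
    for _ in range(steps):
        idx = nxt[idx]  # the next address depends on this load's result
    elapsed = time.perf_counter() - start
    return idx, elapsed / steps * 1e9


if __name__ == "__main__":
    table = single_cycle_permutation(1 << 20)
    _, ns_per_step = chase(table, 1 << 20)
    print("average time per dependent access: %.1f ns" % ns_per_step)
```

The single-cycle permutation matters: it guarantees that the walk touches every element before repeating, so the chain cannot settle into a short loop that fits in cache.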


For many DSP systems, PCI Express performance will also be important. I don’t have any PCI Express performance data, but I expect that the integration of the PCI Express controller onto the CPU will also provide significant bandwidth and latency improvements.


Finally, Figure 4 shows inter-core bandwidth and latency. The Intel Xeon EC5500 processor series enables a 166% performance gain in inter-core bandwidth, with a corresponding 174% improvement in inter-core latency, compared to the older Intel Xeon E5440 processors.



Figure 4. The Intel Xeon EC5500 processor series enables improvements in inter-core bandwidth and latency. 


That’s all I have to say on the current state of the art, but I’d like to spend a few minutes looking at the future of DSP on Intel Architecture processors. One area that is showing promise is the use of FPGAs as co-processors to Intel Architecture processors. For example, XtremeData* offers a module with three Altera* Stratix* III FPGAs that plugs into a dual-socket server board. To benchmark the solution’s performance, XtremeData implemented a polyphase filter that takes 16-bit fixed-point input data and outputs single-precision floating-point data. Early testing indicates that the filter requires only one-quarter of the available FPGA gates, yet provides a roughly 42X performance increase over a single-core implementation on the CPU.
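For readers who have not worked with polyphase filters, here is a small software sketch of the idea, using a decimating filter as the example. It is not XtremeData's design, which runs in FPGA fabric; the Q15 interpretation of the 16-bit input, the function names, and the choice of a decimator are all my own assumptions. The point is that splitting the taps into several sub-filters lets every multiply run at the lower output rate while producing the same answer as filtering at full rate and discarding samples.

```python
import struct


def to_float32(x):
    """Round a Python float to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def q15_to_float(sample):
    """Interpret a 16-bit signed integer as a Q15 fraction in [-1, 1)."""
    return sample / 32768.0


def direct_decimate(samples, taps, m):
    """Reference: compute the FIR output only at every m-th input sample."""
    out = []
    for n in range(0, len(samples), m):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * q15_to_float(samples[n - k])
        out.append(to_float32(acc))
    return out


def polyphase_decimate(samples, taps, m):
    """Same outputs, organized as m sub-filters running at the output rate."""
    phases = [taps[p::m] for p in range(m)]  # phases[p][j] == taps[j*m + p]
    out = []
    for i in range((len(samples) + m - 1) // m):
        acc = 0.0
        for p, phase in enumerate(phases):
            for j, h in enumerate(phase):
                idx = (i - j) * m - p  # input sample feeding tap j of phase p
                if idx >= 0:
                    acc += h * q15_to_float(samples[idx])
        out.append(to_float32(acc))
    return out


if __name__ == "__main__":
    data = [1000, -2000, 3000, 4000, -5000, 6000, 7000, -8000]
    coeffs = [0.125, 0.25, 0.5, 0.25, 0.125, 0.0625]
    print(polyphase_decimate(data, coeffs, 2))
```

In hardware, each phase maps naturally onto its own multiply-accumulate pipeline, which is part of why this structure suits FPGAs.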


As another example, Nallatech* is shipping modules with Xilinx* Virtex*-5 FPGAs for 4-socket servers. To demonstrate the modules’ performance, Nallatech implemented four 10GbE channels on the FPGA with AES-256 encryption/decryption at 15 Gbps on each of the four channels. After implementing this demonstration, 60% of the FPGA fabric remained available for packet/message filtering.


These co-processing solutions are enabled by Intel® QuickAssist Technology. Today this technology is used to couple an FPGA to the CPU via the front-side bus. In the near future, it will be used to connect the FPGA using Intel® QuickPath Interconnect. You can read more about the upcoming developments in my Intel QPI sneak peek blog. I also recommend the white paper Deploying a Unified API for Algorithm Acceleration for a good general overview of Intel QuickAssist Technology.


The other big DSP news I’m looking forward to is the arrival of the Intel® Advanced Vector Extensions (Intel® AVX) ISA in 2011. Intel AVX will double the size of the SIMD registers from 128 bits to 256 bits, enabling a theoretical doubling in performance.  The new ISA will also have a number of major new features that improve the flexibility and capability of SIMD operations. Intel AVX deserves a whole blog of its own, but I’ll save that until the first chips arrive in early 2011.  For now I’ll just note that the Intel AVX instructions are already supported in the Intel® Software Development Emulator, allowing you to start working with the new instructions today.  Go take a look and let me know what you think!
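The "theoretical doubling" is easy to see with a little arithmetic. The sketch below only counts lanes and vector operations for single-precision (32-bit) data. It does not execute any SIMD instructions, the helper names are my own, and real speedups will depend on memory bandwidth and the instruction mix.

```python
def lanes(register_bits, element_bits=32):
    """Number of elements that fit in one SIMD register."""
    return register_bits // element_bits


def vector_ops(n_elements, register_bits, element_bits=32):
    """Vector operations needed to touch n_elements once (rounded up)."""
    per_op = lanes(register_bits, element_bits)
    return -(-n_elements // per_op)  # ceiling division


if __name__ == "__main__":
    n = 1024
    for bits in (128, 256):
        print(bits, "bit registers:", lanes(bits), "floats per register,",
              vector_ops(n, bits), "operations for", n, "floats")
```

A 128-bit register holds four single-precision values and a 256-bit register holds eight, so a 1,024-element pass drops from 256 vector operations to 128.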


Curtiss Wright Controls and Trenton Technology are Affiliate members of the Intel® Embedded Alliance.



Kenton Williston

Roving Reporter (Intel Contractor)

Intel® Embedded Alliance


Embedded Innovator magazine