Generally speaking, microprocessors employ an architecture called SISD (single instruction, single data), in which a processor core applies an instruction to a single data element. With superscalar architectures, Intel® Hyper-Threading Technology, and multi-core implementations, modern processors handle many SISD operations simultaneously, but that's a subject for another day. Today, let's discuss how Intel® extensions to the general-purpose Intel® Architecture (IA), or x86 architecture, can accelerate applications such as image processing or speech recognition. SIMD (single instruction, multiple data) extensions called SSE (Streaming SIMD Extensions) allow one instruction to operate on multiple data elements. SSE allows IA processors to handle applications for which other general-purpose processors might require an additional resource such as a dedicated DSP IC.


Intel has a long history of implementing extensions that include both new instructions and the microarchitecture resources that can accelerate those instructions. The trend started with the floating-point unit (FPU) first integrated in the 80486 family, followed by the Pentium® processor with MMX technology, and today several evolutions of SSE. SSE first came to the Intel® Pentium® III processor, adding eight 128-bit registers with single-precision floating-point math support. Subsequently, SSE2 added double-precision math, and SSE3 added DSP-oriented instructions. Today, the state of the art is SSE4.1, which was implemented in Core microarchitecture (Intel® Core™2 Duo) processors when Intel migrated to the 45-nm design code-named Penryn, and SSE4.2, which was implemented in the newest microarchitecture, code-named Nehalem (Intel® Xeon® processor 5500 series in Intel's embedded family).


To oversimplify the advantages afforded by SIMD extensions, consider a very simple graphics application. Three data elements define a pixel in terms of color and brightness: one each for red, green, and blue. A single SIMD instruction can operate on all three elements simultaneously, so one instruction is enough to change the pixel's brightness.
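
The brightness example can be sketched in C. This is only an illustration under stated assumptions, not code from Intel or from IPP: the function names brighten_scalar and brighten_simd are hypothetical, each color channel is assumed to be an 8-bit value, and the SIMD path uses the SSE2 intrinsics from emmintrin.h when the compiler targets SSE2. Because an SSE register is 128 bits wide, one instruction here covers 16 channels (a little more than five RGB pixels), not just the three channels of a single pixel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Scalar (SISD-style) version: one color channel per addition. */
void brighten_scalar(uint8_t *channels, size_t n, uint8_t delta)
{
    assert(channels != NULL);
    for (size_t i = 0; i < n; ++i) {
        unsigned sum = (unsigned)channels[i] + delta;
        channels[i] = (uint8_t)(sum > 255u ? 255u : sum);  /* clamp at white */
    }
}

/* SIMD version: 16 color channels per addition when SSE2 is compiled in. */
void brighten_simd(uint8_t *channels, size_t n, uint8_t delta)
{
    assert(channels != NULL);
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i vdelta = _mm_set1_epi8((char)delta);  /* delta in all 16 lanes */
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(channels + i));
        v = _mm_adds_epu8(v, vdelta);  /* saturating unsigned byte add */
        _mm_storeu_si128((__m128i *)(channels + i), v);
    }
#endif
    /* Remaining channels, or the whole buffer if SSE2 is unavailable. */
    brighten_scalar(channels + i, n - i, delta);
}
```

Calling brighten_simd(rgb, 3 * pixel_count, 10) on an interleaved RGB buffer should give the same result as the scalar loop. The saturating add keeps bright channels from wrapping around to dark values.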


To get more background on SIMD, you might check out Intel's SSE4 description. Wikipedia also offers a detailed article, along with a root article that links to articles on each stage of the SSE evolution.


Now, how might you take advantage of SSE? The task at hand seems quite complex given the constant evolution in the technology. How do you ensure your code will run on a range of processors? And how do you implement SSE in embedded applications with real-time requirements? These are all good questions. The answers are not trivial, but they are certainly not as difficult as they might seem.


Intel's software team offers a library of software functions called Intel® Integrated Performance Primitives (Intel® IPP). IPP leverages MMX and all flavors of the SSE instructions. Moreover, combined with a compiler, IPP allows you to write code once that will run on any IA processor and leverage the most advanced SSE instructions, and the underlying hardware accelerators, that are available on the target platform. IPP's dispatch feature determines the hardware platform at hand and configures the library for that platform at run time.
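
As a rough illustration of run-time dispatch in general, here is a minimal C sketch. It is not the IPP API and does not show how IPP is implemented. Every name in it (impl_level, detect_impl_level, select_sum, dispatched_sum) is hypothetical, and the "optimized" routine is only a portable stand-in for a separately built SSE variant. Processor detection assumes the GCC/Clang builtins __builtin_cpu_init and __builtin_cpu_supports on x86; on other compilers or targets the sketch simply falls back to the generic path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical instruction-set levels, ordered from least to most capable. */
typedef enum { IMPL_GENERIC, IMPL_SSE2, IMPL_SSE41, IMPL_SSE42 } impl_level;

/* Report the best level the processor supports at run time. */
impl_level detect_impl_level(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();  /* GCC/Clang builtin; assumed available */
    if (__builtin_cpu_supports("sse4.2")) return IMPL_SSE42;
    if (__builtin_cpu_supports("sse4.1")) return IMPL_SSE41;
    if (__builtin_cpu_supports("sse2"))   return IMPL_SSE2;
#endif
    return IMPL_GENERIC;
}

typedef long (*sum_fn)(const int *, size_t);

/* Baseline routine that runs on any processor. */
static long sum_generic(const int *v, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; ++i) total += v[i];
    return total;
}

/* Stand-in for an SSE build: four partial sums mimic four SIMD lanes. */
static long sum_four_lane(const int *v, size_t n)
{
    long lane[4] = {0, 0, 0, 0};
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (int k = 0; k < 4; ++k) lane[k] += v[i + k];
    long total = lane[0] + lane[1] + lane[2] + lane[3];
    for (; i < n; ++i) total += v[i];  /* leftover elements */
    return total;
}

/* Map a detected level to the routine that should serve it. */
sum_fn select_sum(impl_level level)
{
    return level >= IMPL_SSE2 ? sum_four_lane : sum_generic;
}

/* Public entry point: detect once, then reuse the chosen routine. */
long dispatched_sum(const int *v, size_t n)
{
    static sum_fn chosen = NULL;
    if (chosen == NULL) chosen = select_sum(detect_impl_level());
    assert(chosen != NULL);
    return chosen(v, n);
}
```

The application only ever calls dispatched_sum. Which variant runs is decided on the target machine, which is the write-once property described above.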


You can use IPP with compilers from multiple sources and with multiple operating systems. For instance, TenAsys offers the INtime RTOS, which combines real-time capabilities and virtualization support. TenAsys is an Affiliate member of the Intel® Embedded and Communications Alliance (Intel® ECA). In a typical implementation, INtime runs on one virtual processor handling real-time tasks, while Windows runs on a separate virtual processor handling tasks such as the user interface: a dual-operating-system, single-processor scenario unique to IA. TenAsys also supports the IPP library to leverage both MMX and SSE instructions.


I'll go into more detail with SSE examples in the future, but here are several more places where you can find help with SSE and SIMD in general. Intel and TenAsys co-authored an article that describes how to leverage SIMD instructions and includes some specific examples and details of performance gains. For example, in a computer vision application, SIMD instructions provided a 4.3x performance advantage relative to an expertly crafted C implementation using only standard IA instructions.


A professor in the electrical engineering department at the University of Colorado authored an excellent paper that provides background on SIMD and details on algorithm acceleration. Finally, the website includes an article specifically on SSE4 for audio, video, and imaging applications.


Do you have experience with SIMD applications or specifically with SSE? Have you perhaps been able to eliminate a DSP IC using SIMD extensions? Please share your comments with the many followers of the Intel® ECA community.


Maury Wright

Roving Reporter (Intel Contractor)

Intel® Embedded Alliance