Intel® first added support for simultaneous multithreading in the Pentium® 4 family of products, calling the technique Intel® Hyper-Threading (HT) technology. HT allows multithreaded applications to run several threads simultaneously on a processor that has multiple parallel execution engines. The technology was not present in the Intel® Core™ microarchitecture but reappeared in the Nehalem microarchitecture that is the basis for the newest Xeon® and Core i7, i5, and i3 families. The Nehalem microarchitecture includes four separate instruction decoders that operate in parallel, enabling a high degree of parallel instruction execution on a single core and maximizing the performance of threaded applications.
Nehalem-based processors include three identical instruction decoders for simple instructions and one for complex instructions. The processors can use the four decoders to support two threads on each core. Intel offers Nehalem-based processors with two to six cores, which means the high end of the family, the new Xeon 5600 series, can handle 12 threads simultaneously.
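The thread counts quoted in this post follow from simple multiplication. Here is a minimal Python sketch of that arithmetic; the function name `logical_threads` and its parameters are my own, chosen for illustration, and the default of two threads per core reflects the HT behavior described above.

```python
def logical_threads(sockets: int, cores_per_socket: int, threads_per_core: int = 2) -> int:
    """Return how many hardware threads a system can run at once.

    threads_per_core defaults to 2, the per-core thread count that
    HT provides on the Nehalem-based processors discussed here.
    """
    if min(sockets, cores_per_socket, threads_per_core) < 1:
        raise ValueError("all counts must be at least 1")
    return sockets * cores_per_socket * threads_per_core


# One six-core Xeon 5600 series processor with HT enabled: 1 * 6 * 2
print(logical_threads(sockets=1, cores_per_socket=6))  # 12
```

Passing `threads_per_core=1` gives the count you would see with HT disabled, which makes for a handy side-by-side comparison when sizing a thread pool.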
Embedded design teams can leverage HT technology in several ways. For example, you can use HT for virtualization and run multiple operating systems on one core. Early Atom™ family members included HT support, although the microarchitecture only included two parallel instruction paths. I wrote a blog post last year about how you can use that HT support for virtualization even though the early Atom processors lacked Intel® Virtualization Technology (VT).
In the case of the Xeon family, the more likely use of HT is for performance, since the processors integrate VT support. With HT, design teams can gain performance in compute-intensive applications by threading them.
RadiSys*, for example, performed a case study in conjunction with the Georgia Institute of Technology (Georgia Tech) on a medical imaging application hosted on a Xeon 5500 series system. Applications such as CT, MRI, and ultrasound are capturing an increasing amount of data that must be processed immediately. For example, CT scans now have sub-millimeter resolution, yet doctors want to process the data in seconds. The case study set out to evaluate how parallel processing could accelerate such imaging applications.
Engineers ran benchmark tests on a dual-processor server with quad-core CPUs. The implementation provided eight cores and the ability to process 16 threads simultaneously. The application was a Katsevich algorithm that is used for 3D CT reconstruction. As the figure below illustrates, execution time improved with the number of threads, although, as you might expect, the biggest gain came going from one to two threads and the benefit diminished as the number grew to 16. Still, the performance gain essentially comes for free, so even the relatively small gains that come from moving from 8 to 16 threads are worthwhile.
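If you run a similar thread-scaling test on your own system, a small helper can turn the raw execution times into speedup and per-thread efficiency figures. The Python sketch below is only an illustration: the function name `scaling_report` is hypothetical, and the timings in the usage lines are made-up placeholders, not the results of the RadiSys and Georgia Tech case study.

```python
def scaling_report(times_by_threads: dict[int, float]) -> dict[int, tuple[float, float]]:
    """Map each thread count to (speedup, efficiency).

    Speedup is the one-thread time divided by the time at that thread
    count; efficiency is speedup divided by the number of threads.
    """
    if 1 not in times_by_threads:
        raise ValueError("a single-thread baseline time is required")
    baseline = times_by_threads[1]
    report = {}
    for threads, seconds in sorted(times_by_threads.items()):
        if threads < 1 or seconds <= 0:
            raise ValueError("thread counts and times must be positive")
        speedup = baseline / seconds
        report[threads] = (speedup, speedup / threads)
    return report


# Placeholder timings in seconds, invented purely for illustration.
placeholder_times = {1: 80.0, 2: 40.0, 4: 20.0}
for threads, (speedup, efficiency) in scaling_report(placeholder_times).items():
    print(f"{threads:>2} threads: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Efficiency that falls as the thread count rises is the numerical form of the diminishing benefit described above, so it is a useful figure to track once you pass the number of physical cores.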
RadiSys offers a range of HT-enabled platforms based on Intel processors. Back in March, for example, the company announced the Procelerant RMS420-5520DT embedded server based on the new Xeon 5600 series. The product supports as many as 12 cores in a dual-processor configuration. Medical imaging is a target market for the product, along with video streaming and high-performance test and measurement.
Have you experimented with threaded applications and an HT-enabled processor? If so, what kind of performance gain did you measure? If not, what are the obstacles to using HT? Please share your experience with fellow followers of the Intel® Embedded Community.
Roving Reporter (Intel Contractor)
Intel® Embedded Alliance
*RadiSys is a Premier Member of the Intel® Embedded Alliance