Embedded applications such as imaging, network intrusion detection, and call processing are good examples of applications that benefit from multi-core processors. Performance increases because multiple threads running on multiple cores can process many requests simultaneously. This article provides a high-level summary of optimizing a serially designed application for multi-core. The exercise uses the open-source AMIDE (A Medical Imaging Data Examiner) application as the example. Using rendering techniques, AMIDE generates a display image from complex medical imaging data sets.


When threading an application for multi-core, the largest performance gains come from parallelizing the portions of code where the most CPU-intensive processing is performed. Finding those portions requires thorough analysis of the application, and it also means the serial code should be optimized before parallelism is introduced. Baseline performance is then measured, and a goal is set for the target multi-core platform.


Threading is introduced to exploit the parallelism through an iterative performance-tuning methodology consisting of the steps illustrated in Figure 1. Following the 80/20 rule, once the most compute-intensive blocks have been parallelized, further changes will eventually reach diminishing returns. So, don't lose sight of the goal.





Figure 1 - Performance Tuning Methodology



You can choose between two approaches to parallelizing the code: decomposing by data or by control (functional decomposition). When there are a number of independent tasks that can run in parallel, the application is suited to functional decomposition. When there is a large set of independent data that is processed through the same operation, data decomposition may be better.
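To make data decomposition a bit more concrete, here is a minimal sketch of the bookkeeping behind it: dividing a range of independent items into one contiguous chunk per worker. The names `chunk_t` and `chunk_for_worker` are made up for this illustration and do not come from AMIDE or the case study.

```c
#include <stddef.h>

/* Half-open range [begin, end) of items assigned to one worker.
 * Hypothetical type, used only for this illustration. */
typedef struct {
    size_t begin;
    size_t end;
} chunk_t;

/* Split n_items as evenly as possible across n_workers (n_workers > 0).
 * The first (n_items % n_workers) workers each receive one extra item,
 * so chunk sizes never differ by more than one. */
chunk_t chunk_for_worker(size_t n_items, size_t n_workers, size_t worker)
{
    size_t base  = n_items / n_workers;   /* items every worker gets */
    size_t extra = n_items % n_workers;   /* leftover items to spread */
    chunk_t c;

    c.begin = worker * base + (worker < extra ? worker : extra);
    c.end   = c.begin + base + (worker < extra ? 1 : 0);
    return c;
}
```

With 10 items and 4 workers, the chunks come out as [0, 3), [3, 6), [6, 8), and [8, 10). Each worker then applies the same operation to its own chunk, which is the defining trait of data decomposition.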



Once the parallelization approach is decided, threads can be introduced into the code by several methods. These include the implicit (compiler-based) threading model of OpenMP and explicit threading, such as POSIX threads (pthreads). Explicit threading is coded manually by the programmer. You have complete control but also bear responsibility for all of the thread management, such as starting and stopping threads, synchronization code, etc. With implicit threading, on the other hand, the compiler creates all of that underlying parallel code for you. Implicit threading can be appealing to a developer because it minimizes the coding effort, reduces the chance of bugs, provides OS portability, and allows one code base to be used for both the serial and parallel versions of an application. With either method, the threaded code must be tested and debugged to ensure that no errors were introduced.
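As a rough illustration of implicit threading, the sketch below adds a single OpenMP directive to an ordinary loop. The function `scale_samples` is a hypothetical example, not code from AMIDE. It assumes the loop iterations are independent of one another, which is what makes the directive safe here.

```c
#include <stddef.h>

/* Multiply every sample by gain, in place.
 * The pragma asks an OpenMP-aware compiler to divide the loop iterations
 * among threads. When OpenMP is not enabled, the pragma is ignored and
 * the same loop simply runs serially. */
void scale_samples(float *samples, size_t count, float gain)
{
    long i;  /* signed index, accepted by all OpenMP versions */

    #pragma omp parallel for
    for (i = 0; i < (long)count; i++) {
        samples[i] *= gain;
    }
}
```

Built with OpenMP support (for example, `-fopenmp` with GCC), the loop runs in parallel; built without it, the identical source is the serial version. That is the "one code base" benefit described above.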



The final step in the tuning cycle is measuring the code's performance. Key metrics that affect the overall performance include: cache hit rate, CPU utilization, synchronization overhead, and thread stall overhead. From these data you can determine the efficiency and level of optimization of the code. With multiple threads executing simultaneously, there are many opportunities for bottlenecks and resource contention issues. Sophisticated analysis tools are available to help quickly identify these problem areas, enabling you to tune your application accordingly.
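Metrics such as cache hit rate and thread stalls need a profiling tool, but the most basic measurement, elapsed time for a block of work, can be taken by hand. The following is a minimal sketch using the C11 `timespec_get` function; `time_call` and the sample workload `sum_to_n` are hypothetical names chosen for this example.

```c
#include <time.h>

/* Run work(arg) once and return the elapsed wall-clock time in seconds.
 * TIME_UTC is the calendar clock, so treat very short intervals with care. */
double time_call(void (*work)(void *), void *arg)
{
    struct timespec t0, t1;

    timespec_get(&t0, TIME_UTC);
    work(arg);
    timespec_get(&t1, TIME_UTC);

    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}

/* Sample workload: arg points to a long n, which is replaced by 1 + 2 + ... + n. */
void sum_to_n(void *arg)
{
    long *value = arg;
    long n = *value;
    long total = 0;

    for (long i = 1; i <= n; i++) {
        total += i;
    }
    *value = total;
}
```

Timing the same workload before and after threading, on the same machine and input, gives the baseline and the speedup figure to compare against your goal.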



The study of AMIDE concentrated on its core image rendering engine. Data flow analysis identified the major code blocks and the amount of time spent in each, and performance analysis identified the area of code where the majority of processing occurred. Since the application was found to have inherent data parallelism, it was threaded using the data decomposition approach, and POSIX threads were chosen as the threading method. OpenMP would have been a good candidate for implementing the threads, but I suspect the folks who threaded AMIDE were already comfortable with POSIX threads.
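For readers who have not seen explicit threading, here is a small sketch of data decomposition with POSIX threads. It is not AMIDE's rendering code; `slice_t`, `invert_slice`, `invert_parallel`, and the `MAX_THREADS` limit are all invented for this example, and the "processing" is just inverting 8-bit pixel values.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16  /* arbitrary cap chosen for this sketch */

/* One thread's share of the pixel buffer: the half-open range [begin, end). */
typedef struct {
    unsigned char *pixels;
    size_t begin;
    size_t end;
} slice_t;

/* Thread entry point: invert every pixel in this thread's slice. */
static void *invert_slice(void *arg)
{
    slice_t *s = arg;

    for (size_t i = s->begin; i < s->end; i++) {
        s->pixels[i] = (unsigned char)(255 - s->pixels[i]);
    }
    return NULL;
}

/* Invert count pixels using n_threads worker threads.
 * Returns 0 on success, or -1 if n_threads is out of range. */
int invert_parallel(unsigned char *pixels, size_t count, size_t n_threads)
{
    pthread_t tid[MAX_THREADS];
    slice_t slice[MAX_THREADS];
    int created[MAX_THREADS];

    if (n_threads == 0 || n_threads > MAX_THREADS) {
        return -1;
    }

    for (size_t t = 0; t < n_threads; t++) {
        /* Slices are disjoint, so the threads need no locking. */
        slice[t].pixels = pixels;
        slice[t].begin  = count * t / n_threads;
        slice[t].end    = count * (t + 1) / n_threads;

        created[t] = (pthread_create(&tid[t], NULL, invert_slice, &slice[t]) == 0);
        if (!created[t]) {
            /* Could not start a thread: do this slice in the calling thread. */
            invert_slice(&slice[t]);
        }
    }

    /* Wait for every thread that was actually started. */
    for (size_t t = 0; t < n_threads; t++) {
        if (created[t]) {
            pthread_join(tid[t], NULL);
        }
    }
    return 0;
}
```

Notice how much of the code is thread management (creating, tracking, and joining threads) rather than the actual pixel work. That is the trade-off of explicit threading: full control, at the cost of writing and debugging this scaffolding yourself.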



The net result of threading AMIDE to run on a four-core Intel® system (two dual-core processors) was a measured speedup of 3.3x over the single-core baseline.
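To put that number in perspective, a 3.3x speedup on four cores works out to a parallel efficiency of 3.3 / 4 = 0.825, or about 82.5% of ideal linear scaling. The sketch below computes that figure, along with the parallel fraction implied by Amdahl's law. Note that Amdahl's law is my own addition here, not part of the case study, and it assumes zero threading overhead, so the result is only a rough estimate. The function names are hypothetical.

```c
/* Parallel efficiency: the fraction of ideal linear speedup achieved. */
double parallel_efficiency(double speedup, int cores)
{
    return speedup / cores;
}

/* Parallel fraction p implied by Amdahl's law,
 *     S = 1 / ((1 - p) + p / N),
 * solved for p. Assumes cores > 1 and no threading overhead. */
double amdahl_parallel_fraction(double speedup, int cores)
{
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / cores);
}
```

For a speedup of 3.3 on 4 cores, `amdahl_parallel_fraction` returns roughly 0.93, suggesting that, under this simplified model, about 93% of the run time was spent in code that was successfully parallelized.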



And how long does this effort take, you ask? The answer depends on the nature of the starting code, the parallelization experience of the engineers, and their knowledge of the specific application. Real-world examples, such as AMIDE, have shown that it's possible to migrate an application from single- to multi-core, with excellent performance results, in only days.



Visit http://download.intel.com/technology/advanced_comm/315697.pdf to read the full AMIDE case study.






Message Edited by serenajoy on 03-11-2009 08:39 PM