Fortunately, tools are available for Intel embedded processors to optimize system performance. The key is that these tools are geared toward optimizing performance, not maximizing performance as the default condition. Performance analysis relies on a series of tools that measure parameters skilled programmers can use to meet performance and size goals. These measurements and reports include:

A call graph provides a graphical, high-level, algorithmic view of program execution, giving application developers a higher-level view of how the application operates. This helps to identify critical functions and timing details. The data comes from instrumenting the executable files to record function calling sequences.
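To make the instrumentation idea concrete, here is a minimal sketch in C. It is not how VTune instruments executables; it assumes hand-placed hooks, and the names `cg_enter`, `cg_leave`, `cg_calls`, `process_block` and `filter_sample` are hypothetical, invented for this illustration. Each instrumented function reports its entry and exit, and the hooks count every caller-to-callee edge.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define CG_MAX_DEPTH 64
#define CG_MAX_EDGES 128

/* One caller -> callee edge and how many times it was taken. */
struct cg_edge {
    const char *caller;
    const char *callee;
    unsigned long calls;
};

static const char *cg_stack[CG_MAX_DEPTH]; /* functions currently active */
static int cg_depth = 0;
static struct cg_edge cg_edges[CG_MAX_EDGES];
static int cg_edge_count = 0;

/* Hypothetical hook: call at the top of every instrumented function. */
void cg_enter(const char *callee)
{
    const char *caller = (cg_depth > 0) ? cg_stack[cg_depth - 1] : "<root>";
    int i;

    for (i = 0; i < cg_edge_count; i++) {
        if (strcmp(cg_edges[i].caller, caller) == 0 &&
            strcmp(cg_edges[i].callee, callee) == 0)
            break;
    }
    if (i == cg_edge_count) {              /* first time this edge is seen */
        assert(cg_edge_count < CG_MAX_EDGES);
        cg_edges[i].caller = caller;
        cg_edges[i].callee = callee;
        cg_edges[i].calls = 0;
        cg_edge_count++;
    }
    cg_edges[i].calls++;

    assert(cg_depth < CG_MAX_DEPTH);
    cg_stack[cg_depth++] = callee;
}

/* Hypothetical hook: call just before an instrumented function returns. */
void cg_leave(void)
{
    assert(cg_depth > 0);
    cg_depth--;
}

/* Number of recorded calls from caller to callee (0 if never seen). */
unsigned long cg_calls(const char *caller, const char *callee)
{
    for (int i = 0; i < cg_edge_count; i++) {
        if (strcmp(cg_edges[i].caller, caller) == 0 &&
            strcmp(cg_edges[i].callee, callee) == 0)
            return cg_edges[i].calls;
    }
    return 0;
}

/* Print the edge list; a real tool would draw this as a graph. */
void cg_print(void)
{
    for (int i = 0; i < cg_edge_count; i++)
        printf("%s -> %s : %lu\n", cg_edges[i].caller,
               cg_edges[i].callee, cg_edges[i].calls);
}

/* Two made-up application functions carrying the hooks. */
void filter_sample(void)
{
    cg_enter(__func__);
    cg_leave();
}

void process_block(void)
{
    cg_enter(__func__);
    for (int i = 0; i < 3; i++)
        filter_sample();
    cg_leave();
}
```

Calling `process_block()` once and then `cg_print()` lists one edge from the root into `process_block` and one edge from `process_block` into `filter_sample`, with the call count for each. The extra work inside the hooks is also why an exact call graph costs more at run time than sampling does.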


Time-based and event-based sampling are statistical methods for locating performance bottlenecks while imposing low overhead on the application. Time-based sampling finds “hot spots” that consume a significant share of CPU time. Event-based sampling helps identify places where cache misses, branch mispredictions, and other performance issues occur.
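The bookkeeping behind time-based sampling can be sketched in a few lines of C. This is an illustration only, not VTune's implementation. It assumes the program counter values were already captured by some periodic timer interrupt, and the symbol table layout and the names `attribute_samples` and `hottest` are hypothetical. Each sample is attributed to the function whose address range contains it, and the function with the most samples is the hot spot.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical symbol table entry: a function covers [start, end). */
struct symbol {
    unsigned long start;
    unsigned long end;
    const char *name;
    unsigned long hits;   /* samples that landed in this function */
};

/*
 * Attribute each sampled program counter value to the function that
 * contains it. Returns the number of samples that matched no symbol.
 */
unsigned long attribute_samples(struct symbol *syms, size_t nsyms,
                                const unsigned long *pcs, size_t npcs)
{
    unsigned long unknown = 0;

    assert(npcs == 0 || pcs != NULL);
    assert(nsyms == 0 || syms != NULL);

    for (size_t i = 0; i < npcs; i++) {
        size_t s;
        for (s = 0; s < nsyms; s++) {
            if (pcs[i] >= syms[s].start && pcs[i] < syms[s].end) {
                syms[s].hits++;
                break;
            }
        }
        if (s == nsyms)
            unknown++;
    }
    return unknown;
}

/* Return the symbol with the most samples: the statistical hot spot. */
const struct symbol *hottest(const struct symbol *syms, size_t nsyms)
{
    assert(syms != NULL && nsyms > 0);

    const struct symbol *best = &syms[0];
    for (size_t s = 1; s < nsyms; s++) {
        if (syms[s].hits > best->hits)
            best = &syms[s];
    }
    return best;
}
```

Because only a counter is updated per sample, the overhead stays low, but the result is a statistical estimate: functions that run briefly between samples may never appear.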




Source view displays sampling results line by line against the source or assembly code, so the programmer can see which parts of the program code the data belongs to.


A counter monitor provides system-level performance information. This includes resource consumption during the execution of an application.


The Intel Thread Profiler gives programmers a timeline view identifying what threads are doing and how they interact. It shows the distribution of work to threads and locates load imbalances.


The Performance Tuning Utility (PTU) is an optional function that gives VTune analyzer users access to experimental tuning technology. This includes features such as Data Access Analysis, which identifies memory hotspots and relates them to code hotspots.


Intel Parallel Amplifier is the performance profiler component of Intel Parallel Studio. A VTune user license carries access to Parallel Amplifier. It offers a statistical call graph, which has lower overhead than VTune's exact call graph, and provides concurrency analysis.


How these tools are used depends on your optimization goals. To obtain maximum performance, for example, programmers have a number of tricks available. Consider the following pseudo code:


i = 1

Do { a[i] = b[i]*c[i]; i = i + 1 }

Until i > 27;


This code performs one arithmetic operation per iteration through the loop. Ignoring the arithmetic capability of the processor, we also have one loop branch operation per iteration. In the limiting case, this makes the loop take twice as long per full execution of the code fragment as the basic arithmetic alone. So, to speed up this fragment, programmers perform loop unrolling. Some compilers can unroll loops automatically, under the control of switches in the source code. In the extreme case, loop unrolling produces a single line per stage of the arithmetic operation:


a[1] = b[1]*c[1]

a[2] = b[2]*c[2]

...

a[27] = b[27]*c[27]


In the condensed form, the VTune tool kit will show the loop as a “hot spot”: the iteration causes the arithmetic operation to be counted twenty-seven times, and the time-based analysis of the code reports that. In the fully unrolled form, the time-based analysis will lose the hot spot because the code now consists of a series of individual lines of code.


Loop unrolling can speed up applications, but once code is unrolled manually it is very difficult to identify it as a candidate for size reduction. The unrolled code above, for example, looks like twenty-seven unrelated lines of code, and automated tools generally do not recognize such code as an iteration. So, as a practical method of optimizing embedded applications, it’s generally best to write code in a dense form first and expand via loop unrolling and other rewrites as required. Using this approach, VTune can guide the development process to achieve performance requirements.
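As an illustration of the "dense first, expand as required" approach, the sketch below shows the multiply loop in C in its dense form and in a partially unrolled form. The function names and the unroll factor of four are assumptions chosen for this example, and whether the unrolled version is actually faster depends on the processor and compiler, which is exactly what a profiler is for.

```c
#include <assert.h>
#include <stddef.h>

/* Dense form: one multiply and one loop branch per element. */
void multiply_rolled(int *a, const int *b, const int *c, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL && c != NULL));

    for (size_t i = 0; i < n; i++)
        a[i] = b[i] * c[i];
}

/* Unrolled by four: one loop branch per four multiplies. */
void multiply_unrolled4(int *a, const int *b, const int *c, size_t n)
{
    size_t i = 0;

    assert(n == 0 || (a != NULL && b != NULL && c != NULL));

    /* Main body handles complete groups of four elements. */
    for (; i + 4 <= n; i += 4) {
        a[i]     = b[i]     * c[i];
        a[i + 1] = b[i + 1] * c[i + 1];
        a[i + 2] = b[i + 2] * c[i + 2];
        a[i + 3] = b[i + 3] * c[i + 3];
    }

    /* Remainder loop handles the last n % 4 elements (3 when n is 27). */
    for (; i < n; i++)
        a[i] = b[i] * c[i];
}
```

Partial unrolling is a middle ground: it removes most of the branch overhead while keeping the code recognizable as a loop, so it remains visible to both the profiler and the next programmer looking for size reductions.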


Speed issues gain the most attention from many developers. Much of this approach to performance comes from a general focus on a “memory is free” philosophy. But memory can become a significant cost in embedded systems. For a great many applications, the electronic system consists of the processor, support circuits, and memory. It may be obvious that memory comes in discrete units of size, but this makes memory a critical component of system cost.


As an example, for small data sets and simple applications, data can be stored in variable-length arrays that use simple brute force searching in favor of simplified display. The program is small and takes few resources, but the data is loosely packed, taking more space than other techniques. An alternative is the trie. A trie is an ordered tree with an associated array of data. No node in the tree stores the key associated with that node. Instead, its position in the tree shows what key it is associated with. In this type of data structure, information retrieval is a more complex process and takes more processing time than it does in a simple binary structure.
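A minimal trie sketch in C shows how a node's position, rather than a stored key, identifies the entry. Everything here is an assumption made for illustration: keys are restricted to lowercase 'a' to 'z' (and the code assumes an ASCII-style character set where those letters are contiguous), values are plain integers, and the names `trie_new`, `trie_insert`, `trie_lookup` and `trie_free` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define TRIE_ALPHABET 26   /* assumption: keys use only 'a'..'z' */

struct trie_node {
    struct trie_node *child[TRIE_ALPHABET];
    int has_value;         /* nonzero if a key ends at this node */
    int value;
};

/* Allocate an empty node; returns NULL if memory is exhausted. */
struct trie_node *trie_new(void)
{
    struct trie_node *n = malloc(sizeof *n);
    if (n == NULL)
        return NULL;
    for (int i = 0; i < TRIE_ALPHABET; i++)
        n->child[i] = NULL;
    n->has_value = 0;
    n->value = 0;
    return n;
}

/* Store value under key. Returns 0 on success, -1 on a bad key or no memory. */
int trie_insert(struct trie_node *root, const char *key, int value)
{
    assert(root != NULL && key != NULL);

    struct trie_node *n = root;
    for (; *key != '\0'; key++) {
        if (*key < 'a' || *key > 'z')
            return -1;
        int idx = *key - 'a';
        if (n->child[idx] == NULL) {
            n->child[idx] = trie_new();
            if (n->child[idx] == NULL)
                return -1;
        }
        n = n->child[idx];   /* the path taken spells out the key */
    }
    n->has_value = 1;
    n->value = value;
    return 0;
}

/* Look up key. Returns 1 and fills *out if found, 0 otherwise. */
int trie_lookup(const struct trie_node *root, const char *key, int *out)
{
    assert(root != NULL && key != NULL && out != NULL);

    const struct trie_node *n = root;
    for (; *key != '\0'; key++) {
        if (*key < 'a' || *key > 'z')
            return 0;
        n = n->child[*key - 'a'];
        if (n == NULL)
            return 0;
    }
    if (!n->has_value)
        return 0;
    *out = n->value;
    return 1;
}

/* Release a node and everything below it. */
void trie_free(struct trie_node *n)
{
    if (n == NULL)
        return;
    for (int i = 0; i < TRIE_ALPHABET; i++)
        trie_free(n->child[i]);
    free(n);
}
```

Keys such as "can" and "cat" share the nodes for "ca", which is where the space saving on common prefixes comes from. The fixed array of 26 child pointers is chosen here for brevity and is itself memory hungry; a trie aimed at a small memory footprint would likely use a more compact child representation, and measuring both variants is a good use of the tools described above.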


Performance tuning tools let developers try alternative representations quickly, with analytical proof of the effects of each alternative.




There are alternative tools available for performance analysis. Green Hills Software (1) offers The Performance Profiler. The Profiler provides a view into the behavior of the program by precisely specifying:


  • the percentage of time spent executing each source line or instruction
  • the total number of times each line or instruction was executed
  • the total number of times each function was called

Wind River Systems (2) provides a series of run-time analysis tools within its Workbench product:


  • System Viewer
  • Memory Analyzer
  • Performance Profiler
  • Data Monitor

Regardless of the tool suite that you use for developing embedded applications, basic tools exist within each of the mainstream development tool kits to aid in analyzing and optimizing systems performance.


Have you considered how you will optimize your next application? Will you optimize for speed, size, or complexity?



  1. Green Hills Software, Inc. is an Affiliate Member of the Intel® Embedded Alliance
  2. Wind River Systems is an Associate Member of the Intel® Embedded Alliance

Henry Davis
Roving Reporter (Intel Contractor)
Intel® Embedded Alliance