The software performance gain you can expect when migrating from single- to multi-core depends on several factors, not least the architecture and very nature of the application. A logical starting point is to scale single-core performance linearly by the number of cores. In practice, system overhead consumes some capacity, which experience shows can be 10-20% - so a reasonable expectation for many applications is slightly sub-linear scaling, for example 3.2x to 3.8x on a four-core processor. The really good news is that some embedded applications, such as network packet processing, can actually scale supra-linearly if the right programming concepts are applied to fully leverage the multi-core platform's features. Intel has demonstrated this by porting the popular open source Snort network intrusion detection software to a four-core processor, achieving more than 6.2x the performance of a single core. This article is a brief summary of that Snort exercise.
Exceptional Snort performance was largely achieved through techniques that maximize the benefits of cache memory. Cache efficiency is the performance linchpin of most modern processor systems and takes on even greater significance in the world of multi-core. The cache-hit rate often correlates with a program's locality of reference: the degree to which the program's memory accesses are confined to a relatively small set of addresses. Conversely, a program that accesses a large amount of data from scattered addresses is less likely to use the cache efficiently. Multi-core architectures let developers partition work so that the cache associated with each individual core is used more effectively. With multiple caches available, developers can optimize data locality, driving higher cache-hit rates and improved overall application performance.
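The link between locality of reference and the cache-hit rate can be illustrated with a toy simulation. The sketch below models a simple direct-mapped cache (the line size, line count, and access patterns are illustrative assumptions, not measurements from the Snort exercise): sequential accesses reuse each cache line many times, while widely scattered accesses keep evicting lines and hit almost never.

```python
def hit_rate(addresses, cache_lines=64, line_size=64):
    """Simulate a direct-mapped cache and return the fraction of hits.

    Each address maps to one cache line (tag) and one slot; a hit occurs
    when the slot already holds that line's tag.
    """
    cache = [None] * cache_lines
    hits = 0
    for addr in addresses:
        tag = addr // line_size          # which memory line this byte is in
        slot = tag % cache_lines         # direct-mapped: one slot per line
        if cache[slot] == tag:
            hits += 1
        else:
            cache[slot] = tag            # miss: fetch the line, evict old tag
    return hits / len(addresses)

# Sequential bytes reuse each 64-byte line 64 times: high hit rate.
sequential = hit_rate(list(range(4096)))
# A large stride touches a new line every time: zero hits.
scattered = hit_rate([i * 4096 for i in range(1000)])
```

With these parameters the sequential scan hits on 63 of every 64 accesses, while the strided scan never hits - the same contrast that makes a small working set per core so valuable.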
When migrating an application to multi-core there are often numerous ways to distribute the code among the cores, and different configurations can yield widely varying performance; finding the optimal one may require some experimentation. The most obvious, and probably simplest, trial was to run parallel copies of the full Snort application on each of the four cores, with each copy handling a quarter of the total packets. This option produced sub-optimal results. After thorough analysis of the code's architecture and dataflow, the developers converged on a high-performance configuration by combining the concepts of functional pipelining and flow-pinning.
Functional pipelining is a technique that sub-divides the software into multiple sequential stages and assigns these stages to dedicated cores. Each core runs its application stage and then hands off the intermediate results to the next stage, and so on. Pipelining can increase locality of reference since each core runs a subset of the entire application, potentially increasing the cache-hit rate associated with executing instructions. Pipelining also provides an opportunity for load sharing since you can assign multiple cores to the stages that are more CPU-intensive. Snort lent itself well to pipelining since the existing code was already designed as a sequence of well-bounded functional stages (those named in the diagram).
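The hand-off structure described above can be sketched with queues connecting stage threads - one thread per stage, standing in for one core per stage. This is a minimal illustration only: the stage functions below are placeholders, not Snort's actual decode/reassembly/detection code, and a real deployment would pin each thread to a specific core.

```python
import queue
import threading

SENTINEL = object()  # marks end of the packet stream

def _run_stage(fn, inq, outq):
    """One pipeline stage: consume from inq, apply fn, hand off to outq."""
    while True:
        item = inq.get()
        if item is SENTINEL:
            outq.put(SENTINEL)   # propagate shutdown downstream
            break
        outq.put(fn(item))

def run_pipeline(stage_fns, packets):
    """Chain stage functions with queues and push packets through in order."""
    queues = [queue.Queue() for _ in range(len(stage_fns) + 1)]
    threads = [
        threading.Thread(target=_run_stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stage_fns)
    ]
    for t in threads:
        t.start()
    for p in packets:
        queues[0].put(p)
    queues[0].put(SENTINEL)
    results = []
    while True:
        item = queues[-1].get()
        if item is SENTINEL:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results
```

For example, `run_pipeline([decode, reassemble, detect], packets)` would keep each stage's instructions resident on its own core, which is the locality benefit the pipeline is after; the stage names here are hypothetical.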
Flow-pinning is a technique that overlays the pipelined configuration. Performing functions such as TCP reassembly on a large number of TCP flows is likely to access a large amount of data over a large range of memory locations, resulting in reduced cache efficiency. Restricting, or "pinning" individual TCP flows to a single core improves data locality because each core operates on a smaller number of flows. This translates into less data access over a smaller range of memory locations for better cache efficiency. The diagram box "Packet Classify Hash" implements this pinning function by directing packets from the same TCP flow to the same core.
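A classify-and-hash step like the one in the diagram can be sketched as follows. The hashing scheme here is an illustrative assumption, not Snort's actual implementation: it hashes the flow's endpoints after sorting them, so both directions of a TCP connection map to the same core.

```python
import hashlib

NUM_CORES = 4  # assumed core count, matching the four-core setup

def pin_flow(src_ip, dst_ip, src_port, dst_port, proto="tcp"):
    """Map a flow's 5-tuple to a core index in [0, NUM_CORES).

    Endpoints are sorted first so that A->B and B->A packets of the
    same connection always land on the same core.
    """
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    key = f"{a}{b}{proto}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % NUM_CORES
```

Because every packet of a given flow resolves to the same index, each core sees only its own subset of flows - the smaller working set that drives the cache-efficiency gain described above.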
By applying pipelining and flow-pinning, developers were able to nearly double cache efficiency, leading directly to the high application performance. And further cache optimization could potentially yield even greater results.
Achieving more than 6.2x four-core performance over single-core is a great real-world example of the potential of Intel multi-core processors. Since Snort is rather typical of a packet processing application, it is likely that the supra-linear performance gains described here can be generalized to other applications with intensive packet processing requirements.
There is one caveat I should mention. Our demonstration was done using Snort version 2.2.0, which has since been superseded by newer versions with increased functionality and a modified software architecture. While the basic transformation process and optimization techniques could be applied to the current release, the optimal multi-core software configuration and performance would likely differ from the results of our exercise.
To view the complete white paper, visit: http://download.intel.com/technology/advanced_comm/31156601.pdf
+ For an explanation of the 3.2x - 3.8x performance scaling estimate, see page 12 of the white paper http://download.intel.com/technology/advanced_comm/315697.pdf