Maximizing embedded computation performance depends on creating an optimal processor configuration using the right combination of variables, including processor core count, socket count, processor operating frequency, thread count, and FSB speed. Because the fastest configuration does not always meet budget, thermal, or power constraints, you need to understand how these factors contribute to overall computing performance to make the right design decisions. This blog entry explores the performance impact of choosing single-socket versus dual-socket configurations for systems that employ multiple processor cores.
Note that the sorts of configuration optimizations listed above are especially important when deciding how to configure various Intel® Xeon®-processor-based systems such as Kontron’s KTC5520-EATX dual-socket server board (Figure 1) and Emerson Network Power’s dual-socket ATCA-7369 AdvancedTCA server blade (Figure 2). Kontron and Emerson Network Power are Premier members of Intel® Embedded Alliance.
Figure 1: Kontron’s KTC5520-EATX dual-socket server board
Figure 2: Emerson Network Power’s dual-socket ATCA-7369 AdvancedTCA server blade
The experimental test results in this Roving Reporter blog entry are based on the SPEC CPU2006 benchmark suite, which consists of many individual sub-test programs. The overall SPEC CPU2006 benchmark score averages results from the sub-tests. (For complete information on SPEC CPU2006 refer to www.spec.org.)
Test results in [Reference 1] show that the performance gain from four processor cores versus two processor cores varies by sub-test and depends on the limiting factors mentioned above. These published test results compare the performance of a variety of core and socket configurations based on the Intel® 5100 MCH Chipset architecture and thus provide clues to optimal server-blade configurations. Note that you can expect similar but not identical results from systems based on other Intel® chipset architectures such as the Intel® 5520 Chipset, which is used in the two server boards mentioned above.
Four Cores Versus Two
First, it’s apparent from the experimental results that four processor cores are faster than two. No surprises there. Benchmark tests constrained by CPU-to-memory bandwidth (memory-bound tests) see little or no benefit from doubling the number of processor cores, but tests limited by CPU computational bandwidth (compute-bound tests) scale almost perfectly with the number of processor cores.
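The difference between the two kinds of tests can be sketched with a toy model in which throughput is capped by whichever resource runs out first. This is a minimal sketch, not a description of how the benchmark or the chipset works: the function name `modeled_throughput`, its parameters, and all of the rates (arbitrary units) are assumptions made up for illustration. The FSB term anticipates the socket comparison later in this entry.

```python
def modeled_throughput(cores, per_core_rate, fsb_count, per_fsb_rate):
    """Toy model: throughput is the smaller of the compute and memory limits.

    All rates are hypothetical, in arbitrary units of work per second.
    """
    compute_limit = cores * per_core_rate   # grows with core count
    memory_limit = fsb_count * per_fsb_rate  # grows with FSB count
    return min(compute_limit, memory_limit)


# Hypothetical compute-bound workload: memory is never the bottleneck.
compute_bound_2 = modeled_throughput(2, 1.0, 1, 100.0)
compute_bound_4 = modeled_throughput(4, 1.0, 1, 100.0)

# Hypothetical memory-bound workload: one FSB saturates before two cores do.
memory_bound_2 = modeled_throughput(2, 1.0, 1, 1.5)
memory_bound_4 = modeled_throughput(4, 1.0, 1, 1.5)

print("compute-bound scaling:", compute_bound_4 / compute_bound_2)
print("memory-bound scaling: ", memory_bound_4 / memory_bound_2)
```

In this model the compute-bound workload doubles its throughput when the core count doubles, while the memory-bound workload does not move until a second FSB is added.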
Lab testing shows that a quad-core platform configuration improves the overall SPECfp_rate_base2006 score by an average of 40% and the overall SPECint_rate_base2006 score by an average of 57% over the dual-core configuration. As expected, some computation-heavy sub-tests show nearly perfect scaling when going from a dual-core to a quad-core configuration, while memory-intensive sub-tests show no performance improvement at all.
Performance testing of a single-socket architecture with a quad-core CPU versus a dual-socket architecture with quad-core CPUs pits four cores in one socket against eight cores distributed across two sockets. Again, scaling results depend heavily on CPU computation power and memory bandwidth. Computation power is doubled in this comparison, so you can expect about the same scaling as in the dual-core versus quad-core results, but note that the second socket’s additional FSB provides some extra performance benefit. Tests show a 48% benefit for SPECfp_rate_base2006 and a 59% benefit for SPECint_rate_base2006 for the eight-core, two-socket configuration versus the four-core, single-socket configuration.
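To put the average gains quoted above in perspective, you can express each one as a fraction of ideal (linear) scaling. The following is a minimal sketch of that arithmetic; the function name `scaling_efficiency` and the dictionary labels are assumptions of this example, and the only data used are the four percentages quoted in this entry.

```python
def scaling_efficiency(gain_percent, core_ratio=2.0):
    """Return the fraction of ideal (linear) speedup actually achieved.

    gain_percent: measured throughput gain, e.g. 40 for a 40% improvement.
    core_ratio:   factor by which the core count grew (2.0 for a doubling).
    """
    speedup = 1.0 + gain_percent / 100.0
    return speedup / core_ratio


# Average gains quoted in this entry; each step doubles the core count.
quoted_gains = {
    "SPECfp_rate_base2006, dual-core -> quad-core": 40,
    "SPECint_rate_base2006, dual-core -> quad-core": 57,
    "SPECfp_rate_base2006, 1 socket x 4 cores -> 2 sockets x 4 cores": 48,
    "SPECint_rate_base2006, 1 socket x 4 cores -> 2 sockets x 4 cores": 59,
}

for label, gain in quoted_gains.items():
    print(f"{label}: {scaling_efficiency(gain):.1%} of linear scaling")
```

A value of 100% would mean that doubling the cores doubled the score. The averages land well below that because they blend compute-bound sub-tests with memory-bound ones.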
Single Socket Versus Dual Socket
When contemplating the use of one socket with four cores versus two sockets populated with two cores each, resulting in an equal number of processor cores, the expected results aren’t so obvious. Generally, a dual-socket configuration with a dual-core CPU in each socket yields higher performance than the four-core, single-socket configuration, but the dual-socket configuration also increases system cost and power consumption.
At first glance, the performance of these two configurations seems very similar. Results show a 6% increase for SPECfp_rate_base2006 and a 3% improvement for SPECint_rate_base2006 for the dual-socket configuration. The 3% figure is within the test noise, so there may be no overall improvement in integer performance at all. But you must look deeper than the average score to see the real effects of a second socket. Figure 3 illustrates the observed performance for the individual sub-tests from the SPEC CPU2006 benchmark suite. The figure shows the differences between a one-socket, quad-core architecture (dark blue bars) and a two-socket, dual-core architecture (light blue bars), both running at 2.33 GHz. From the observed performance differences, it’s clear that some of these SPEC CPU2006 sub-test programs are compute-bound and therefore do not benefit from a second FSB. Other sub-test programs are constrained by memory bandwidth and therefore do benefit from the additional FSB.
Figure 3: SPEC CPU2006 Floating Point benchmark results for four processor cores using 1-socket, quad-core and 2-socket, dual-core configurations
Many of the SPEC CPU2006 sub-tests perform equally well running on either configuration. The equal lengths of the bars in Figure 3 attest to that. Such sub-tests are CPU-bound, so their performance depends only on the processor core count and operating frequency. Some of the sub-tests perform significantly better on the dual-socket configuration. These sub-tests are memory-bound, not CPU-bound.
The dual-socket configuration, with its extra FSB, provides higher CPU-to-memory bandwidth and throughput. Figure 3 shows that the two-socket configuration improves results for memory-bound SPEC CPU2006 Floating Point sub-tests including 410.bwaves, 433.milc, 437.leslie3d, 450.soplex, 459.GemsFDTD, 470.lbm, 481.wrf, and 482.sphinx3.
The SPEC CPU2006 test results demonstrate that the computation performance of dual-core, dual-socket designs and quad-core, single-socket designs (both with a total of four processor cores) can be expected to be similar. However, the improved memory throughput of a dual-socket configuration can improve performance by as much as 27% over a single-socket configuration on some memory-bound sub-tests such as 437.leslie3d. Note that these multicore benchmark results are specific to the SPEC CPU2006 benchmarks and the Intel® Nehalem CPUs.
Results from other multicore/multi-socket CPU experiments with dual-core Intel® Xeon® processors suggest that some application code (the SNORT open-source network intrusion prevention and detection system, for example) can exhibit perfect performance scaling or even supra-linear scaling (beyond perfect scaling) with multiple processor cores in one socket, depending on how the original code is written [Reference 2]. Specifically, data locality, cache hits and misses for threads running on multicore and multi-socket systems, and the amount of data sharing among those threads will have large effects on multicore code performance. If data is held in shared caches within multicore CPUs, threads can run much faster when the processor cores occupy one socket.
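One practical way to experiment with this effect is to restrict a process to the cores of a single socket and compare its performance against an unrestricted run. The following is a minimal, Linux-only sketch and is not taken from the referenced white paper. It assumes the kernel exposes each logical CPU's socket through the sysfs file `topology/physical_package_id`, and it uses Python's `os.sched_setaffinity`, which is not available on every platform. The helper names are made up for this example.

```python
import glob
import os
import re
from collections import defaultdict


def group_cpus_by_socket(package_ids):
    """Map socket (physical package) id -> sorted list of logical CPU ids."""
    sockets = defaultdict(list)
    for cpu, package in package_ids.items():
        sockets[package].append(cpu)
    return {package: sorted(cpus) for package, cpus in sockets.items()}


def read_package_ids():
    """Read each logical CPU's physical package id from Linux sysfs."""
    package_ids = {}
    pattern = "/sys/devices/system/cpu/cpu[0-9]*/topology/physical_package_id"
    for path in glob.glob(pattern):
        cpu = int(re.search(r"cpu(\d+)/topology", path).group(1))
        with open(path) as f:
            package_ids[cpu] = int(f.read().strip())
    return package_ids


def pin_to_one_socket():
    """Restrict the caller to the CPUs of the first usable socket.

    Threads created afterwards inherit the mask. Returns the chosen socket
    id, or None if affinity control or topology data is unavailable.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # platform without affinity support
    sockets = group_cpus_by_socket(read_package_ids())
    allowed = os.sched_getaffinity(0)
    for package in sorted(sockets):
        cpus = set(sockets[package]) & allowed
        if cpus:
            os.sched_setaffinity(0, cpus)
            return package
    return None
```

Calling `pin_to_one_socket()` before starting worker threads keeps them on cores that may share cache. Whether that helps or hurts depends on how much data the threads share and how much memory bandwidth they need, so measure both ways.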
Do you know if your applications are compute-bound or memory-bound? What do you conclude about using multiple cores and multiple sockets in your next application? What does the above information tell you?
Note: SPEC, SPECint, SPECfp, and SPECrate are trademarks of the Standard Performance Evaluation Corporation.
1. Perry Taylor, Configuring and Tuning for Performance on Intel® 5100 Memory Controller Hub Chipset Based Platforms, Intel® Corp, Intel® Technology Journal, Volume 13, Issue 1, 2009, pages 16-28.
2. Intel White Paper: Supra-linear Packet Processing Performance with Intel® Multi-core Processors, 2006.
Steve Leibson
Roving Reporter (Intel Contractor)
Intel® Embedded Alliance