The embedded Intel® Xeon® multicore processors based on the Nehalem architecture (L5508 and L5518) deliver very high computing performance to embedded systems by teaming multiple Intel® 64 architecture hyper-threaded processor cores with a shared L3 cache, a newly designed integrated memory controller (IMC), and optimized high-speed links to adjacent processor sockets and I/O controllers. The IMC manages three DIMM channels, each with its own independent physical DDR3 memory bus, permitting very large system-memory capacities. At the upper limits of memory capacity, system designers must make tradeoffs among physical memory, available memory pages, memory bus speeds, and available board space.

The obvious approach to memory subsystem design is to maximize memory bandwidth by plugging DIMMs into all three memory channels for every application. Experienced system designers know that additional memory capacity and bandwidth almost always improve application performance, and experiments conducted by IBM and Intel indicate that populating all three of a Nehalem processor node's memory channels boosts memory parallelism, resulting in significantly faster memory transactions. Those results, however, apply specifically to applications running small, dedicated, static operations on a relatively small number of static data structures distributed evenly across the memory channels, which is a limited subset of all possible applications. Real-world situations are generally not that simple.

In some circumstances, populating just two of the three memory channels of a Nehalem processor’s memory subsystem allows a larger number of simultaneous active memory page definitions than does populating all three DIMM channels. Having more active page definitions results in faster memory transactions for embedded real-time applications requiring large numbers of simultaneously active memory segments (code, data, and/or message heap). The notion that it’s better to populate two DIMM channels rather than three for certain embedded system designs is certainly counter-intuitive and therefore merits exploration.

Embedded Nehalem Memory Subsystem Overview

As Figure 1 shows, each internal Nehalem processor core has private L1 and L2 caches. The on-chip processor cores share an L3 cache, called a last-level cache or LLC. The LLC improves each processor core’s effective data transfer rate by further reducing the need to access off-chip memory.

 

 


Figure 1: Intel® Xeon® 5500 Series (Nehalem) Architecture
 

Shared Physical Memory Resources

The on-chip processor cores share the LLC, the IMC, the QPI (QuickPath Interconnect) serial links, and the three DDR3 memory interfaces. Each physical DDR3 memory channel can connect to one, two, or three DIMMs, and each channel supports DDR3-1066 or DDR3-800 bus transfers. The DDR3 DIMMs can be single-rank (one row of memory components on the module), dual-rank (two rows of memory components), or quad-rank (four rows of memory components), with a limit of eight ranks per channel. The Nehalem processor memory subsystem supports 1- and 2-Gbit memory devices, so DIMM ranks translate to memory capacity as shown in Table 1.

Table 1: Nehalem DIMM Rank and Capacity

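To make the rank-to-capacity relationship concrete, here is a small Python sketch of the arithmetic. It assumes x8 DRAM devices (eight data devices per 64-bit rank) and ignores the extra ECC device; x4-based DIMMs would double the per-rank figure, so treat Table 1 and the GE Fanuc white paper as the authoritative references.

```python
# Rough DIMM-capacity arithmetic for the rank/density combinations discussed
# above. Assumption: x8 DRAM devices, so eight data devices per 64-bit rank
# (the ECC device on a registered DIMM adds no usable capacity).

DEVICE_DENSITIES_GBIT = (1, 2)   # device densities the Nehalem IMC supports
DEVICES_PER_RANK = 8             # assumption: x8 parts; x4 parts would use 16

def dimm_capacity_gbytes(ranks: int, density_gbit: int) -> int:
    """Usable capacity of one DIMM: ranks x devices per rank x device density."""
    return ranks * DEVICES_PER_RANK * density_gbit // 8  # Gbit -> GByte

for ranks, label in ((1, "single-rank"), (2, "dual-rank"), (4, "quad-rank")):
    for density in DEVICE_DENSITIES_GBIT:
        print(f"{label:11s} {density}-Gbit devices: "
              f"{dimm_capacity_gbytes(ranks, density)} GB per DIMM")
```

With 2-Gbit devices this works out to 2 Gbytes per rank, which is consistent with the 8-Gbyte, four-rank threshold cited in the next section.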

There are two key factors to consider when deciding how many DIMM channels to populate:

  1. The IMC provides 8 active memory page assignments per rank and can support eight DDR3 ranks in each memory channel (two quad-rank DIMMs per channel, for example), but it provides no more than 32 active memory page assignments per DIMM channel, regardless of the amount of memory attached to the channel.

  2. The Nehalem IMC automatically switches to DDR3-800 mode when a memory channel is configured with more than four memory ranks (8 Gbytes), regardless of the attached DIMMs’ speed ratings. (A short sketch applying these two rules follows the list.)
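The Python sketch below applies these two rules to a single DDR3 channel. The constants come straight from the list above; the class and function names are illustrative only and do not correspond to any Intel software interface.

```python
# Minimal sketch of the two per-channel rules above. A channel is described by
# the rank count of each DIMM installed on it, e.g. (4, 4) for two quad-rank DIMMs.

from dataclasses import dataclass

PAGES_PER_RANK = 8          # active memory page assignments per rank
MAX_PAGES_PER_CHANNEL = 32  # IMC cap per DIMM channel
MAX_RANKS_PER_CHANNEL = 8   # e.g., two quad-rank DIMMs
FULL_SPEED_RANK_LIMIT = 4   # above this, the IMC falls back to DDR3-800

@dataclass
class ChannelPopulation:
    dimm_ranks: tuple  # ranks of each DIMM on the channel

    @property
    def total_ranks(self) -> int:
        return sum(self.dimm_ranks)

    def active_pages(self) -> int:
        return min(self.total_ranks * PAGES_PER_RANK, MAX_PAGES_PER_CHANNEL)

    def transfer_rate(self, dimm_rating: int = 1066) -> int:
        return 800 if self.total_ranks > FULL_SPEED_RANK_LIMIT else dimm_rating

# One dual-rank DIMM versus two quad-rank DIMMs on the same channel
for population in (ChannelPopulation((2,)), ChannelPopulation((4, 4))):
    assert population.total_ranks <= MAX_RANKS_PER_CHANNEL
    print(population.dimm_ranks, population.active_pages(),
          f"DDR3-{population.transfer_rate()}")
```

Running the sketch shows that a channel holding two quad-rank DIMMs tops out at 32 active page assignments and falls back to DDR3-800, while a single dual-rank DIMM keeps 16 active page assignments at the full DDR3-1066 rate.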


Memory Subsystem Resource Population Considerations

 

Intel® makes the following general recommendations for memory DIMM population to optimize the performance of a Nehalem memory subsystem (a simple configuration check based on these guidelines appears after the list):

  • Dual-rank DIMMs deliver the highest data throughput, especially on hardware platforms supporting two or three DIMMs per channel (rather than one). Single-rank DIMMs limit the number of open page assignments to 8 per DIMM, and multiple quad-rank DIMMs in a channel force the memory controller to drop the channel’s transfer rate to DDR3-800.
  • DDR3-1066 operation is (obviously) faster than DDR3-800 operation. (Note: The long-life Nehalem-EP L55xx series embedded processors do not support DDR3-1333 operation, unlike commercial versions of Nehalem server processors.)
  • Balanced channel populations (an equal number of DIMMs installed on each memory channel for all Nehalem sockets in the system) are strongly preferred, especially for NUMA (non-uniform memory access) memory resource applications.
  • Registered DIMMs provide better signal integrity than unbuffered DIMMs.
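As a rough illustration of how a proposed configuration might be checked against these guidelines, here is a hedged Python sketch. The data layout (per-socket lists of per-channel DIMM rank tuples) and the specific checks are my own simplification of the bullets above, not a tool from Intel or GE Fanuc.

```python
# Illustrative population check: flag the conditions the guidelines warn about.
# sockets[s][c] is a tuple of DIMM rank counts on channel c of socket s.

def check_population(sockets: list[list[tuple[int, ...]]]) -> list[str]:
    warnings = []
    dimm_counts = {len(channel) for socket in sockets for channel in socket}
    if len(dimm_counts) > 1:
        warnings.append("Unbalanced channels: DIMM counts differ across "
                        "channels/sockets (NUMA penalty likely).")
    for s, socket in enumerate(sockets):
        for c, channel in enumerate(socket):
            if sum(channel) > 4:
                warnings.append(f"Socket {s}, channel {c}: more than four ranks, "
                                "so the IMC will fall back to DDR3-800.")
            if any(ranks == 1 for ranks in channel):
                warnings.append(f"Socket {s}, channel {c}: single-rank DIMM "
                                "limits open page assignments to 8 per DIMM.")
    return warnings

# Two sockets, two channels each, one dual-rank DIMM per channel: no warnings.
print(check_population([[(2,), (2,)], [(2,), (2,)]]))
# Mixed population: two quad-rank DIMMs on one channel, a single-rank DIMM on another.
print(check_population([[(4, 4), (1,)], [(2,), (2,)]]))
```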


Case Study – The GE Fanuc A10200 Dual-Processor Server Blade

The GE Fanuc A10200 single board computer is an AdvancedTCA (ATCA) blade with two Nehalem processor sockets. Each of the board’s processor sockets has a 2-channel memory subsystem architecture, and each of the two memory channels accommodates two DDR3 DIMMs. Table 2 shows the possible DDR3 memory populations for this configuration. (Note: Table 2 also shows equivalent 3-channel memory configurations for comparison purposes, but the A10200 server blade does not implement three memory channels because the ATCA board’s form factor cannot accommodate that many DDR3 DIMM sockets.)
 

Table 2: Memory capacity versus number of DDR3 DIMM Channels


Table 2 shows that the A10200 physical memory architecture is capable of hosting more physical memory and supporting more concurrent active memory page definitions than a 3-channel implementation, given the physical limitation of four DIMM sockets per processor socket enforced by the ATCA board’s form factor. The additional active memory page assignments (a total of 256) available with the two-channel architecture using two quad-rank DIMMs per channel increase the probability that a given memory access will complete without the additional delay required to activate the memory page containing the requested data. The board would require four additional DIMM sockets (two more per processor socket, for a total of six per processor socket) to achieve the same or better performance with three memory channels.

Performance reports suggest that the 2-channel memory configuration is very likely a better-performing solution for applications with relatively high levels of I/O traffic. A 2-channel memory subsystem design is also likely to perform better for applications requiring large numbers of active memory segments to serve multiple software tasks that are dynamically swapped on a regular basis.

What experiences do you have with performance results for 2- and 3-channel memory design using Intel® Nehalem processors? Which do you prefer and why?

GE Fanuc is an Associate member of the Intel® Embedded Alliance.

This blog entry is based on the GE Fanuc White Paper: Maximizing Memory Resources on Xeon® 5500-based ATCA blades, 2009, http://www.gefanuc.com/atca_blades_gft757

Steve Leibson
Roving Reporter (Intel Contractor)
Intel® Embedded Alliance