High-performance systems often pair Intel® Xeon® processors with coprocessors such as network processors or FPGAs. These systems can be found in areas like communications, networking, medical imaging, and military/aerospace. Today’s systems often use PCI Express* (PCIe) to connect the CPU and coprocessor, but PCIe imposes limitations in the areas of latency and memory coherence. In this blog we’ll look at the limitations of PCIe and preview forthcoming systems that use the Intel® QuickPath Interconnect to overcome these limitations.

 

Before we peer into the future, let’s look at current systems that connect the coprocessor to the CPU over a PCIe link, using the Netronome NFP-3240 as an example. This network flow processor (NFP) accelerates PKI, bulk cryptography, and deep-packet inspection. It connects to an Intel Xeon chipset using an eight-lane PCIe 2.0 interface, as shown in Figure 1.

 


Figure 1. The Netronome NFP-3240 connects to the Intel Xeon chipset via PCIe.

 

PCIe is a popular choice for a number of reasons, including its flexibility and throughput. PCIe connections can run within a board (as is the case on a single-board computer) or can be brought out across a backplane (such as those offered by Portwell). A single PCIe 2.0 lane has a throughput of 500 MB/s, and lanes can be combined to create x2, x4, x8, x16, or x32 links. That means a 32-lane PCIe link (x32) offers a peak throughput of 16 GB/s.
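To make that scaling concrete, here is a minimal C sketch that tabulates peak one-direction throughput for common link widths, assuming the 500 MB/s-per-lane PCIe 2.0 figure quoted above.

```c
#include <stdio.h>

/* Peak one-direction throughput for PCIe 2.0 links of various widths,
 * assuming the 500 MB/s-per-lane figure quoted above. */
int main(void)
{
    const double mb_per_lane = 500.0;            /* MB/s per PCIe 2.0 lane */
    const int widths[] = { 1, 2, 4, 8, 16, 32 };

    for (unsigned i = 0; i < sizeof widths / sizeof widths[0]; i++) {
        double gbps = widths[i] * mb_per_lane / 1000.0;
        printf("x%-2d link: %5.1f GB/s peak\n", widths[i], gbps);
    }
    return 0;
}
```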

 

Although PCIe offers copious throughput, it has significant drawbacks in the areas of coherence and latency. These problems arise from the fact that PCIe does not maintain memory coherence. If a coprocessor needs to access data that resides in the CPU cache, this data must first be flushed to main memory. Flushing the cache and pulling the data through the PCIe bridge involves considerable overhead. This overhead is one reason that PCIe transfers typically have latencies of 400-500 ns. This latency hinders tight coupling between the CPU and coprocessor. It also eats into the effective bandwidth, particularly if the system requires a lot of small transfers. (For more on the topic of PCIe latency, see PLX Technology’s excellent white paper series as well as the PCIe library at Embedded.com.)
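To see how that per-transfer overhead eats into effective bandwidth, here is an illustrative (not measured) C model. It assumes roughly 450 ns of overhead per transfer and the 4 GB/s peak of an x8 PCIe 2.0 link like the one in Figure 1.

```c
#include <stdio.h>

/* Illustrative model, not a measurement: effective throughput of a link
 * when each transfer pays a fixed latency penalty on top of the wire time.
 * Assumes ~450 ns overhead per PCIe transfer and a 4 GB/s (x8 PCIe 2.0)
 * peak rate. */
int main(void)
{
    const double latency_s = 450e-9;      /* assumed per-transfer overhead */
    const double peak_bps  = 4e9;         /* x8 PCIe 2.0, one direction    */
    const int sizes[] = { 64, 256, 1024, 4096, 65536 };

    for (unsigned i = 0; i < sizeof sizes / sizeof sizes[0]; i++) {
        double wire_s    = sizes[i] / peak_bps;
        double total_s   = latency_s + wire_s;
        double effective = sizes[i] / total_s;
        printf("%6d-byte transfers: %6.2f GB/s effective (%.0f%% of peak)\n",
               sizes[i], effective / 1e9, 100.0 * effective / peak_bps);
    }
    return 0;
}
```

With 64-byte transfers this simple model lands around 140 MB/s, under 4 percent of the link’s peak, which is why latency matters so much for workloads dominated by small transfers.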

 

One way to overcome these challenges is to connect the coprocessor to the CPU via the front-side bus (FSB). The idea here is to take a multi-processor system and populate some of the processor sockets on the FSB with coprocessors instead of CPUs. Figure 2 illustrates this concept. In this example, two of the sockets in a four-socket Intel® Xeon® chipset are occupied by FPGA modules. (Note that the FPGA module doesn’t just take over a logical socket; these modules actually plug into a standard processor socket.) Examples of these FPGA modules include the XtremeData XD2000i* and the Nallatech Intel Xeon FSB FPGA Socket Fillers.

 


 

Figure 2. A typical use of FSB for coprocessing. Here, two of the four Intel Xeon sockets have been filled by FPGA modules.

 

Unlike PCIe, FSB supports coherence. This means that FSB coprocessors can directly access memory held in CPU cache, significantly reducing latency. According to Ralph Wittig, the Director of Computing Platforms at Xilinx, the latency of FSB accesses is about 100 ns, 4-5 times lower than PCIe. This low-latency connection opens up a new class of high-performance coprocessing. To get an idea of the possibilities, I recommend reading the Altera white paper FPGA Coprocessing Evolution: Sustained Performance Approaches Peak Performance.

 

The challenge with using FSB is that FSB technology is being phased out in favor of the Intel® QuickPath Interconnect (QPI). QPI enables a distributed shared memory design known as a Non-Uniform Memory Access (NUMA) architecture. Instead of connecting all processors to a single pool of memory through the FSB and North Bridge (as illustrated in Figure 2), each processor now has an integrated memory controller and its own dedicated memory. Processors can access each other’s memory over QPI, which provides point-to-point connections between processors and other components. (For more on QPI and NUMA, see Maury Wright’s recent blog.)
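To make the NUMA idea concrete, here is a small C sketch using the Linux libnuma library (not tied to any particular coprocessor module): the buffer is allocated from one node’s local memory, and any core on another node reaches it over the QPI links.

```c
#include <numa.h>      /* link with -lnuma */
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    /* Allocate 16 MB from the memory that is local to node 0.  Cores on
     * node 0 reach it through their integrated memory controller; cores
     * on other nodes reach it over the point-to-point interconnect. */
    size_t size = 16u << 20;
    void *buf = numa_alloc_onnode(size, 0);
    if (buf == NULL) {
        fprintf(stderr, "numa_alloc_onnode failed\n");
        return 1;
    }

    memset(buf, 0, size);              /* touch the pages so they are placed */
    printf("Allocated %zu bytes on node 0 (of %d nodes)\n",
           size, numa_max_node() + 1);

    numa_free(buf, size);
    return 0;
}
```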

 

QPI offers a number of advantages over FSB. First, QPI carries much less protocol overhead than FSB and delivers lower latency. This makes the interconnect more efficient at passing small messages back and forth, enabling tighter coordination between CPUs and coprocessors. Second, QPI makes coprocessor-attached memory visible to the CPUs, and maintains the coherence of coprocessor memory with the rest of the system. (With PCIe and FSB, the CPU cannot access coprocessor memory directly, and these buses do not maintain coherence between coprocessor memory and shared memory.) This shared memory architecture enables tighter, more efficient CPU-coprocessor coupling. Finally, QPI is designed to span longer distances, and it can be routed through connectors. This makes it possible to connect devices on different boards via QPI, and it creates the potential for plug-in QPI cards.
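Here is a conceptual C sketch of why that coherence matters; a second thread stands in for a QPI-attached coprocessor, which is obviously a simplification. Because both agents share one coherent memory image, handing off work is just a pointer and a flag, where a PCIe design would need an explicit copy plus a doorbell or interrupt.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

/* Conceptual sketch only: a second thread stands in for a coherent,
 * QPI-attached coprocessor.  Both agents see one coherent memory image,
 * so work is handed off in place rather than copied across a bus. */

static int        work[256];
static atomic_int ready = 0;         /* 0 = empty, 1 = work posted */

static void *coprocessor(void *arg)
{
    (void)arg;
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                            /* wait for the CPU to post work */

    long sum = 0;
    for (int i = 0; i < 256; i++)    /* operate on the shared buffer */
        sum += work[i];
    printf("coprocessor: consumed work in place, sum = %ld\n", sum);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, coprocessor, NULL);

    for (int i = 0; i < 256; i++)    /* CPU produces data in shared memory */
        work[i] = i;
    atomic_store_explicit(&ready, 1, memory_order_release);

    pthread_join(t, NULL);
    return 0;
}
```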

 

The first QPI-based coprocessors will be shown at IDF in Beijing, which takes place April 13-14. While no details are public yet, all of the vendors mentioned in this blog have announced plans to use QPI. Netronome has announced plans for a QPI-based NFP, while Nallatech and XtremeData are working on QPI modules based on Xilinx and Altera FPGAs. I can’t wait to see who reveals products at IDF!

 

While we are looking forward, I should note that the forthcoming PCIe 3.0 standard is expected to bring some of the same benefits as QPI, including low latency and memory coherence. It is a bit early to talk about PCIe 3.0, however, as the standard is not yet finalized. Finalization is expected to happen later this year, with products arriving in 2011.

 

In the meantime, it is worth noting that Intel® QuickAssist Technology makes it easy to move between PCIe, FSB, and QPI. As shown in Figure 3, this API provides a common function library that works with PCIe-, FSB-, or QPI-attached accelerators, or with traditional Intel® Architecture (IA) based software algorithms. The developers I’ve spoken to give the API high marks. “It’s a nice environment to work in,” said Mike Strickland, director of strategic and technical marketing for Altera’s computer and storage business unit. “In my opinion they did a nice job.” For a practical example of using Intel QuickAssist Technology across different platforms, I recommend the white paper Scaling Security Application Performance with Intel® QuickAssist Technology.

 


Figure 3. Intel QuickAssist Technology provides a common API across PCIe, FSB, and QPI.
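I don’t have the actual Intel QuickAssist function signatures in front of me, so the C sketch below uses invented names (accel_sha256, accel_backend, and so on) purely to illustrate the idea behind Figure 3: the application calls a single entry point, and a thin dispatch layer decides whether the request goes to a PCIe-, FSB-, or QPI-attached accelerator or to a plain IA software routine.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical illustration of a "common function library": the names
 * below are invented for this sketch and are NOT the real Intel
 * QuickAssist API.  The point is that the application calls one entry
 * point and a dispatch table hides which transport the accelerator
 * sits behind. */

typedef enum { BACKEND_SOFTWARE, BACKEND_PCIE, BACKEND_FSB, BACKEND_QPI } accel_backend;

typedef int (*sha256_fn)(const void *msg, size_t len, unsigned char digest[32]);

static int sha256_software(const void *msg, size_t len, unsigned char digest[32])
{
    (void)msg; (void)len; (void)digest;
    /* ... a plain IA implementation would go here ... */
    return 0;
}

/* In a real library each entry would call into its device driver; here
 * every backend falls back to the software routine as a placeholder. */
static sha256_fn dispatch[] = {
    [BACKEND_SOFTWARE] = sha256_software,
    [BACKEND_PCIE]     = sha256_software,
    [BACKEND_FSB]      = sha256_software,
    [BACKEND_QPI]      = sha256_software,
};

int accel_sha256(accel_backend b, const void *msg, size_t len,
                 unsigned char digest[32])
{
    return dispatch[b](msg, len, digest);
}

int main(void)
{
    unsigned char digest[32];
    const char msg[] = "hello";

    /* The application code stays the same no matter where the accelerator lives. */
    accel_sha256(BACKEND_QPI, msg, sizeof msg - 1, digest);
    printf("request dispatched to the QPI backend\n");
    return 0;
}
```

In the real library the backend selection and device drivers are handled for you; the sketch only shows why application code can stay the same as the interconnect underneath it changes.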

 

I’d love to hear about your experiences with coprocessing. What challenges have you encountered with today’s solutions? Will QPI help you, or are you more interested in PCIe 3.0?

 

Portwell is an Associate member of the Intel® Embedded Alliance. Netronome Systems, Inc., Xilinx, Inc, and Altera are Affiliate members of the Alliance.

 

Kenton Williston

Roving Reporter (Intel Contractor)

Intel® Embedded Alliance

Editor-In-Chief

Embedded Innovator magazine