The Intel® Atom™ processor and Intel System Controller Hub (Intel® SCH) US15W constitute a powerful embedded computing platform for low-power processor boards like Kontron’s COM-Express-compatible microETXexpress®-SP module. (Kontron is a Premier member of Intel® Embedded Alliance.) Given some of the design tradeoffs made to emphasize low-power operation, it takes a bit of care to extract the maximum performance from this platform. An understanding of the internal architecture of the Intel® SCH US15W and some experimental results provide excellent insight into how you should configure such systems to maximize PCIe performance.


Figure 1 shows that the Intel SCH US15W chipset employs a communications backbone consisting of two independent 64-bit buses, each running at 1/16 the system’s FSB transfer rate (25 Mtransfers/sec or 33.3 Mtransfers/sec depending on FSB frequency). One of these buses issues memory-read and -write requests. The other is dedicated to read completions.
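As a rough check on the figures above, the short Python sketch below derives the backbone transfer rate and an upper bound on per-bus bandwidth from the FSB rate. It assumes every backbone transfer carries a full 64 bits of payload, which is an idealization and not a measured value. The 400 and 533 Mtransfers/sec FSB rates are the ones cited for the two test platforms later in the article, and the function name is purely illustrative.

```python
# Idealized peak bandwidth of one backbone bus.
# Assumption: each transfer moves a full 64 bits with no protocol overhead.
BUS_WIDTH_BYTES = 64 // 8   # 64-bit bus
BACKBONE_DIVIDER = 16       # backbone runs at 1/16 of the FSB transfer rate


def backbone_peak_mbytes_per_sec(fsb_mtransfers_per_sec: float) -> float:
    """Return the idealized peak bandwidth of one backbone bus in Mbytes/sec."""
    backbone_rate = fsb_mtransfers_per_sec / BACKBONE_DIVIDER  # Mtransfers/sec
    return backbone_rate * BUS_WIDTH_BYTES


for fsb in (400, 533):
    rate = fsb / BACKBONE_DIVIDER
    peak = backbone_peak_mbytes_per_sec(fsb)
    print(f"FSB {fsb} MT/s -> backbone {rate:.1f} MT/s, "
          f"at most {peak:.1f} Mbytes/sec per bus")
```

Under that assumption each bus tops out at roughly 200 Mbytes/sec on the slower FSB and roughly 266 Mbytes/sec on the faster one, before any sharing among peripherals.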






Figure 1: Backbone Structure of Intel® SCH US15W Chipset


Because all system peripherals and the PCIe interface controllers share this backbone, multiple simultaneous memory accesses severely reduce the instantaneous bandwidth available to any one system peripheral. That reduction is especially significant for high-bandwidth applications.


Disk accesses, for example, may not change the backbone’s average available bandwidth that much, but they can halve the available bandwidth for short durations during burst accesses. That’s important because most I/O transactions occur in bursts. Even “low-bandwidth” applications can require full-bandwidth bus access for short periods. During these short periods, “low” and “high” bandwidth requirements are indistinguishable.


PCIe Performance Measurements


Some performance measurement results can help you improve PCIe performance of systems based on the Intel® SCH US15W chipset. The following performance results were collected and analyzed using a combination of custom PCIe traffic generators and PCIe logic analyzers. The traffic generators created and executed various PCIe traffic patterns. The analyzers gathered performance data. Figure 2 shows the test setup.




Figure 2: Intel® SCH US15W Test Setup



Test patterns employed random physical memory addresses above a certain address floor to prevent operating-system overhead from affecting the measurements. Four different test patterns were used: writes and reads, each with 64-byte and 128-byte blocks. Two different systems were tested. One employed an Intel® Atom™ Z530 processor running at 1.6 GHz and the other used an Intel® Atom™ Z510 processor running at 1.1 GHz. Both systems used the Intel SCH US15W chipset with appropriate DDR memory speeds for the respective processors.


Figure 3 shows the first test series collected on a 1.1GHz Intel® Atom™ processor and the Intel SCH US15W with 128-byte packets.





Figure 3: Intel® SCH US15W Test Results, 128-Byte blocks (1.1 GHz)


These results have some interesting traits. First, note that the PCIe link doesn't reach theoretical-maximum PCIe transfer rates (250 Mbytes/sec), which should not be a surprise. The 1.1-GHz test system achieves no more than about 133 Mbytes/sec for reads and writes individually on a single PCIe link. With both PCIe links performing reads, test results show a combined throughput of about 133 Mbytes/sec with the transfer rates evenly split. With both PCIe links performing simultaneous writes, test results show a combined transfer rate of 145 Mbytes/sec, again evenly split. When performing reads on one PCIe link and writes on the other, the combined bandwidth is 212 Mbytes/sec split evenly between the two links. The 64-byte packet series for the 1.1-GHz Intel® Atom™ processor, shown in Figure 4, looks very much the same.
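To put those numbers side by side, the following Python sketch computes each link's share of the combined throughput and compares it with the 250 Mbytes/sec theoretical maximum quoted above. The throughput values are the approximate figures from the text. The dictionary layout and the function name are illustrative only.

```python
PCIE_LINK_PEAK = 250.0  # Mbytes/sec, theoretical maximum per link (from the text)

# Approximate combined throughput (Mbytes/sec) and number of active links,
# taken from the 1.1-GHz, 128-byte results described above.
results = {
    "single link, RD or WR": (133.0, 1),
    "RD + RD": (133.0, 2),
    "WR + WR": (145.0, 2),
    "RD + WR": (212.0, 2),
}


def per_link(combined_mbytes_per_sec: float, links: int) -> float:
    """Even split of the combined throughput across the active links."""
    return combined_mbytes_per_sec / links


for name, (combined, links) in results.items():
    share = per_link(combined, links)
    print(f"{name}: {share:.1f} Mbytes/sec per link "
          f"({share / PCIE_LINK_PEAK:.0%} of link peak)")
```

Seen this way, two links doing the same kind of access share roughly what one link gets alone, while mixing reads and writes lets each link keep a larger share.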





Figure 4: Intel® SCH US15W Test Results, 64-Byte blocks (1.1 GHz)


Test results for the 1.6-GHz platform, shown in Figures 5 and 6, show noticeable bandwidth increases over the 1.1-GHz platform. The FSB operates at 400 Mtransfers/sec on the 1.1-GHz platform and at 533 Mtransfers/sec on the 1.6-GHz platform, and PCIe bandwidth increases roughly in proportion to FSB speed. The two-link read test (RD RD) shows 177 Mbytes/sec combined bandwidth for the 1.6-GHz platform versus 133 Mbytes/sec for the 1.1-GHz platform.
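The proportionality claim is easy to check with the numbers in this paragraph. The Python sketch below compares the FSB speed ratio with the two-link read bandwidth ratio. The helper name is illustrative.

```python
def scaling_ratio(fast: float, slow: float) -> float:
    """Ratio of the faster platform's figure to the slower platform's."""
    return fast / slow


fsb_ratio = scaling_ratio(533, 400)        # FSB rates in Mtransfers/sec
read_bw_ratio = scaling_ratio(177, 133)    # RD RD combined bandwidth in Mbytes/sec

print(f"FSB speed ratio:      {fsb_ratio:.3f}")
print(f"RD RD bandwidth ratio: {read_bw_ratio:.3f}")
```

Both ratios come out at about 1.33, which is consistent with backbone bandwidth tracking FSB speed.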





Figure 5: Intel® SCH US15W Test Results, 128-Byte blocks (1.6 GHz)





Figure 6: Intel® SCH US15W Test Results, 64-Byte blocks (1.6 GHz)



The single-link read and single-link write tests behave differently on the two platforms. The 1.1-GHz platform delivers equivalent bandwidths of approximately 130 Mbytes/sec for single-link reads and writes. The 1.6-GHz platform with the faster FSB does not exhibit equal single-link read and write bandwidths: its test results show that 64-byte writes are significantly faster than 128-byte writes and 128-byte reads are significantly faster than 64-byte reads. (For a deeper explanation of why, see the Reference.)




You can optimize PCIe device performance by matching packet size to the type of memory access. Based on the above test results, read blocks should be 128 bytes or larger and write blocks should be 64 bytes for optimal performance. PCIe devices attached to the Intel® SCH US15W chipset cannot expect constant bandwidth at all times because they share the backbone, a critical and bandwidth-limited system resource, equally with other on-chip peripherals. Consequently, PCIe devices should pace their traffic to sustain a target average throughput rather than demand a fixed instantaneous rate. PCIe devices that attempt to enforce rigid bandwidth requirements with inadequate buffers to soak up latency variation will suffer.
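One way to reason about buffer sizing is to estimate how much data piles up while the backbone is temporarily congested. The Python sketch below is a minimal back-of-the-envelope model, not a description of any real device. The device data rate and the stall duration are hypothetical values chosen for illustration. The halved drain rate follows the earlier remark that burst accesses can halve available bandwidth for short periods, and 1 Mbyte is taken as 10^6 bytes.

```python
def backlog_bytes(produce_mbytes_per_sec: float,
                  drain_mbytes_per_sec: float,
                  stall_seconds: float) -> float:
    """Bytes that accumulate while data arrives faster than it can drain."""
    deficit = max(0.0, produce_mbytes_per_sec - drain_mbytes_per_sec)
    return deficit * 1e6 * stall_seconds


# Hypothetical device producing 100 Mbytes/sec (assumed, not from the article).
device_rate = 100.0
# Single-link bandwidth of about 133 Mbytes/sec, halved during a burst.
congested_rate = 133.0 / 2
# Hypothetical congestion period of 1 ms (assumed, not measured).
stall = 0.001

needed = backlog_bytes(device_rate, congested_rate, stall)
print(f"Buffer needed to ride out the stall: about {needed / 1024:.0f} KiB")
```

A device whose buffer is smaller than this backlog would overflow or stall during the congestion period, even though its average rate is well below the link's normal throughput.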



Have you experienced similar bandwidth limitations? How did you overcome these issues? Did you use similar analysis techniques or radically different ones?





1. Scott Foley, Performance Analysis of the Intel® System Controller Hub (Intel® SCH) US15W, Intel® Corp, Intel® Technology Journal, Volume 13, Issue 1, 2009, pages 6-15.



Steve Leibson

Roving Reporter (Intel Contractor)

Intel® Embedded Alliance