Note: This blog entry is based on the RadiSys Technical White Paper: Achieving Backplane Redundancy in AdvancedTCA Systems, June 2009. RadiSys is a Premier member of Intel® Embedded Alliance.
Communications service providers are driving toward “five nines” (99.999%) network uptime (roughly five minutes of unscheduled downtime per year) and require high-availability (HA) systems to achieve that goal. HA network systems reduce the number and duration of network failures and mitigate operational problems caused by external influences such as human error and natural disasters. Redundancy is a key attribute of most HA systems. Platforms based on the Advanced Telecommunications Computing Architecture (AdvancedTCA or ATCA) achieve much of their redundancy through hardware backplane redundancy. These platforms are becoming the foundation of today’s global wireless and wireline networks.
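As a quick sanity check on the "five nines" figure, the yearly downtime budget follows directly from the availability percentage. The sketch below assumes a 365-day year, and the function name is purely illustrative:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year (525,600 minutes)


def allowed_downtime_minutes(availability: float) -> float:
    """Return the yearly downtime budget, in minutes, for a given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


for label, availability in [
    ("three nines", 0.999),
    ("four nines", 0.9999),
    ("five nines", 0.99999),
]:
    minutes = allowed_downtime_minutes(availability)
    print(f"{label}: {minutes:.2f} minutes of downtime per year")
```

Under that assumption, five nines works out to about 5.26 minutes of downtime per year.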
Redundancy Benefits and Trade-offs
ATCA is a PICMG standard that addresses the stringent performance and reliability needs of today’s telecommunications network elements. The ATCA architecture is inherently fault tolerant. All of the common hardware elements—power, shelf management, and backplane interconnect—are redundant. For example, an Ethernet fabric with a redundant, dual-star configuration (Figure 1) is by far the most common backplane implementation for ATCA systems.
Figure 1: ATCA system with dual-star Ethernet backplane topology
Figure 1 illustrates three types of interconnect between hub/switches and node blades:
- Fabric Interconnect (red lines): The high-speed fabric interconnect network commonly serves as the conduit for payload or data-plane traffic. The fabric interconnect usually employs one of the high-speed Ethernet standards but is not limited to Ethernet standards by the ATCA specifications. Other types of high-speed serial protocols used for ATCA fabric interconnect include InfiniBand, StarFabric, PCI Express, and RapidIO.
- Base Interconnect (dark blue lines): This lower-speed interconnect is commonly used for management traffic and is defined as using 10/100/1000 BASE-T Ethernet connections among boards. For low-bandwidth systems, both control-plane and data-plane data could flow through the base interconnect instead of the fabric interconnect.
- IPMI (Intelligent Platform Management Interface, gray lines): ATCA systems use IPMI interconnect for shelf-management tasks such as hot swapping modules, e-keying, and power management. IPMI is based on the low-speed I2C serial bus interface.
For certain system architectures, the availability of both base and fabric interconnects allows the ATCA system to separate management traffic from data-plane traffic. Routing different types of inter-board traffic over different interconnect resources helps prevent security breaches and improves both overall system performance and the bandwidth available within the system.
System designers can use several redundancy models to create HA systems from ATCA components. You can apply each of the redundancy models described below to each of the three interconnect paths described above.
ATCA Redundancy Schemes
Most commonly, network elements employ two types of redundancy schemes: 2N and N+1.
- 2N means each system resource has a corresponding standby hardware module. This redundancy scheme provides a hot standby module that constantly synchronizes with the active module and takes an active role during a failover process. Although effective, a 2N redundancy approach is cost-prohibitive—it essentially doubles the system’s hardware cost—so it’s typically used only in applications where system state absolutely cannot be lost.
- N+1 is a much more cost-effective redundancy scheme than 2N. N+1 redundant systems incorporate one spare module for N active modules. For example, a typical 14-slot ATCA chassis configured for N+1 redundancy might contain 12 compute blades. Eleven of these compute blades provide active mission-mode processing while the twelfth blade serves as the standby blade. N+1 redundancy provides operational backup while incurring much less hardware cost than the 2N scheme to achieve redundancy.
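To make the cost difference concrete, here is a small sketch that counts the modules each scheme needs for a given number of active modules. The function name and the scheme labels are illustrative choices, not part of any ATCA specification:

```python
def modules_required(active: int, scheme: str) -> int:
    """Return the total module count (active plus standby) for a redundancy scheme."""
    if active < 1:
        raise ValueError("at least one active module is required")
    if scheme == "2N":
        return 2 * active   # one dedicated hot standby per active module
    if scheme == "N+1":
        return active + 1   # one shared spare for all active modules
    raise ValueError(f"unknown redundancy scheme: {scheme!r}")


active_blades = 11  # matches the 14-slot chassis example above
for scheme in ("2N", "N+1"):
    total = modules_required(active_blades, scheme)
    print(f"{scheme}: {total} blades total, {total - active_blades} standby")
```

For eleven active blades, 2N calls for 22 blades in total, while N+1 needs only the 12 described in the chassis example.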
Middleware Runs the HA Hardware
Middleware manages the resources within HA systems to deliver redundancy through fault detection and fault isolation. Specifically, middleware performs the following roles:
- Continually collects operational data
- Maintains redundancy-group mapping
- Checkpoints resources and manages their group membership
- Provides fault detection and isolation
- Initiates system recovery after a fault occurs
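The roles listed above can be sketched in a few lines of code. This is a minimal illustration only, assuming a heartbeat-timeout approach to fault detection and a single shared standby; the class, its methods, and the blade names are hypothetical and do not correspond to any particular HA middleware product:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class RedundancyGroup:
    """Hypothetical N+1 redundancy group: active members plus one shared standby."""
    active: List[str]
    standby: Optional[str]
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record_heartbeat(self, node: str, now: float) -> None:
        """Collect operational data: remember when each node last reported in."""
        self.last_seen[node] = now

    def detect_faults(self, now: float, timeout: float) -> List[str]:
        """Fault detection: flag active nodes that have been silent too long.

        A node that has never sent a heartbeat is treated as failed.
        """
        return [
            node for node in self.active
            if now - self.last_seen.get(node, float("-inf")) > timeout
        ]

    def recover(self, failed: str) -> Optional[str]:
        """Isolate the failed node and promote the standby, if one is available."""
        if failed not in self.active:
            return None
        self.active.remove(failed)       # fault isolation
        if self.standby is None:
            return None                  # no spare left to promote
        promoted, self.standby = self.standby, None
        self.active.append(promoted)     # update the group mapping
        return promoted


group = RedundancyGroup(active=["blade-1", "blade-2"], standby="blade-3")
group.record_heartbeat("blade-1", now=0.0)
group.record_heartbeat("blade-2", now=0.0)
group.record_heartbeat("blade-1", now=5.0)  # blade-2 has gone silent

for failed in group.detect_faults(now=6.0, timeout=3.0):
    promoted = group.recover(failed)
    print(f"{failed} failed; promoted {promoted}")
```

Timestamps are passed in explicitly rather than read from a clock so that the behavior is easy to follow and test. Real middleware would also checkpoint application state to the standby, which this sketch leaves out.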
Failures can occur anywhere: on links, within system nodes, in external I/O, or within the switch itself. HA systems must be designed to handle any fault with minimal traffic loss. Figure 2 shows a simplified view of a digital network communications system to help explain the following four common failure scenarios:
Figure 2: Simplified Failover System Model
Failure scenarios (corresponding to the numbers in the dashed circles in Figure 2 above):
1. If active Node x, which hosts the application, fails, standby Node y becomes the active node, assumes control of that application, and continues service. Middleware initiates a failover mechanism on the node blades to transfer the application.
2. If the active link between the switch and Fabric Switch A fails, traffic is routed through standby Fabric Switch B. Middleware reconfigures all the node blades in the system to send traffic through a link connected to Fabric Switch B.
3. If Fabric Switch A itself fails, active application Node x sends traffic through the standby link on Fabric Switch B. Middleware senses the failure and functions exactly as described in scenario 2.
4. If a network interface to Fabric Switch A fails, the corresponding standby link on Fabric Switch B becomes active and sends traffic to VLAN y. Here, middleware acts only upon the Fabric Switch ports, not the Fabric Switches themselves.
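The link-failure and switch-failure scenarios share the same outcome: every node blade is redirected to Fabric Switch B. The sketch below models just that behavior under simplifying assumptions. The class and method names are hypothetical, and a switch is treated as usable only when it and all of its backplane links are healthy:

```python
from typing import Dict, Iterable, Tuple


class DualStarFabric:
    """Hypothetical model of a dual-star fabric with one active and one standby switch."""

    def __init__(self, nodes: Iterable[str]) -> None:
        self.active = "A"
        self.standby = "B"
        self.switch_ok: Dict[str, bool] = {"A": True, "B": True}
        # link_ok[(node, switch)] stays True while that backplane link is healthy.
        self.link_ok: Dict[Tuple[str, str], bool] = {
            (node, switch): True for node in nodes for switch in ("A", "B")
        }

    def _usable(self, switch: str) -> bool:
        """A switch is usable if it is up and every link to it is healthy."""
        links_healthy = all(
            ok for (_, sw), ok in self.link_ok.items() if sw == switch
        )
        return self.switch_ok[switch] and links_healthy

    def _failover_if_needed(self) -> None:
        """Move all node traffic to the standby switch if the active one is unusable."""
        if not self._usable(self.active) and self._usable(self.standby):
            self.active, self.standby = self.standby, self.active

    def report_link_failure(self, node: str, switch: str) -> None:
        self.link_ok[(node, switch)] = False
        self._failover_if_needed()

    def report_switch_failure(self, switch: str) -> None:
        self.switch_ok[switch] = False
        self._failover_if_needed()


# A failed link to Fabric Switch A moves all traffic to Fabric Switch B.
link_case = DualStarFabric(["node-x", "node-y"])
link_case.report_link_failure("node-x", "A")
print("after link failure, active switch:", link_case.active)

# A failure of Fabric Switch A itself is handled the same way.
switch_case = DualStarFabric(["node-x", "node-y"])
switch_case.report_switch_failure("A")
print("after switch failure, active switch:", switch_case.active)
```

Both cases end with switch B active. If neither switch is usable, the sketch simply leaves the configuration unchanged; a real system would raise an alarm at that point.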
Have you incorporated redundancy into your ATCA systems? Are the redundancy models, modes, and methods listed above sufficient for your application needs? Is there something better that might help you build redundancy into your next high-availability system?
Roving Reporter (Intel Contractor)
Intel® Embedded Alliance