High reliability is a critical requirement for many applications, particularly for military, aerospace, medical, transportation, and other applications where lives are at stake.  Many of these applications require ruggedized systems that can withstand extreme temperatures, thermal and dynamic shock, vibration and G-forces.  In this blog we will highlight key methodologies for designing rugged, reliable systems.

 

Design margin (designing a product to perform beyond the conditions expected in the field) is a key concept in designing for high reliability.  A product with greater design margin will, on average, have higher long-term reliability than a product with a lower margin.

 

Highly Accelerated Life Testing (HALT) and Highly Accelerated Stress Screening (HASS) are two processes commonly used to assess, improve, and monitor design margin. HALT is used to assess a product’s physical limitations and to expand design margins prior to manufacture. HASS is used to test products during manufacture to verify that the design margins are maintained.

 

HALT determines a product’s limitations by progressively stressing the product until it fails.  HALT typically involves testing the product under combinations of vibration and thermal stresses.  Vibration testing covers six degrees of freedom (three directional and three rotational), as illustrated in Figure 1.  The level of shock and vibration is increased in increments, usually 3-5 Grms at a time. For each level of vibration, the vibration frequency is varied randomly. Similarly, thermal testing steps the temperature up or down in 10 degree Celsius increments, as shown in Figure 2.  At each step (in either vibration or temperature), the system under stress undergoes full functional testing to ensure proper operation.
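To make the stepping procedure concrete, here is a minimal Python sketch of a step-stress loop. It is an illustration only, not any vendor's actual test software: the function names, the 20 to 120 degree range, and the unit that stops working above 95 degrees C are all hypothetical. Only the 10 degree increment comes from the description above.

```python
from typing import Callable, List, Optional


def step_levels(start: float, step: float, limit: float) -> List[float]:
    """Return stress levels from start toward limit in fixed increments."""
    if step == 0:
        raise ValueError("step must be non-zero")
    levels = []
    level = start
    # Walk upward for a positive step, downward for a negative one.
    while (step > 0 and level <= limit) or (step < 0 and level >= limit):
        levels.append(level)
        level += step
    return levels


def run_step_stress(
    levels: List[float],
    passes_functional_test: Callable[[float], bool],
) -> Optional[float]:
    """Apply each level in order; return the first level that fails, or None."""
    for level in levels:
        # A full functional test runs at every step, as in HALT.
        if not passes_functional_test(level):
            return level
    return None


# Hypothetical unit that stops working above 95 degrees C (assumed value).
hot_steps = step_levels(start=20, step=10, limit=120)
first_failure = run_step_stress(hot_steps, lambda temp_c: temp_c <= 95)
print(hot_steps)
print(first_failure)
```

The same two functions would cover a cold step stress (a negative step) or a vibration step stress (levels in Grms instead of degrees). A real HALT run would also add dwell times and combined stresses, which this sketch leaves out.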

 


Figure 1. Unlike traditional vibration tests, HALT covers six degrees of vibration movement (courtesy RadiSys).

 


Figure 2. Hot and cold step stresses (courtesy RadiSys).

 

HALT is often used as a part of an iterative design process to identify and correct failure modes.   The idea is to push the product until it fails, identify the source of the failure, and then redesign the product to eliminate the weakness.  This process is repeated until further improvement becomes impractical, thus maximizing the design margin.  Figure 3 illustrates common failure modes uncovered by HALT.

 

Vibration and Thermal Stress                   | Thermal Stress
-----------------------------------------------+--------------------------------------------------------------------
Bad solder joints (e.g., metal composition)    | Poor component placement (excessive loads, noise)
Surface mount issues (bonding strength)        | IC quality problems (excessive leakage, electrostatic discharge [ESD])
Low mechanical tolerance (expansion coefficients) | Insufficient electrical tolerance (signal quality)
Raw board problems (e.g., delamination)        | Timing issues (signal skew, setup/hold time)

Figure 3. Types of defects identified using HALT (courtesy RadiSys).

 

Once the design is finalized and the limitations have been characterized, it is time to move on to production and HASS testing.  As explained in Figure 4, HASS is intended to ensure that manufactured products have a certain level of operating margin, not to push products to failure.  Thus, the HASS conditions are established by backing off from the physical limits found through HALT testing to a level that will not take excessive life out of the product.
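As a rough illustration of "backing off" from the HALT limits, the sketch below pulls each limit part of the way back toward the nominal operating condition. The function name, the example temperatures, and the 20% back-off fraction are all assumptions made up for this example. The right back-off depends on the product and is normally tuned with real test data.

```python
def hass_limits(halt_lower, halt_upper, nominal, backoff_fraction=0.2):
    """Pull each HALT limit toward the nominal condition by backoff_fraction.

    backoff_fraction is a hypothetical tuning knob, not a standard value.
    """
    if not 0 < backoff_fraction < 1:
        raise ValueError("backoff_fraction must be between 0 and 1")
    if not halt_lower < nominal < halt_upper:
        raise ValueError("nominal must lie between the HALT limits")
    lower = halt_lower + backoff_fraction * (nominal - halt_lower)
    upper = halt_upper - backoff_fraction * (halt_upper - nominal)
    return lower, upper


# Assumed HALT results: the unit survived from -60 C to +110 C; nominal is 25 C.
screen_low, screen_high = hass_limits(-60.0, 110.0, 25.0)
print(screen_low, screen_high)
```

With these assumed numbers the screen would run from roughly -43 C to +93 C: harsh enough to expose weak units, but inside the limits where HALT showed the design fails.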

 

HALT                                                     | HASS
---------------------------------------------------------+------------------------------------------------------------
Stresses products until they fail                        | Stresses products without excessively reducing product life
Performed on a limited number of pre-production products | Performed on every product during manufacturing
Determines product limitations and failure modes         | Ensures products have sufficient margin

Figure 4. Key differences between HALT and HASS.

 

HALT and HASS are excellent tools for increasing a product’s reliability, but these tests cannot be used to produce reliability figures such as the mean time between failures (MTBF). MTBF is best determined using life tests and field data; we’ll come back to this topic in a moment.

 

From the high-level discussion we’ve had so far, it may appear that design for reliability is a simple matter of performing a few tests.  Dig deeper, however, and you will discover that designing for reliability is a complex and difficult task.   In many cases the easiest path to a reliable design is to use ruggedized commercial off-the-shelf (COTS) hardware.

 

One reason to use a COTS vendor is that HALT/HASS testing is not easy.  Setting up tests to be repeatable and to reflect real-world conditions requires considerable experience.  Poorly designed tests can lead to serious problems.  For example, poorly designed HALT testing may uncover a primary failure mode but not secondary failure modes.  If this happens, a product can pass lab tests with flying colors but fail in the field.  As another example, poorly designed HASS testing can seriously reduce product life.

 

COTS vendors who have experience with HALT/HASS testing can avoid these problems.   For example, ADLINK offers an Extreme Rugged* product line that draws on a 20-year history of rugged design.  Recent additions to this product line include the MilSystem* 840 and MilSystem* 735 COTS military computers based on the Intel® Core™2 Duo and Intel® Atom™ N270 processors, respectively.   

 

Another reason to use COTS systems is that it is very difficult to determine failure thresholds and tune HASS screens with a small number of test units (see A Look Under the Hood of HALT and HASS).  These concerns are not an issue for large COTS vendors, who have sufficient volume to thoroughly test boards, chassis, assemblies, etc.  Similarly, COTS vendors have the field data needed to arrive at robust MTBF projections.  For example, RadiSys uses two methods to arrive at MTBF calculations:

  • Sum of Failure Rates: A product MTBF is derived by adding up the failure rate of every individual board component and taking the reciprocal of the total. Component failure rates, based on millions of operational hours in the field, are found in the MIL-HDBK-217 and Telcordia (Bellcore) SR-332 databases.
  • Demonstrated Reliability: RadiSys tracks every unit it ships, and all field failures, in order to calculate a RadiSys-specific MTBF based on cumulative data collected over twenty years.
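
Both methods boil down to simple arithmetic, sketched below in Python. This is a generic illustration, not RadiSys's actual tooling. It assumes constant failure rates and a series system in which any single component failure fails the product, and the component rates and field numbers are invented for the example.

```python
def mtbf_from_failure_rates(failure_rates_per_hour):
    """Sum-of-failure-rates method: MTBF is 1 / (sum of component rates)."""
    total_rate = sum(failure_rates_per_hour)
    if total_rate <= 0:
        raise ValueError("total failure rate must be positive")
    return 1.0 / total_rate


def demonstrated_mtbf(total_operating_hours, failures):
    """Demonstrated method: fleet operating hours divided by field failures."""
    if failures <= 0:
        raise ValueError("need at least one observed failure")
    return total_operating_hours / failures


# Hypothetical component failure rates, in failures per hour.
component_rates = [2.0e-6, 0.5e-6, 1.5e-6]
predicted = mtbf_from_failure_rates(component_rates)

# Hypothetical field record: 1,200,000 unit-hours with 6 failures.
observed = demonstrated_mtbf(1_200_000, 6)

print(round(predicted), round(observed))
```

In this made-up case the prediction is about 250,000 hours and the field data shows 200,000 hours. A gap like that is one reason a vendor would want both numbers.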

 

This data is part of the “secret sauce” RadiSys brings to its ruggedized products like the Procelerant* CEQM57XT COM Express module, which combines Intel® Core™ i7 and i5 processors and the mobile Intel® QM57 Express chipset. This module is designed for mil/aero, medical and industrial applications, unmanned vehicles, and in-vehicle computers.

 

Finally, design for reliability often requires consideration of factors beyond vibration and temperature.  For example, Kontron looks at factors such as resistance to dust and water, power management, thermal flow and cooling characteristics, and even chassis design considerations such as drive cushioning and overall system layout.  By looking at these factors, Kontron is able to deliver systems that achieve an MTBF of up to 50,000 hours.
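
To put a figure like 50,000 hours in perspective, the sketch below converts an MTBF into the probability that a unit runs for a given time without failing. It assumes a constant failure rate (the exponential model), which is a common simplification and not something the vendor states. The function name and the one-year mission time are my own choices for illustration.

```python
import math


def survival_probability(mtbf_hours, mission_hours):
    """Probability of no failure during the mission, assuming a constant failure rate."""
    if mtbf_hours <= 0 or mission_hours < 0:
        raise ValueError("mtbf_hours must be positive and mission_hours non-negative")
    return math.exp(-mission_hours / mtbf_hours)


# One year of continuous operation (assumed mission profile).
one_year_hours = 24 * 365
p_one_year = survival_probability(50_000, one_year_hours)
print(round(p_one_year, 3))
```

Under these assumptions, a unit with a 50,000-hour MTBF has roughly an 84% chance of running a full year nonstop without a failure. MTBF is a fleet average, not a promise that each unit will last 50,000 hours.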

 

There is much, much more I could write on this subject.  I hope I’ve left you with a good grasp of the basics and an appreciation of the depth and complexity of this issue.  I encourage you to follow the links in this article, and to check out these related items:

 

Kontron and RadiSys are Premier members of the Intel® Embedded Alliance.  ADLINK is an Associate member of the Alliance.

 

 

Kenton Williston

Roving Reporter (Intel Contractor)

Intel® Embedded Alliance

Editor-In-Chief

Embedded Innovator magazine