Many embedded systems are deployed remotely, far from any technical staff who might be able to fix a malfunctioning computer. Military and aerospace applications certainly rely on remote systems that operate reliably, but even digital signage, industrial, and retail applications place business-critical systems far from the tech staff. Systems designed for such applications need some ability to recover automatically from faults, along with remote-management capabilities. Intel® Architecture (IA) processors can readily support such applications, and manufacturers of modular- and system-level IA-based products that target embedded applications often add reliability layers.
The watchdog timer is among the most commonly used techniques that can allow a remote system to reboot itself upon some type of system fault. Most manufacturers of IA-based computer modules, single-board computers (SBCs), and embedded-targeted systems include a watchdog timer. For example, the small Advantech* ARK-6310 ruggedized system based on a Mini-ITX board includes a watchdog timer that can be programmed with an interval ranging from 1 to 255 seconds.
Design teams typically use the timer to reset a system automatically. A properly functioning system includes a recurring task that resets the timer on a regular basis. A malfunctioning system that is hung on a task and fails to reset the timer will allow the timer to count all the way down to zero and trigger a system reset.
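The countdown-and-reset behavior described above can be sketched in a few lines of Python. This is only a software model for illustration, not a driver for any real watchdog hardware. The class name `SimulatedWatchdog`, the `kick` and `tick` methods, and the `run` helper are all hypothetical, and the 1 to 255 second range is simply borrowed from the ARK-6310 example mentioned earlier.

```python
class SimulatedWatchdog:
    """Software model of a countdown watchdog (hypothetical, not a hardware driver)."""

    def __init__(self, timeout_s):
        # Range borrowed from the 1 to 255 second example in the text.
        if not 1 <= timeout_s <= 255:
            raise ValueError("timeout must be between 1 and 255 seconds")
        self.timeout_s = timeout_s
        self.remaining_s = timeout_s
        self.reset_triggered = False

    def kick(self):
        """Reload the timer; a healthy system's recurring task calls this."""
        if not self.reset_triggered:
            self.remaining_s = self.timeout_s

    def tick(self, elapsed_s=1):
        """Advance simulated time; flag a reset when the count reaches zero."""
        if self.reset_triggered:
            return
        self.remaining_s = max(0, self.remaining_s - elapsed_s)
        if self.remaining_s == 0:
            self.reset_triggered = True


def run(watchdog, seconds, hang_at=None):
    """Simulate `seconds` of operation; the recurring task hangs at `hang_at`."""
    for t in range(seconds):
        if hang_at is None or t < hang_at:
            watchdog.kick()  # the recurring task is still alive
        watchdog.tick()      # one simulated second passes
    return watchdog.reset_triggered
```

In this model, `run(SimulatedWatchdog(5), 60)` never triggers a reset because the timer is reloaded every second, while `run(SimulatedWatchdog(5), 60, hang_at=10)` does trigger one a few simulated seconds after the recurring task stops.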
While almost all systems include watchdog functionality, teams may want to add features that allow a combination of preventive maintenance, remote management, and automatic restoration. Intel® Active Management Technology (AMT) supports remote-management capabilities and, as I covered in a recent post, is one of several IA features that can deliver mission-critical reliability.
Let’s consider how a couple of third parties have added system reliability features. LiPPERT Embedded Computers**, for example, has a technology suite called LEMT (LiPPERT Enhanced Management Technology) that it is supporting on all of its new SBCs. LEMT is available on the recently announced CoreExpress ECO2 computer-on-module (COM) product, which is based on the CoreExpress standard. LiPPERT originally developed that standard, and the Small Form Factor Special Interest Group (SFF-SIG) is now promulgating it. The ECO2 (pictured below) is based on the Intel® Atom™ E6xx processor family. The company also supports LEMT on its E6xx-based Toucon-TC COM Express module.
The LEMT technology is based on the combination of a System Management Controller (SMC) IC and an application-layer program that can be accessed locally or remotely via a network connection. The SMC IC is a microcontroller that combines power-sequencing functions needed at boot time with the ability to monitor and control elements of the system.
The LEMT technology essentially enables preventive actions that can keep a system running reliably. The LEMT application can provide details of operating parameters such as system voltages, watchdog status, the current CPU and board temperatures, fan speeds, maximum temperature over a period of operation, and other data.
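To make the preventive idea concrete, here is a minimal sketch of how a monitoring script might compare such readings against acceptable ranges. It does not use the LEMT interface, which is not described here. The function name `check_readings`, the sensor names, and the limit values are all assumptions made up for illustration.

```python
def check_readings(readings, limits):
    """Return warnings for readings that fall outside their (low, high) limits.

    Both arguments are plain dicts keyed by sensor name. The names and
    values are hypothetical and do not come from any vendor API.
    """
    warnings = []
    for name, value in sorted(readings.items()):
        if name not in limits:
            continue  # no limit configured for this sensor
        low, high = limits[name]
        if value < low or value > high:
            warnings.append(f"{name}={value} outside [{low}, {high}]")
    return warnings


# Illustrative limits only; real limits depend on the board and its enclosure.
EXAMPLE_LIMITS = {
    "cpu_temp_c": (0, 85),
    "board_temp_c": (0, 70),
    "fan_rpm": (1000, 6000),
    "vcc_5v": (4.75, 5.25),
}
```

A remote team could run a check like this periodically and act on the first warning, for instance when a fan speed drops below its lower limit, rather than waiting for the system to fail.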
LEMT works with Windows and Linux systems. It allows a technical team to preempt costly failures or at the very least prepare to replace systems that are vulnerable. Moreover, LEMT combined with AMT will allow a remote team to change system settings and perhaps keep a system operational until it can be replaced.
There are also technologies that help a remote team diagnose a failure and perhaps restore system operation after what would be a fatal fault in many cases. Advantech, for example, has a software product that it calls Advantech eSOS – emergency secondary OS for system recovery. eSOS is a Linux-based secondary OS that is stored in ROM in an embedded system and that is unaffected by any problems that may have impacted the execution of the primary OS. The eSOS doesn’t have the complex set of hardware dependencies that are present in the primary OS and can often be booted by a failed system.
In a typical implementation, an eSOS-based system would first try to reboot when triggered by the watchdog timer. If the boot to the primary OS fails, the system will boot into eSOS. The eSOS then performs a hardware analysis on the system and emails a detailed report to the remote technical team.
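The fallback sequence just described can be summarized as a small decision flow. The sketch below models only the ordering of the steps and is not Advantech's implementation. The function `recover` and its four callable parameters are hypothetical stand-ins for the boot, analysis, and reporting stages.

```python
def recover(boot_primary, boot_secondary, analyze, send_report):
    """Model the fallback order after a watchdog-triggered reset.

    Each argument is a caller-supplied callable (all hypothetical):
      boot_primary()   -> True if the primary OS came up
      boot_secondary() -> True if the secondary OS came up
      analyze()        -> hardware analysis result to report
      send_report(r)   -> delivers the report to the remote team
    Returns "primary", "secondary", or "failed".
    """
    if boot_primary():
        return "primary"      # normal reboot succeeded; nothing else to do
    if not boot_secondary():
        return "failed"       # neither OS could be started
    send_report(analyze())    # the secondary OS reports what it found
    return "secondary"
```

Passing the stages in as callables keeps the sketch independent of any particular platform and makes the ordering easy to exercise with simple stubs.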
The remote team can connect to the system via Telnet or FTP and attempt to restore system operation. In fact, the system will allow a complete restore of a Windows-based operating system. At a minimum, eSOS allows the team to determine the cause of failure and simplify the repair process.
The eSOS technology can be used with a variety of Windows operating systems including XP and Windows Embedded. The technology was initially deployed on the Advantech PCM-9361 SBC and the SOM-5761 COM products. Both are based on the Atom N270 processor. Advantech plans to support other Atom-based products and perhaps other OSs going forward.
Please share your experience with remote mission-critical systems with other followers of the Intel® Embedded Community. Your comments will be greatly appreciated. How have you handled remote monitoring and maintenance? What challenges did you face? What do you think about the LiPPERT and Advantech technologies covered here?
Roving Reporter (Intel Contractor)
Intel® Embedded Alliance
*Advantech is a Premier member of the Intel® Embedded Alliance
** LiPPERT Embedded Computers is an Affiliate member of the Alliance