At the recent SDN and OpenFlow World Congress in Dusseldorf, I was invited to give a talk about the cost of downtime in telecom networks and how this financial impact might be affected by NFV. This seemed to be a topic of wide interest, at least within the NFV-focused audience at the event. So in this post, I’ll summarize some of the information that I covered and suggest how we as an industry can address this challenge.
In October 2013, Heavy Reading published a comprehensive analysis titled “Mobile Network Outages and Degradations”. You can download a short version here, and it provides excellent information on this topic.
The report contains some thought-provoking numbers, starting with the fact that network outages cost service providers approximately $15 billion a year, generally representing between 1% and 5% of their annual revenue. That’s a massive impact on their P&Ls, especially at a time when network infrastructure costs are exploding because of the growth in video traffic while per-subscriber revenues are flat to declining.
There’s a fascinating chart in the report that illustrates how many “major” outages are suffered by service providers in a typical year. While 27% of operators said they average only one to three major outages per year, as many as 12% suffer between 15 and 20, while 20% suffer more than 20. Clearly, major outages are not infrequent events.
It’s also interesting to read about the financial impact of these network outages. The largest impact is the increase in subscriber churn, and of course it’s always more expensive to acquire new customers, especially high-revenue enterprises, than to retain existing ones. Other significant impacts are the operational expenses to fix the problems and the lost revenue from billable services that can’t be delivered. Slightly lower in terms of direct financial impact, but still significant, are the cost of refunds paid directly to customers and, inevitably, the legal costs relating to Service Level Agreement (SLA) issues.
With this report being published in October 2013, it’s safe to assume that it reflects traditional physical infrastructure incorporating a negligible amount of network virtualization. The networks from which these numbers were derived would have been based on fixed-function, vertically-oriented equipment, typically developed by a Telecom Equipment Manufacturer (TEM) employing its own proprietary technology at every level of the architecture. Evolved over many years, this physical infrastructure typically delivers six-nines (99.9999%) reliability, which enables the services running on it to deliver the five-nines (99.999%) uptime expected by customers, especially high-revenue enterprises with stringent SLAs.
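To see why the infrastructure needs a tighter target than the services it carries, here is a rough sketch. It assumes a service is only up when every layer beneath it is up, so the availabilities multiply. The `series_availability` helper and the 99.9995% figure for the service software are illustrative assumptions, not numbers from the Heavy Reading report.

```python
def series_availability(*layers: float) -> float:
    """Availability of a stack in which every layer must be up at the same time."""
    result = 1.0
    for availability in layers:
        result *= availability
    return result


FIVE_NINES = 0.99999          # service uptime target, from the text
SIX_NINES = 0.999999          # infrastructure reliability, from the text
SERVICE_SOFTWARE = 0.999995   # hypothetical figure, for illustration only

# Six-nines infrastructure leaves enough margin for the service layer.
on_six_nines = series_availability(SIX_NINES, SERVICE_SOFTWARE)

# The same service on five-nines infrastructure falls short of the target.
on_five_nines = series_availability(FIVE_NINES, SERVICE_SOFTWARE)

print(f"on six-nines infrastructure:  {on_six_nines:.6%}")
print(f"on five-nines infrastructure: {on_five_nines:.6%}")
print("meets five-nines:", on_six_nines >= FIVE_NINES, on_five_nines >= FIVE_NINES)
```

Under these assumptions, only the six-nines case leaves the end-to-end service above the five-nines line. Every layer added to the stack eats into the same budget.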
So what happens when the industry moves to NFV and we start replacing this fixed-function equipment with horizontally-oriented, multi-vendor solutions based on open hardware and software standards? From the perspective of service reliability, NFV has the potential to make the situation a lot worse (although there is a way to solve it).
As an example of new challenges, the services provided by NFV-based infrastructure will be delivered by Virtual Network Functions (VNFs). In some cases these will be virtualized implementations of existing software and in others they will be brand new applications. Either way, they will lack the proven track record of the applications running in today’s physical infrastructure and they will incorporate the added complexity of virtualization, so we can be sure that they will fail more often.
Similarly, a core principle of NFV is the dynamic reallocation of VMs across servers, racks and data centers. This brings improved operational efficiency and enables seamless scale-up and scale-down of applications as traffic patterns change. It also increases the number of potential failure points.
Likewise, the traffic flows through new, virtualized systems will be complex and extremely hard to debug, even with the advent of innovative testing and monitoring applications that themselves run as VNFs. However sophisticated these new tools turn out to be, it’s a safe bet that outages requiring manual intervention will take a lot longer to debug, at least in the early years of NFV.
So how do we address this problem and ensure that service providers can maintain the traditional, expected level of service uptime?
The key is that, even with the move to NFV, the network infrastructure needs to provide the six-nines reliability that enables it to detect and respond to both hardware and software problems quickly enough that the services can maintain five-nines uptime. This is what’s meant by “Carrier Grade” reliability and it requires the implementation of a number of critical functions, such as:
- At least 500km geographical redundancy for continued operation in natural disaster scenarios, such as earthquakes;
- The detection of failed Virtual Machines (VMs) in less than one second, with automatic restart and no silent failures;
- A deterministic interrupt latency of 10µs or less in the hypervisor, allowing the virtualization of CPE and access functions;
- Automatic restart and recovery from host failures;
- A fully-redundant, auto-synchronized network control plane;
- Accelerated live VM migration to ensure minimal downtime during planned maintenance;
- Telecom-grade AAA (Authentication, Authorization and Accounting) security;
- A host of other complex features too numerous to list here.
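As one illustration of the list above, sub-second detection of failed VMs is commonly built on heartbeats with a timeout. The sketch below is a minimal, hypothetical example rather than how any particular Carrier Grade platform does it: the `HeartbeatMonitor` class, the 0.5 s timeout and the VM names are all assumptions, and timestamps are passed in explicitly so the logic is easy to follow.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Flags VMs whose last heartbeat is older than the timeout."""

    # Hypothetical timeout, chosen to stay inside the one-second detection budget.
    timeout_s: float = 0.5
    last_seen: dict = field(default_factory=dict)

    def beat(self, vm_id: str, now: float) -> None:
        """Record a heartbeat received from vm_id at time now (in seconds)."""
        self.last_seen[vm_id] = now

    def failed(self, now: float) -> list:
        """Return the VMs that have been silent for longer than the timeout."""
        return sorted(
            vm_id
            for vm_id, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor()
monitor.beat("vm-a", now=0.0)
monitor.beat("vm-b", now=0.0)
monitor.beat("vm-a", now=0.4)     # vm-a keeps reporting, vm-b goes quiet

suspects = monitor.failed(now=0.6)
print(suspects)                   # a real system would trigger a restart for each
```

Because every VM must report in, a VM that hangs without crashing is still caught, which is the “no silent failures” part of the requirement. The restart itself is left out of the sketch.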
You can’t achieve these challenging requirements by starting from enterprise-class software that was originally developed for IT applications. This type of software usually achieves only three-nines (99.9%) reliability, equivalent to a downtime of almost nine hours per year. That’s a thousand times the downtime that six-nines infrastructure allows, which is roughly 32 seconds per year.
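The arithmetic behind these downtime figures is easy to check. Here is a small sketch that converts an availability percentage into downtime per year, assuming a 365-day year. The `annual_downtime_hours` function name is my own.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability: float) -> float:
    """Expected downtime per year, in hours, for a given availability (0 to 1)."""
    return (1.0 - availability) * HOURS_PER_YEAR


levels = [
    ("three-nines", 0.999),
    ("five-nines", 0.99999),
    ("six-nines", 0.999999),
]

for label, availability in levels:
    hours = annual_downtime_hours(availability)
    print(f"{label}: {hours:.5f} hours/year ({hours * 3600:.1f} seconds)")

# How much more downtime three-nines allows compared with six-nines.
ratio = annual_downtime_hours(0.999) / annual_downtime_hours(0.999999)
print(f"three-nines vs six-nines downtime ratio: {ratio:.0f}x")
```

Three-nines works out to 8.76 hours a year, five-nines to a little over five minutes, and six-nines to about half a minute.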
Fortunately for the industry, a full Carrier Grade NFV infrastructure solution is now commercially available and was demonstrated at SDN and OpenFlow World Congress, with a great reception from service providers, TEMs and analysts. This is the kind of solution that’s required to ensure that the OPEX benefits of NFV aren’t wiped out by the financial impact of network outages resulting from the complexity of this new architectural concept.