The recent SDN & OpenFlow World Congress in Dusseldorf attracted a fascinating mix of attendees. On one side were long-time veterans of the telecom industry exploring the opportunities that virtualization is bringing to service provider networks. On the other side were IT and cloud experts working on the challenges of extending their infrastructure to support telecom services.
The topic bringing these two groups together, of course, is Network Functions Virtualization (NFV). The promise of NFV is that a combination of virtualization and “cloudification” will enable service providers both to reduce their OPEX through improved network efficiency and to improve their top-line revenue through the agile delivery of new, value-added services. In order to successfully achieve this goal, IT teams and networking teams are going to have to work together in unprecedented ways. Each group approaches the challenges from a different perspective and with a different set of experiences.
One area that causes a lot of confusion and misunderstanding for folks with a background in IT and cloud infrastructure is the whole topic of “Carrier Grade” reliability for telecom services. More and more vendors are starting to use Carrier Grade terminology in connection with their products, but the requirements and challenges of Carrier Grade reliability are very different from what many people have had to deal with before. The telecom industry, of course, also brings its own alphabet soup of confusing acronyms and terminology.
In this post, we’ll outline some of the myths about Carrier Grade that we often encounter when we’re demonstrating NFV solutions to conference attendees whose main focus until now has been on enterprise-type applications.
Myth #1: Carrier Grade reliability has no direct impact on service provider revenues
In 2014, Heavy Reading published a detailed analysis titled “Mobile Network Outages & Service Degradations” that discussed the business impact of network outages. The report calculated that during the twelve months ending October 2013 service providers worldwide lost approximately $15B in revenue through such outages, representing between 1% and 5% of their total revenues. All major service providers were affected.
There are several sources of this lost revenue. First, there’s the increased rate of subscriber churn (dissatisfied customers take their business elsewhere). Second, there are the operational expenses incurred to fix the problems. Third, service providers lose the ability to capture revenue from a billable service if it’s unavailable. Fourth, future revenues are impacted due to damage to brand reputation. Fifth, refunds must be paid to enterprise customers with Service Level Agreements (SLAs) that guarantee a certain level of uptime. And finally there are inevitably legal costs relating to SLA issues.
It’s important to note that this analysis relates to a 12-month period ending in 2013, when service providers’ infrastructure was completely based on physical equipment, typically with high reliability proven over many years’ deployments and before any adoption of network virtualization.
NFV has the potential to make this situation much worse: services and applications will now be virtualized; they will be new and unproven; VMs will be dynamically reallocated across servers, racks and even data centers; traffic flows will be more complex and harder to debug; and solutions will inevitably be multi-vendor rather than from a single supplier.
As they progressively adopt NFV, it’s a business imperative for service providers to maintain Carrier Grade reliability for their critical services and high-value customers. Otherwise their overall uptime will decrease, further impacting their revenues and negating one of the key reasons (top-line growth) for moving to NFV in the first place.
Myth #2: Carrier Grade reliability is a stand-alone “feature” that you can add to your infrastructure
It’s extremely difficult to develop network infrastructure that delivers Carrier Grade reliability. Multiple, complex technologies are needed in order to guarantee six-nines (99.9999%) reliability at the infrastructure level so that services can achieve five-nines uptime.
Looking first at what it takes to guarantee network availability for virtualized applications, an optimized hypervisor is required that minimizes the duration of outages during the live migration of Virtual Machines (VMs). The standard implementation of KVM, for example, doesn’t provide the response time that’s required to minimize downtime during orchestration operations for power management, software upgrades, or reliability spare reconfiguration. In order to respond to failures of physical or virtual elements within the platform, the management software must be able to detect failed controllers, hosts or VMs very quickly and launch self-healing actions, so that service impact is minimized or eliminated when failovers occur. The system must automatically act to recover failed components and to restore sparing capability if that has been degraded. To do this, the platform must provide a full range of Carrier Grade availability APIs (shutdown notification, VM monitoring, live migration deferral, etc.), compatible with the needs of the OSS, orchestrator and VNFs. The software design must ensure there is no single point of failure that can bring down a network component, nor any “silent” VM failures that can go undetected.
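To make the idea of fast failure detection concrete, here is a minimal sketch of a heartbeat-based monitor. It is an illustration only, not how any particular platform implements VM monitoring: the class name, the method names and the 0.5-second timeout are all hypothetical, and a real platform would also need to trigger recovery and restore sparing.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Hypothetical monitor: a component is treated as failed once its
    last heartbeat is older than timeout_s (an assumed value)."""
    timeout_s: float
    last_seen: dict = field(default_factory=dict)

    def heartbeat(self, name: str, now: float) -> None:
        # Record the time at which the component last reported in.
        self.last_seen[name] = now

    def failed(self, now: float) -> list:
        # Anything silent for longer than the timeout is flagged, which
        # also catches "silent" failures that raise no explicit error.
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=0.5)
monitor.heartbeat("vm-a", now=0.0)
monitor.heartbeat("vm-b", now=0.0)
monitor.heartbeat("vm-a", now=0.4)  # vm-a keeps reporting, vm-b goes quiet
print(monitor.failed(now=0.8))
```

Timestamps are passed in explicitly so that the behavior is easy to reason about; the shorter the timeout, the sooner self-healing can start, at the cost of more false alarms.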
Second, network security requirements present major challenges. Carrier Grade security can’t be implemented as a collection of bolt-on enhancements to enterprise-class software. Rather, it must be designed in from the start as a set of coordinated, fully embedded features. These features include: full protection for the program store and hypervisor; AAA (Authentication, Authorization and Accounting) security for the configuration and control point; rate limiting, overload and Denial-of-Service (DoS) protection to secure critical network and inter-VM connectivity; encryption and localization of tenant data; secure, isolated VM networks; secure password management; and the prevention of OpenStack component spoofing.
Third, a Carrier Grade network has stringent performance requirements, in terms of both throughput and latency. The host virtual switch (vSwitch) must deliver high bandwidth to the guest VMs over secure tunnels. At the same time, the processor resources used by the vSwitch must be minimized, because service providers derive revenue from resources used to run services and applications, not those consumed by switching. The data plane processing functions running in the VMs must be accelerated to maximize the revenue-generating payload per Watt. In terms of latency constraints, the platform must ensure a deterministic interrupt latency of 10µs or less, in order for virtualization to be feasible for the most demanding CPE and access functions, such as C-RAN. Finally, live migration of VMs must occur with an outage time of less than 200ms, using a “share nothing” model in which all of a subscriber’s data and state are transferred as part of the migration. The “share nothing” model, used in preference to the shared storage model in enterprise software, ensures that legacy applications are fully supported without needing to be rewritten for deployment in NFV.
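As a rough way to put the 200ms figure in context, the sketch below compares it with the yearly downtime allowed by a five-nines service target. This is back-of-the-envelope arithmetic under stated assumptions: a 365-day year, every migration costing the full 200ms, and no other source of downtime. The function name is hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a 365-day year


def migrations_within_budget(availability: float, outage_s: float) -> int:
    """How many outages of outage_s seconds fit into the yearly downtime
    allowed by the given availability (assumption: nothing else fails)."""
    budget_s = (1.0 - availability) * SECONDS_PER_YEAR
    return int(budget_s // outage_s)


# Five-nines allows roughly 315 seconds of downtime per year.
print(migrations_within_budget(0.99999, outage_s=0.2))
```

In practice migrations share that budget with every other failure and maintenance event, which is one reason the per-migration outage has to be kept so short.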
Finally, key capabilities must be provided for network management. To eliminate the need for planned maintenance downtime windows, the system must support hitless software upgrades and hitless patches. The backup and recovery system must be fully integrated with the platform software. And support must be implemented for “Northbound” APIs that interface the infrastructure platform to the OSS/BSS and NFV orchestrator, including SNMP, NETCONF, XML, REST APIs, OpenStack plug-ins and ACPI.
You can’t achieve these challenging requirements by starting from enterprise-class software that was originally developed for IT applications. This type of software usually achieves three-nines (99.9%) reliability, equivalent to a downtime of almost nine hours per year.
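The relationship between a number of “nines” and yearly downtime is simple arithmetic, sketched below. The only assumptions are a 365-day year and the hypothetical function name.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def annual_downtime_seconds(nines: int) -> float:
    """Downtime per year allowed by an availability of N nines
    (3 -> 99.9%, 5 -> 99.999%, 6 -> 99.9999%)."""
    unavailability = 10.0 ** (-nines)
    return unavailability * HOURS_PER_YEAR * 3600


for n in (3, 5, 6):
    secs = annual_downtime_seconds(n)
    print(f"{n} nines: {secs / 3600:.2f} h = {secs / 60:.2f} min = {secs:.1f} s")
```

Three-nines works out to 8.76 hours per year, the “almost nine hours” mentioned above. Five-nines allows a little over five minutes and six-nines roughly half a minute.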
Myth #3: Carrier Grade reliability can be implemented in the network applications themselves
There’s been a lot of industry discussion recently about Application-Level High Availability (HA). This concept places the burden of ensuring service-level reliability on the applications themselves, which in an NFV implementation are the VNFs. If it’s achievable, it’s an attractive idea because it means that the underlying NFV Infrastructure (NFVI) could be based on a simple open-source or enterprise-grade platform.
Even though such platforms, designed for IT applications, typically only achieve three-nines reliability, that would be acceptable if the applications themselves could recover from any potential platform failures, power disruptions, network attacks, link failures etc. while also maintaining their operation during server maintenance events.
Unfortunately, Application-Level HA by itself doesn’t achieve these goals. No matter which of the standard HA configurations you choose (Active / Standby, Active / Active, N-Way Active with load balancing), it won’t be sufficient to ensure Carrier Grade reliability at the platform level.
In order to ensure five-nines availability for services delivered in an NFV implementation, you need a system that guarantees six-nines uptime at the platform level, so that the platform can detect and recover from failures quickly enough to maintain operation of the services. This implies that the platform needs to deal with a wide range of disruptive events which cannot be addressed by the applications because they don’t have the right level of system awareness or platform management capability.
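One simplified way to see why the platform needs more nines than the service is to treat the two as a chain, where the service is up only when both the platform and the application are up. The sketch below assumes independent failures and assumes that the application cannot mask platform outages, which is a deliberate simplification of the argument above. The function names are hypothetical.

```python
def serial_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up
    (assumes failures are independent)."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


def required_app_availability(target: float, platform: float) -> float:
    """Application availability needed to reach the service target on a
    given platform. A value above 1.0 means the target is unreachable."""
    return target / platform


FIVE_NINES = 0.99999
for label, platform in (("three-nines", 0.999), ("six-nines", 0.999999)):
    needed = required_app_availability(FIVE_NINES, platform)
    verdict = "unreachable" if needed > 1.0 else f"needs {needed:.6%} from the application"
    print(f"{label} platform: {verdict}")
```

Under this model a three-nines platform already consumes far more than the whole five-nines budget, while a six-nines platform leaves most of the budget to the applications.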
For anyone involved in architecting, developing or deploying any part of an end-to-end NFV solution, this new white paper “NFV: The Myth of Application Level HA” is required reading. It provides a detailed technical analysis of the tradeoffs between Application-Level HA and Carrier Grade platforms and gives a clear direction to follow.
Myth #4: Carrier Grade reliability is something you get from the OPNFV project
Formally launched in September 2014, the Open Platform for NFV (OPNFV) project is an open source reference platform intended to accelerate the introduction of NFV solutions and services. OPNFV operates under the Linux Foundation and the primary goal of the project is to implement the ETSI specification for NFV.
Several service providers have been quoted publicly as confirming that they see the OPNFV reference platform as a way to accelerate the transition from the standards established by ETSI to actual NFV deployments. Of course they recognize that OPNFV code can’t be directly deployed into live networks, anticipating that software companies will use OPNFV as the baseline for commercial solutions with full SLA support.
OPNFV’s initial focus is NFV Infrastructure (NFVI) and Virtualized Infrastructure Management (VIM) software, implemented by integrating components from upstream projects such as OpenDaylight, OpenStack, Ceph Storage, KVM, Open vSwitch and Linux. Along with application programming interfaces (APIs) to other NFV elements, these NFVI and VIM components form the basic infrastructure required for hosting VNFs and interfacing to Management and Orchestration (MANO).
The first OPNFV release “Arno” became available in June 2015. Arno is a developer-focused release that includes the NFVI and VIM components. The combination offers the ability to deploy and connect VNFs in a cloud architecture based on OpenStack and OpenDaylight. The next release “Brahmaputra” is planned as the first “lab-ready” release, incorporating numerous enhancements in areas such as installation, installable artifacts, continuous integration, improved documentation and sample test scenarios.
Neither Arno nor Brahmaputra, however, incorporates any features that contribute to delivering Carrier Grade reliability in the NFVI platform. This is an example of an area where companies with proven experience in delivering six-nines infrastructure will continue to add critical value.
Solutions such as Wind River’s Titanium Server build on community-driven reference code and enhance it with functionality that is an absolute requirement for platforms deployed in live service provider networks, while remaining fully compatible with all the applicable open standards.
At SDN & OpenFlow World Congress, we enjoyed exploring these topics with attendees who stopped by our booth to see a comprehensive demonstration of a proven Carrier Grade NFV cloud solution that’s already been selected by multiple customers. The folks whose background was primarily in IT or cloud applications quickly developed a whole new appreciation for the complexities associated with guaranteeing the level of reliability that’s an absolute requirement in the world of telecom.
If you missed us in Dusseldorf, or simply want to learn more about how we deliver NFVI with the performance and uptime that service providers require, visit our Titanium Server website.