Disruptive innovation in infrastructure is on the rise, and nowhere is that more evident than in the Software Defined Networking movement. But while much of the SDN discussion has focused on the data center, the better initial use case might be in the wide area network. One advocate of that approach is Michael Elmore, IT Senior Director of the Enterprise Network Engineering Infrastructure Group at Cigna, a global health service company headquartered in Bloomfield, Connecticut. Elmore also sits on the board of the Open Networking User Group (ONUG). Network World Editor in Chief John Dix asked Elmore to participate in an email-based Q&A to explore the promise of Software Defined WANs.
Members of the Open Networking User Group, on whose board you sit, have voted the WAN the top use case for SDN twice in a row now. Why do you think that is?
Consider this quote from a Wall Street firm at the recent Open Networking User Group meeting in New York: “Although much of Wall Street has focused on the ‘sexy’ datacenter aspect of SDN, interest in software-defined WAN has increased meaningfully and we believe SD-WAN could experience more rapid adoption than datacenter overlay technologies. SD-WAN can dramatically reduce the cost of WAN deployments by enabling cheaper bit rates in both CAPEX and OPEX (i.e., less cost for the same bandwidth or more bandwidth for the same cost as compared to MPLS) and less overprovisioning for the same SLAs.”
What’s more, the WAN tends to be more discrete in terms of both the organizational teams and the technology stack itself, which means organizations can move faster to embrace SD-WANs. So, if you’re interested in building a WAN that is better, faster and cheaper, there are some key issues to consider.
What WAN issues today would encourage a company to start exploring SD-WAN options?
There are many challenges and limitations with the predominant MPLS-based Layer 3 VPN service offerings that have become the standard connectivity solution for many Fortune 500 companies over the past 15 years. Although these solutions have served the enterprise well in a time of limited options, the market is opening up and is ripe for transformation.
Previous attempts to scale VPN overlays have not found their way into the mainstream, due to protocol scalability limitations and the sheer configuration complexity required for a reasonably sized enterprise network. As more and more critical business applications, such as voice, contact center and storage, converge onto IP transport, a high-performing and ultra-resilient (self-healing) IP WAN fabric becomes essential to the business.
Let’s examine the WAN challenges today:
Cost:
- The access cost component of MPLS services provided by Tier 1 service providers continues to be a challenge. Global and national providers are at the mercy of their wholesale relationships with the local exchange carriers and tend to pass these costs on to the customer, with a potential mark-up. Additional cost components, from the number of routes carried to multicast support and QoS requirements, further inflate the price.
Scale:
- It seems the MPLS providers’ control-plane and forwarding information base (FIB) tables are hitting scale limitations, causing providers to police the number of routes they are willing to accept from a customer. For the enterprise, this means more front-end negotiation, the risk of hitting those policed thresholds and ultimately of dropped routes, as well as the costs the SPs incur (and potentially pass on) through the constant churn of hardware and perpetual maintenance needed to support the growing demand for Provider Edge (PE) and backbone capacity.
Service quality:
- WANs today are not application aware, nor do they account for different applications’ performance thresholds. Soft failures and regional brown-outs can have a catastrophic impact on real-time applications.
- SLAs are only as good as the customer’s ability to measure them and hold the provider accountable. Whether it’s latency, jitter, packet loss or the absolute number of outages allowed per month, all of this requires significant management overhead. And although sourcing teams and enterprise service owners negotiate predetermined financial penalties for specific SLA breaches, those breaches often have a far more material impact on the business than an SLA credit can compensate for. How does an enterprise protect its net promoter score when a customer call is dropped because of a regional outage?
- Service provider maintenance is sometimes uncoordinated, resulting in unplanned business impact.
- Time to detect failures and restore service is often elongated. Both hard-down and soft-failure detection require synchronization between the service provider’s control plane and the customer’s control plane (a bifurcation of control planes). Customers can tune the edge timers, but they remain dependent on the provider’s backbone to detect the failure, hold down, withdraw and propagate the updates. This holds true for dual-carrier MPLS architectures as well, where customers rely on carrier A to withdraw the associated prefix(es) in a hard outage so the disparate topologies can converge and restore the session path. It gets worse in a brown-out or regional degradation, where carrier A never withdraws the prefix(es) even though applications are suffering. (A rough sketch of probe-based detection appears below.)
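To make the brown-out point concrete, here is a minimal sketch, in Python, of the kind of probe-based soft-failure detection an SD-WAN edge can perform for itself instead of waiting on a carrier’s control plane. The latency and loss thresholds and the interval counts are illustrative assumptions, not values from any particular product or network.

```python
# Illustrative sketch only: probe-based soft-failure (brown-out) detection,
# the kind of check an SD-WAN edge can run itself instead of waiting for the
# carrier's control plane to withdraw a prefix. Thresholds are assumptions.

from dataclasses import dataclass

@dataclass
class PathHealth:
    bad_intervals: int = 0      # consecutive probes over threshold
    good_intervals: int = 0     # consecutive clean probes
    degraded: bool = False

def update(health: PathHealth,
           latency_ms: float,
           loss_pct: float,
           latency_limit: float = 150.0,   # assumed real-time latency budget
           loss_limit: float = 1.0,        # assumed loss tolerance, percent
           fail_after: int = 3,            # mark degraded after 3 bad probes
           recover_after: int = 5) -> PathHealth:
    """Update path state from one probe sample and flip the degraded flag."""
    bad = latency_ms > latency_limit or loss_pct > loss_limit
    health.bad_intervals = health.bad_intervals + 1 if bad else 0
    health.good_intervals = 0 if bad else health.good_intervals + 1
    if not health.degraded and health.bad_intervals >= fail_after:
        health.degraded = True          # steer real-time traffic elsewhere
    elif health.degraded and health.good_intervals >= recover_after:
        health.degraded = False         # path is usable again
    return health

# Example: a regional brown-out shows up as rising loss and latency long
# before any prefix is withdrawn by the provider.
path = PathHealth()
for latency, loss in [(40, 0.0), (42, 0.2), (180, 2.5), (200, 3.1), (210, 4.0)]:
    update(path, latency, loss)
print("degraded:", path.degraded)   # True after three consecutive bad probes
```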
Security:
- There is no inherent data-plane encryption. Some customers elect to implement over-the-top IPsec, which tends to erode the benefits of MPLS by decreasing overall scale while adding another fault-domain layer, and it requires distributed configuration steps for setup and key management at every site. (The sketch below shows how quickly that distributed work grows.)
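The scaling concern with hand-built over-the-top encryption is easy to quantify: a full mesh of n sites needs n(n-1)/2 tunnels, each with its own configuration and key material. The sketch below is only a back-of-the-envelope illustration of that growth.

```python
# Back-of-the-envelope sketch: the number of IPsec tunnels (and key pairs /
# configuration touch points) a full mesh requires grows quadratically,
# which is the scaling problem with hand-built over-the-top encryption.

def full_mesh_tunnels(sites: int) -> int:
    """Each pair of sites needs one tunnel: n * (n - 1) / 2."""
    return sites * (sites - 1) // 2

for n in (10, 100, 500, 1000):
    print(f"{n:>5} sites -> {full_mesh_tunnels(n):>7} tunnels to configure and re-key")
```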
Visibility:
- The customer’s Layer 3 routing control plane is outsourced to the MPLS service provider, as customers are required to inject their remote-site routing tables into the SP’s network, either statically or dynamically. At that point the customer loses visibility, with very limited access to the provider edge, not to mention the backbone.
- Managing multi-homed default route selection in a single VRF requires the customer to provision site-of-origin (SOO) via a route map on the provider edge, with limited means to validate the configuration or implementation (a simplified model of the SOO rule appears after this list). This kind of manual traffic steering can take days, if not weeks, to implement. The risk: outbound traffic destined for the closest exit point could suddenly shift to another multi-homed exit point, causing latency and application lag.
- Most SPs prohibit SNMP access to the premises equipment for proactive alerting and instrumentation, limiting visibility into what is happening in the MPLS “underlay.”
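For readers unfamiliar with site-of-origin, the following is a deliberately simplified toy model of the rule it enforces. Real deployments set SOO as a BGP extended community through a route map on the provider edge, which is exactly the configuration the customer usually cannot inspect; the names and tag values here are made up for illustration.

```python
# Toy model of site-of-origin (SOO) filtering for a multi-homed site.
# Real deployments use a BGP extended community set by a route map on the
# provider edge; the point here is only to show the rule being applied.

from typing import NamedTuple

class Route(NamedTuple):
    prefix: str
    soo: str      # tag identifying the site that originated the route

def advertise_toward_site(routes: list[Route], site_tag: str) -> list[Route]:
    """Drop any route whose SOO matches the receiving site, so a
    multi-homed site never learns its own prefixes back from the provider."""
    return [r for r in routes if r.soo != site_tag]

learned = [
    Route("10.1.0.0/16", soo="65000:101"),   # originated by site 101
    Route("10.2.0.0/16", soo="65000:102"),   # originated by site 102
]

# Toward multi-homed site 101, its own prefix is filtered out:
print(advertise_toward_site(learned, "65000:101"))
# -> [Route(prefix='10.2.0.0/16', soo='65000:102')]
```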
Agility/flexibility:
- Time to provision is typically elongated and unpredictable compared to the consumer market. How is it that a consumer can have 10 to 250Mbps service provisioned to their home in a few days or weeks, yet it typically takes a corporate network administrator 60 to 90 days to get similar bandwidth? This is the classic, rigid LEC problem: retail service providers depend on wholesale relationships with the local exchange carriers to deliver services to enterprise customers, and the retailers are often dependent on LECs outside their own territory. The challenge is exacerbated when trying to procure ‘diversity’ for multiple circuits.
- There’s no inherent application-based path selection that would let cloud-based application traffic be routed out the local internet connection (a simple policy sketch follows).
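As a rough illustration of what application-based path selection means, here is a minimal sketch that steers recognized SaaS destinations out a local internet breakout and keeps everything else on the private WAN. The domain list and path names are assumptions for the example only, not a product configuration.

```python
# Minimal sketch of application-based path selection: send recognized
# cloud/SaaS destinations straight out the local internet circuit and keep
# everything else on the private WAN path. Domains and path names are
# illustrative assumptions.

SAAS_DOMAINS = {"office365.com", "salesforce.com", "webex.com"}

def select_path(destination_host: str) -> str:
    """Return which egress a flow should use, based on the application."""
    domain = ".".join(destination_host.split(".")[-2:])
    return "local-internet-breakout" if domain in SAAS_DOMAINS else "private-wan"

for host in ("outlook.office365.com", "erp.internal.example.com"):
    print(f"{host:32} -> {select_path(host)}")
```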
So if those are the WAN challenges today, what is the SDN promise?
In short, the SD-WAN can enable customers to take back control from service providers, while creating new market opportunities for those service providers.
If customers could create SD-WANs that separate the underlying transport from a software-based, overlay control plane on controller(s) owned by the customer, it would empower them, among other things, to centrally manage security policies and make