
Traffic Engineering in Planet-scale Cloud Networks

Thursday, 02/25/2021 2:00pm to 4:00pm
Zoom Meeting
PhD Thesis Defense
Speaker: Rachee Singh

Zoom Meeting: https://zoom.us/j/3121059762?pwd=OE8vTkdjMDN5dnJpSHFlbjdHeENsUT09
Meeting ID: 312 105 9762
Passcode: 56Xu7a

Abstract:

Cloud wide-area networks (WANs) play a key role in enabling high-performance applications on the Internet. Cloud providers like Amazon, Google and Microsoft spend over a hundred million dollars annually to design, provision and operate their WANs to fulfill the low-latency, high-bandwidth communication demands of their clients. In the last decade, cloud providers have rapidly expanded their datacenter deployments, network equipment and backbone capacity, preparing their infrastructure to meet growing demand. This thesis re-examines the design and operation choices made by cloud providers in this phase of exponential growth along the axes of network performance, reliability and operational expenditure, using empirical evidence from a large commercial cloud provider. I then develop software-defined traffic engineering systems to remedy the inefficiencies in the operation of cloud networks revealed by the empirical analysis.

First, I demonstrate how knowledge of optical signal quality can lead to a 75% increase in capacity for 80% of the optical wavelengths in the cloud backbone. I show that these wavelengths can sustain 175 Gbps or more but were being operated at a conservative 100 Gbps, leaving 145 Tbps of network capacity on the table. This gain stems from the fact that operators have been conservative in utilizing the fiber out of concern for network reliability. My analysis shows that by dynamically adapting link capacities, it is possible to have the best of both worlds: gains in network capacity with fewer link failures. I develop a traffic engineering controller for the WAN that dynamically adapts link capacities in response to changing optical signal quality. The rate adaptive wide-area network (RADWAN) controller reclaims terabits of network capacity from the existing cloud WAN infrastructure while preventing 25% of link failures.
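
To illustrate the idea of adapting link capacity to signal quality, here is a minimal sketch. It is not the RADWAN controller: the signal-to-noise ratio (SNR) thresholds, the capacity steps below 100 Gbps, and the function names are hypothetical and chosen only for illustration. Only the 100 Gbps and 175 Gbps figures come from the abstract.

```python
# Hypothetical (SNR threshold in dB, capacity in Gbps) steps, sorted by threshold.
# The threshold values are illustrative assumptions, not figures from the thesis.
CAPACITY_STEPS = [(3.0, 50), (6.5, 100), (9.0, 125), (11.0, 150), (13.0, 175)]


def adaptive_capacity(snr_db):
    """Return the highest capacity whose SNR threshold is met, or 0 if none is."""
    capacity = 0
    for threshold_db, gbps in CAPACITY_STEPS:
        if snr_db >= threshold_db:
            capacity = gbps
    return capacity


def static_capacity(snr_db, rate_gbps=100):
    """A fixed-rate link: it runs at rate_gbps or is down (0) when SNR is too low."""
    required_db = next(t for t, g in CAPACITY_STEPS if g == rate_gbps)
    return rate_gbps if snr_db >= required_db else 0


# A made-up SNR trace for one wavelength: healthy, slightly degraded, badly degraded.
snr_trace = [14.0, 12.0, 5.0]
print([adaptive_capacity(s) for s in snr_trace])  # [175, 150, 50]
print([static_capacity(s) for s in snr_trace])    # [100, 100, 0]
```

In this toy trace the adaptive link carries more traffic when the signal is strong and degrades to a lower rate instead of failing when the signal drops, which is the intuition behind gaining capacity while also seeing fewer link failures.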

Second, I demonstrate cost inefficiencies in the design of cloud optical backbones. Cloud providers traditionally operate point-to-point inter-regional networks, where optical signals are converted to electrical signals and back at every geographical region. However, this conventional design does not account for the nature of traffic demands and the traffic flows they impose. My analysis shows that 60% of the traffic traversing 30% of the geographical regions in the WAN is pass-through traffic: it neither originates nor terminates at the region. This pass-through, or transit, traffic undergoes wasteful optical-to-electrical-to-optical (OEO) conversions at all intermediate regions in point-to-point networks, occupying scarce optical line ports and router ports. These ports contribute a majority of the cost of provisioning capacity in cloud networks with existing fiber deployments. I design and implement Shoofly, a network design tool that minimizes the hardware cost of provisioning long-haul capacity by optically bypassing network hops where converting signals from the optical to the electrical domain is unnecessary and uneconomical. Shoofly provisions bypass-enabled topologies that meet 8X the present-day demands using existing network hardware. Even under aggressive stochastic and deterministic link failure scenarios, these topologies save 32% of the cost of long-haul capacity.
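
The notion of pass-through traffic can be made concrete with a small sketch. This is not Shoofly itself: the flow representation, the region names and the volumes below are made up for illustration. It only computes, per region, the fraction of traffic that neither originates nor terminates there.

```python
from collections import defaultdict


def transit_fraction(flows):
    """Compute the pass-through share of traffic at each region.

    flows: iterable of (path, volume), where path lists the regions a flow
    traverses from source to destination and volume is positive.
    Returns {region: transit volume / total volume seen at that region}.
    """
    total = defaultdict(float)
    transit = defaultdict(float)
    for path, volume in flows:
        for i, region in enumerate(path):
            total[region] += volume
            # Interior hops only forward the flow: these are the hops where
            # OEO conversion happens without the traffic being used locally.
            if 0 < i < len(path) - 1:
                transit[region] += volume
    return {region: transit[region] / total[region] for region in total}


# Hypothetical demands: A sends 60 units to C via B, and B sends 40 units to C.
flows = [(["A", "B", "C"], 60.0), (["B", "C"], 40.0)]
print(transit_fraction(flows))  # {'A': 0.0, 'B': 0.6, 'C': 0.0}
```

Regions with a high transit fraction, like B in this example, are the natural candidates for optical bypass.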

Finally, I analyze inter-domain bandwidth costs, which make up a significant share of the operating expenditure of cloud providers. Traffic engineering systems at the cloud edge attempt to strike a fine balance between minimizing costs and maintaining the latency expected by clients. The nature of this tradeoff is complex due to the non-linear pricing schemes prevalent in the market for inter-domain bandwidth. I quantify this tradeoff and uncover several key insights from the link utilization between a large cloud provider and Internet service providers. Based on these insights, I develop Cascara, a cloud edge traffic engineering controller that optimizes inter-domain bandwidth allocations under non-linear pricing schemes. Cascara exploits the abundance of latency-equivalent peer links on the cloud edge to minimize costs without impacting latency significantly. Extensive evaluation on production traffic demands shows that Cascara saves 11-50% in bandwidth costs per cloud PoP, while bounding the increase in client latency by 3 milliseconds.
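
The abstract does not name a specific pricing scheme. As one example of non-linear pricing, the sketch below assumes percentile-based billing, a scheme commonly used for inter-domain bandwidth, in which a link is billed on a high percentile of its utilization samples rather than on total volume. The function name, the price and the utilization samples are hypothetical, and this is not Cascara's algorithm.

```python
def percentile_bill(samples_mbps, price_per_mbps, percentile=95):
    """Bill a link on the given percentile of its utilization samples.

    Uses the nearest-rank method with integer arithmetic:
    rank = ceil(percentile * n / 100), counted from 1.
    """
    ordered = sorted(samples_mbps)
    rank = (percentile * len(ordered) + 99) // 100
    return ordered[rank - 1] * price_per_mbps


# 20 made-up utilization samples for one billing period, priced at 2 per Mbps.
one_burst = [100] * 19 + [1000]       # a single bursty interval
two_bursts = [100] * 18 + [1000] * 2  # one more bursty interval

print(percentile_bill(one_burst, 2))   # 200: the burst falls in the free top 5%
print(percentile_bill(two_bursts, 2))  # 2000: the second burst sets the bill
```

Under this assumed scheme, cost is not proportional to the traffic sent: a link can absorb a limited number of bursts at no extra cost, but one burst too many raises the bill sharply. A cost-aware controller can therefore lower cost by choosing which links take bursts.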

Advisor: Phillipa Gill