Transiency-driven Resource Management for Cloud Computing Platforms

Wednesday, 06/27/2018 1:00pm to 3:00pm
Computer Science Building, Room 151
Ph.D. Thesis Defense
Speaker: Prateek Sharma

Cloud computing platforms form the bedrock of today's computing ecosystem and provide computing resources for applications in data science, scientific computing, and online web services. Today's cloud platforms run ever more complex applications with diverse requirements, which creates new challenges in using cloud resources efficiently, from both an application and a system design perspective. Increasingly, clouds and data centers are moving towards transient computing, a new model for resource allocation that improves efficiency and reduces cost. However, transient servers can be unilaterally revoked by the cloud operator, and this uncertain availability results in loss of application state, application downtime, and performance degradation.

In this thesis, we identify and address some of the challenges in mitigating revocations and managing transient resources by developing abstractions, policies, and systems for running a wide range of applications on low-cost transient servers. First, I will describe server portfolios, a resource management technique inspired by financial portfolios that enables distributed applications to use low-cost transient cloud servers effectively. Server portfolios have been implemented as part of ExoSphere, a cluster management system for transient cloud servers that runs a wide range of applications such as Spark, MPI, and BOINC, and reduces computing costs by as much as 10x.
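
The portfolio analogy can be illustrated with a minimal sketch. It is not ExoSphere's actual policy; it assumes, purely for illustration, that each server market is revoked independently with a known probability and that all servers bought in a market are lost together. The function name, prices, and probabilities are hypothetical.

```python
def portfolio_stats(weights, prices, revoke_probs):
    """Return (expected cost per server-hour, variance of the fraction revoked).

    Assumption: markets are revoked independently, and a revocation in a
    market takes away every server held in that market.
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    # Expected price is the weighted average of the per-market prices.
    cost = sum(w * c for w, c in zip(weights, prices))
    # Variance of a weighted sum of independent Bernoulli(p) variables.
    variance = sum(w * w * p * (1.0 - p) for w, p in zip(weights, revoke_probs))
    return cost, variance


# Hypothetical markets: similar prices, same revocation probability.
prices = [0.10, 0.12, 0.11]
revoke_probs = [0.2, 0.2, 0.2]

single = portfolio_stats([1.0, 0.0, 0.0], prices, revoke_probs)
spread = portfolio_stats([1 / 3, 1 / 3, 1 / 3], prices, revoke_probs)
```

Under these assumptions, spreading servers evenly across the three markets raises the expected price slightly but cuts the variance of the fraction of servers lost to a third of the single-market value, which is the diversification effect the financial analogy refers to.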

The second part of the talk will describe resource deflation, a technique that allows cloud platforms to provide low-cost resources that are not revoked. Our resource-deflation-based system combines virtual machine overcommitment mechanisms with knee-finding based on CPU performance counters to dynamically adjust a VM's resource allocation across a wide range without large performance losses. The resulting system allows clusters to be overcommitted by over 2x. Finally, I will discuss future research directions in transient computing.
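
As a rough illustration of knee-finding, the sketch below picks the point on a performance-versus-allocation curve that lies farthest above the straight line joining its endpoints. This is one common heuristic and is an assumption here, not necessarily the method used in the thesis; the function name and sample measurements are hypothetical, and in a real system the performance values would be derived from CPU performance counters.

```python
def find_knee(allocations, performance):
    """Return the index of the knee of a concave, increasing curve.

    Both axes are normalized to [0, 1]; the knee is taken to be the point
    with the largest vertical gap above the chord from the first point to
    the last point.
    """
    x0, x1 = allocations[0], allocations[-1]
    y0, y1 = performance[0], performance[-1]
    best_index, best_gap = 0, float("-inf")
    for i, (x, y) in enumerate(zip(allocations, performance)):
        x_norm = (x - x0) / (x1 - x0)
        y_norm = (y - y0) / (y1 - y0)
        gap = y_norm - x_norm  # height above the normalized chord y = x
        if gap > best_gap:
            best_index, best_gap = i, gap
    return best_index


# Hypothetical measurements: vCPUs given to a VM and its relative throughput.
vcpus = [1, 2, 3, 4, 6, 8]
throughput = [10, 50, 80, 90, 95, 100]

knee = find_knee(vcpus, throughput)
deflated_allocation = vcpus[knee]
```

With these sample numbers the knee falls at 3 vCPUs: shrinking the VM from 8 to 3 vCPUs keeps 80% of peak throughput, while shrinking it further causes a steep loss.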

Advisor: Prashant Shenoy