PhD Dissertation Proposal: Roozbeh Bostandoost
Speaker:
Abstract:
This thesis presents a comprehensive investigation into the design, implementation, and analysis of resource allocation mechanisms for modern datacenters. We argue that the evolution of datacenter operations, driven by new sustainability and grid-integration goals and by the adoption of machine learning, requires a multi-faceted approach that traditional performance-centric methods can no longer provide. This work is structured in three parts, each addressing a critical facet of this challenge.
Part I redefines the optimization landscape by analyzing the complex trade-offs between multiple, often conflicting, objectives. We begin by demonstrating the fundamental tension between energy efficiency and the emerging goal of carbon efficiency (that is, adaptiveness to external signals such as carbon intensity), establishing the need for new scheduling paradigms. We then move beyond simplistic workload models to propose a dependency-aware scheduling formulation based on the Flexible Job-Shop Problem, revealing that significant carbon reduction is achievable by exploiting the granular structure of complex jobs without compromising performance.
Part II transitions from theoretical models to the development of practical, online algorithms capable of navigating the uncertainty and diversity of production environments. We first introduce a learning-augmented online algorithm that robustly handles unknown job lengths, providing strong theoretical guarantees on both performance and worst-case behavior. We then present a data-driven meta-algorithm that dynamically selects the best scheduling policy from a pool of candidates, ensuring the system remains adaptive to varying grid characteristics and workload mixes.
Part III addresses the crucial final stage of system deployment: verifying operational reliability. We propose a framework for the holistic performance analysis of large-scale, ML-augmented production systems. Using techniques from bi-level optimization, this framework formally analyzes the end-to-end system to identify and quantify practical worst-case performance risks that traditional simulation and testing methods often miss, thereby enabling operators to build trust and mitigate vulnerabilities before they impact production.
Together, these three parts provide a principled, end-to-end methodology for building the next generation of intelligent, adaptive, and verifiable resource allocation systems.
Advisor: