PhD Dissertation Proposal: Qizheng Yang, Serving Deep Learning Models at the Quality-Cost Frontier
Speaker: Qizheng Yang
Abstract:
Modern AI services increasingly rely on deep neural networks and foundation models to deliver high-quality results under tight latency and resource constraints. However, practical deployments face two persistent challenges: (1) inference workloads are dynamic, with fluctuating traffic and diverse query characteristics, and (2) higher model quality typically requires larger models or more computation, which directly increases inference latency. Latency is the primary cost metric in serving systems because it governs user-perceived responsiveness, hardware utilization, and throughput. This thesis investigates how to design high-throughput, cost-efficient inference serving systems that adapt to time-varying workloads and explicitly optimize quality-latency trade-offs, thereby reducing serving cost while preserving performance and response quality.
First, we introduce a software framework that accelerates multi-DNN inference by exploiting redundancy across multiple well-trained models. The framework applies model fusion, sharing intermediate computation across models through graph mutation, which reduces end-to-end inference latency while maintaining task accuracy. Next, we study diffusion model serving for text-to-image generation, where quality and cost are strongly coupled to inference computation. We propose a query-aware model cascade that allocates computation selectively based on prompt difficulty, improving throughput and cost efficiency without sacrificing output quality, and we formulate resource allocation as a mixed-integer linear program (MILP) that jointly optimizes the routing threshold, server allocation, and batch sizes under time-varying demand. Building on this, we develop a hybrid serving system that combines prompt-side routing with output-side discrimination to make more reliable routing decisions under uncertainty; it further supports adaptive model-pair selection, dynamically choosing the best cascade configuration for the current latency budget and workload.
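To make the hybrid design concrete, the following Python sketch shows one way a prompt-side router and an output-side discriminator could be composed. It is a minimal illustration under assumptions: the component names (difficulty_score, quality_score) and both thresholds are hypothetical stand-ins, not the system's actual interfaces.

from dataclasses import dataclass
from typing import Callable

@dataclass
class HybridCascade:
    small_model: Callable[[str], object]      # fast, cheaper text-to-image model
    large_model: Callable[[str], object]      # slow, higher-quality model
    difficulty_score: Callable[[str], float]  # prompt-side routing signal in [0, 1]
    quality_score: Callable[[object], float]  # output-side discriminator in [0, 1]
    route_threshold: float = 0.6              # above this, skip the small model
    accept_threshold: float = 0.8             # below this, escalate to the large model

    def serve(self, prompt: str) -> object:
        # Prompts judged hard go straight to the large model.
        if self.difficulty_score(prompt) > self.route_threshold:
            return self.large_model(prompt)
        # Prompts judged easy try the small model first...
        image = self.small_model(prompt)
        # ...but the output-side discriminator can still reject a weak result
        # and escalate, making routing robust to misjudged prompts.
        if self.quality_score(image) < self.accept_threshold:
            return self.large_model(prompt)
        return image

The two thresholds are exactly the kind of knobs that should not be tuned in isolation; the MILP sketched next chooses the routing threshold jointly with server allocation and batch sizes.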
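The MILP can be sketched as follows, assuming the routing threshold is discretized into candidate values with precomputed routing fractions and quality estimates, and that per-server throughput at each batch size is profiled offline under the latency SLO (so latency constraints are implicit in the capacity tables). All constants are illustrative placeholders; the sketch uses the PuLP modeling library with its bundled CBC solver.

import pulp

D = 100.0      # offered load (queries/sec), hypothetical
Q_MIN = 0.90   # minimum acceptable mean output quality

# Candidate routing thresholds, summarized by their precomputed effects:
# frac[i] = fraction of prompts routed to the small model at threshold i,
# qual[i] = expected mean output quality at threshold i.
frac = [0.25, 0.45, 0.65, 0.85]
qual = [0.97, 0.95, 0.92, 0.88]
T = range(len(frac))

# Per-server throughput (queries/sec) at each candidate batch size,
# profiled offline so that every entry already meets the latency SLO.
batches = [1, 4, 8]
cap_small = {1: 4.0, 4: 12.0, 8: 18.0}
cap_large = {1: 1.0, 4: 3.0, 8: 4.5}

prob = pulp.LpProblem("cascade_allocation", pulp.LpMinimize)
y = pulp.LpVariable.dicts("y", T, cat="Binary")  # y[i] = 1 iff threshold i is chosen
s_small = pulp.LpVariable.dicts("s_small", batches, lowBound=0, cat="Integer")
s_large = pulp.LpVariable.dicts("s_large", batches, lowBound=0, cat="Integer")

# Objective: minimize total GPUs (per-model cost weights could be added).
prob += pulp.lpSum(s_small.values()) + pulp.lpSum(s_large.values())

prob += pulp.lpSum(y.values()) == 1  # pick exactly one threshold
easy = D * pulp.lpSum(frac[i] * y[i] for i in T)  # load sent to the small model
prob += pulp.lpSum(cap_small[b] * s_small[b] for b in batches) >= easy
prob += pulp.lpSum(cap_large[b] * s_large[b] for b in batches) >= D - easy
prob += pulp.lpSum(qual[i] * y[i] for i in T) >= Q_MIN  # quality floor

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print("chosen threshold index:", [i for i in T if y[i].value() > 0.5])
print("small-model servers:", {b: int(s_small[b].value()) for b in batches})
print("large-model servers:", {b: int(s_large[b].value()) for b in batches})

Re-solving this program as the offered load D shifts is what lets the allocation track time-varying demand.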
We then propose two extensions that broaden the scope of quality-cost optimization to emerging AI workloads. First, we propose a query-aware serving system for multimodal large language models (MLLMs). Unlike text-only LLMs, MLLMs incur distinct latency costs from vision encoding, visual-token explosion in the KV cache, and highly heterogeneous input sizes. More importantly, applying query-aware routing in this setting surfaces new challenges: routing signals must account for both textual and visual query characteristics; routing decisions interact with multimodal prefill scheduling and KV-cache pressure in ways that do not arise in text-only serving; and resource allocation must be co-optimized with routing under these compounded constraints. Second, we propose to optimize the latency of multi-agent AI workflows, in which the execution dependencies of specialized agents form a directed acyclic graph (DAG) whose structure determines end-to-end latency; the goal is to minimize latency while preserving comparable task quality.
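To illustrate why the DAG structure determines end-to-end latency, the sketch below computes the critical path of a hypothetical agent workflow, assuming each agent starts as soon as all of its dependencies finish; the agent names and latencies are invented for illustration.

import graphlib  # stdlib topological sorting (Python 3.9+)

# deps[agent] = set of agents whose outputs this agent consumes.
deps = {
    "plan":   set(),
    "search": {"plan"},
    "code":   {"plan"},
    "review": {"search", "code"},
    "answer": {"review"},
}
latency = {"plan": 0.4, "search": 1.2, "code": 2.0, "review": 0.8, "answer": 0.5}

finish = {}
for agent in graphlib.TopologicalSorter(deps).static_order():
    start = max((finish[d] for d in deps[agent]), default=0.0)
    finish[agent] = start + latency[agent]

# With enough parallelism, end-to-end latency is the critical path:
# plan -> code -> review -> answer = 0.4 + 2.0 + 0.8 + 0.5 = 3.7 seconds.
print(f"end-to-end latency: {max(finish.values()):.1f}s")

Under this model, restructuring the graph, for example by shortening or parallelizing the critical path, reduces end-to-end latency even when the total work across agents is unchanged.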
Advisor: Hui Guan