PhD Dissertation Proposal: Xiao Liu, Communication-Efficient Multi-Device Inference for Transformer Models
Speaker:
Xiao Liu
Abstract:
Transformer-based models have become a central component of modern AI systems, but their continued scaling has made efficient inference increasingly challenging. Multi-device inference provides a natural way to address the compute and memory demands of large models, but its effectiveness depends strongly on inter-device communication. When bandwidth is limited, transmitting intermediate representations can dominate latency and substantially diminish the benefits of parallel execution. A straightforward remedy is to compress the transmitted information, yet aggressive compression can significantly degrade model performance when it disrupts essential information flow.
This dissertation studies communication-efficient multi-device inference for Transformer models under bandwidth-limited settings. It begins by establishing two empirical insights that guide the subsequent method design. First, aggressive compression causes much larger performance degradation when it forms an information flow cut in the network, indicating that communication reduction must be designed to preserve essential information pathways. Second, among several strong activation compression baselines, grouped vector quantization is particularly effective for aggressive compression of Transformer activations, identifying a practical compression primitive for bandwidth-limited inference.
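To make the second insight concrete, the grouped vector quantization primitive can be sketched as follows: each activation vector is split into fixed-size sub-vectors (groups), each group is mapped to its nearest entry in a shared codebook, and only the small integer indices are transmitted. This is an illustrative sketch under assumed shapes and function names, not the dissertation's actual implementation:

```python
import numpy as np

def grouped_vq_compress(x, codebook, group_size):
    """Grouped vector quantization of activations (illustrative sketch).

    x: (n_tokens, d) activations, with d divisible by group_size.
    codebook: (k, group_size) shared codebook of sub-vectors.
    Returns codebook indices of shape (n_tokens, d // group_size);
    transmitting these indices replaces transmitting d floats per token.
    """
    n, d = x.shape
    groups = x.reshape(n * (d // group_size), group_size)
    # Nearest codebook entry per group, by squared Euclidean distance.
    dists = ((groups[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return idx.reshape(n, d // group_size)

def grouped_vq_decompress(idx, codebook):
    """Reconstruct activations from codebook indices on the receiving device."""
    n, g = idx.shape
    return codebook[idx].reshape(n, g * codebook.shape[1])
```

With a codebook of 256 entries, each group of (say) four fp32 values, 16 bytes on the wire, is replaced by a single byte index, roughly a 16x reduction in communication volume before any entropy coding.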
Building on these insights, this dissertation develops ASTRA, a sequence-parallelism-based inference framework for Vision Transformers and Large Language Models. ASTRA applies grouped vector quantization to non-local key-value representations, so that communication can be reduced without introducing the severe degradation associated with information flow cuts. The dissertation then presents CoDiT, which extends the same communication-efficient design principle to distributed inference for Diffusion Transformers. CoDiT combines vector-quantized context transmission with diffusion-specific mechanisms, including boundary-aware hybrid context and codebook-centric VQ-Attention, to reduce both communication overhead and attention complexity while preserving visual fidelity.
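The sequence-parallel pattern described above can be illustrated schematically: each device attends with full-precision keys and values for its own token shard, while keys and values arriving from other devices are dequantized from transmitted codebook indices before attention. The shapes, function names, and the `dequantize` helper here are assumptions for illustration, not ASTRA's actual design:

```python
import numpy as np

def dequantize(idx, codebook):
    # Reconstruct per-token vectors from received codebook indices.
    n, g = idx.shape
    return codebook[idx].reshape(n, g * codebook.shape[1])

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def local_attention(q, k_local, v_local, k_idx, v_idx, codebook):
    """Attention for one device's query shard.

    Local KV stay full precision; remote (non-local) KV arrive only as
    quantization indices and are reconstructed from the shared codebook,
    so inter-device traffic carries indices rather than activations.
    """
    k = np.concatenate([k_local, dequantize(k_idx, codebook)], axis=0)
    v = np.concatenate([v_local, dequantize(v_idx, codebook)], axis=0)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v
```

The design point this sketch highlights is that quantization error is confined to the non-local context: every token's own representations still flow through the network at full precision, which avoids forming an information flow cut.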
Finally, this dissertation outlines two future directions that further broaden the scope and practicality of this research. The first explores communication-compressed tensor-parallel inference to reduce per-device memory requirements and mitigate the context information loss observed in sequence-parallel methods. The second investigates an agent-driven orchestration system for heterogeneous multi-device environments, with the goal of enabling practical deployment through dynamic model placement and latency-aware request scheduling. Together, these studies advance a unified approach to communication-efficient multi-device inference for Transformer models and establish both algorithmic and system-level foundations for deploying large models under limited communication.
Advisor:
Hui Guan