The University of Massachusetts Amherst
Manning College of Information & Computer Sciences
PhD Dissertation Proposal: Xiao Liu, Communication-Efficient Multi-Device Inference for Transformer Models


Friday, April 10, 2026, 3:00 PM - 5:00 PM

Online
PhD Dissertation Proposal Defense

Speaker: Xiao Liu

Abstract:

Transformer-based models have become a central component of modern AI systems, but their continued scaling has made efficient inference increasingly challenging. Multi-device inference is a natural way to address the compute and memory demands of large models, but its effectiveness depends strongly on inter-device communication. When bandwidth is limited, transmitting intermediate representations can dominate latency and substantially diminish the benefits of parallel execution. A straightforward remedy is to compress the transmitted information, yet aggressive compression can significantly degrade model performance when it disrupts essential information flow.

This dissertation studies communication-efficient multi-device inference for Transformer models under bandwidth-limited settings. It begins by establishing two empirical insights that guide the subsequent method design. First, aggressive compression causes much larger performance degradation when it forms an information flow cut in the network, indicating that communication reduction must be designed to preserve essential information pathways. Second, among several strong activation compression baselines, grouped vector quantization is particularly effective for aggressive compression of Transformer activations, identifying a practical compression primitive for bandwidth-limited inference.

Building on these insights, this dissertation develops ASTRA, a sequence-parallelism-based inference framework for Vision Transformers and Large Language Models. ASTRA applies grouped vector quantization to non-local key-value representations, so that communication can be reduced without introducing the severe degradation associated with information flow cuts. It then presents CoDiT, which extends the same communication-efficient design principle to distributed inference for Diffusion Transformers. CoDiT combines vector-quantized context transmission with diffusion-specific mechanisms, including boundary-aware hybrid context and codebook-centric VQ-Attention, to reduce both communication overhead and attention complexity while preserving visual fidelity.

Finally, this dissertation outlines two future directions that further broaden the scope and practicality of this research. The first explores communication-compressed tensor-parallel inference to reduce per-device memory requirements and mitigate the context information loss observed in sequence-parallel methods. The second investigates an agent-driven orchestration system for heterogeneous multi-device environments, with the goal of enabling practical deployment through dynamic model placement and latency-aware request scheduling. Together, these studies advance a unified approach to communication-efficient multi-device inference for Transformer models and establish both algorithmic and system-level foundations for deploying large models under limited communication.

Advisor: Hui Guan


Join via Zoom
