
MLSys Seminar: Improving Machine Learning Efficiency for Vision-Language Foundation Models

Tuesday, 10/17/2023 10:00am to 11:00am
Virtual via Zoom
Seminar

Abstract: Deep neural networks have excelled at computer vision tasks. However, their deployment can be challenging due to their large model size, especially when each vision task demands its own specifically fine-tuned network. In 2021, large Vision-Language Foundation Models (VLFMs) such as CLIP emerged and demonstrated superior multi-task performance, often referred to as transferability, across many downstream tasks. Nevertheless, training these VLFMs relies heavily on extensive image-caption datasets and requires resource-intensive training. Attempts to reproduce VLFMs using smaller public datasets and fewer computational resources often lead to reduced accuracy. Moreover, while VLFMs exhibit impressive transferability to various image recognition datasets, adapting them to other vision tasks, such as multi-label classification or object detection, is not straightforward.

To address these challenges, our research took several targeted approaches. First, recognizing the prohibitive pre-training costs associated with replicating VLFMs, we introduced a novel distillation mechanism, "DIME-FM", which transfers knowledge from CLIP to smaller models using fewer, unpaired images and sentences, thereby improving training efficiency. We transferred the knowledge from the pre-trained CLIP-ViT-L/14 model to a ViT-B/32 model using only 40M public images and 28.4M unpaired public sentences. The resulting model, "Distilled-ViT-B/32", rivals the CLIP-ViT-B/32 model pre-trained on CLIP's private WiT dataset (400M image-text pairs). Second, moving beyond single-label image recognition, we studied the fast adaptation of CLIP (as an example of a VLFM) to multi-label image recognition tasks. We proposed the Dual Context Optimization ("DualCoOp") technique, which learns paired positive and negative prompts without fine-tuning the large vision-language backbone. Learning the small number of parameters in the prompts allows CLIP to adapt rapidly across various multi-label image recognition datasets with limited annotation.
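To give a rough sense of the prompt-learning idea behind DualCoOp, the sketch below shows one simplified way to attach learnable positive and negative prompts to a frozen vision-language backbone and train only those prompts for multi-label recognition. It is a minimal illustration, not the authors' implementation: the backbone interface (encode_image, encode_prompt), the DummyBackbone stand-in, the context length, and the plain binary cross-entropy loss are all assumptions made for the example (the actual work uses CLIP's encoders, spatial aggregation, and an asymmetric loss).

import torch
import torch.nn as nn
import torch.nn.functional as F

class DummyBackbone(nn.Module):
    """Stand-in for a frozen CLIP-like model; real use would wrap CLIP's encoders."""
    def __init__(self, embed_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(3 * 224 * 224, embed_dim)
        self.txt_proj = nn.Linear(embed_dim, embed_dim)

    def encode_image(self, images):
        # (B, 3, 224, 224) -> (B, D)
        return self.img_proj(images.flatten(1))

    def encode_prompt(self, ctx):
        # (C, L, D) learnable context vectors -> (C, D) text features
        return self.txt_proj(ctx.mean(dim=1))

class DualPromptHead(nn.Module):
    """Simplified DualCoOp-style head: a positive and a negative learnable
    prompt per class; the vision-language backbone itself stays frozen."""
    def __init__(self, backbone, num_classes, ctx_len=16, embed_dim=512):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # only the prompts are trained
        self.pos_ctx = nn.Parameter(torch.randn(num_classes, ctx_len, embed_dim) * 0.02)
        self.neg_ctx = nn.Parameter(torch.randn(num_classes, ctx_len, embed_dim) * 0.02)
        self.logit_scale = nn.Parameter(torch.tensor(4.6))  # ~log(100), as in CLIP

    def forward(self, images):
        img_feat = F.normalize(self.backbone.encode_image(images), dim=-1)        # (B, D)
        pos_feat = F.normalize(self.backbone.encode_prompt(self.pos_ctx), dim=-1) # (C, D)
        neg_feat = F.normalize(self.backbone.encode_prompt(self.neg_ctx), dim=-1) # (C, D)
        scale = self.logit_scale.exp()
        pos_logits = scale * img_feat @ pos_feat.t()  # evidence that each label is present
        neg_logits = scale * img_feat @ neg_feat.t()  # evidence that each label is absent
        # Softmax over the (positive, negative) pair per class reduces to a sigmoid
        # of their difference, so the difference serves as the per-class binary logit.
        return pos_logits - neg_logits                 # (B, C)

def multilabel_loss(logits, targets):
    # Plain BCE keeps the sketch self-contained (the paper uses an asymmetric loss).
    return F.binary_cross_entropy_with_logits(logits, targets.float())

# Usage sketch: only pos_ctx, neg_ctx, and logit_scale receive gradients.
model = DualPromptHead(DummyBackbone(), num_classes=20)
logits = model(torch.randn(4, 3, 224, 224))
loss = multilabel_loss(logits, torch.randint(0, 2, (4, 20)))

Because only the per-class context vectors are optimized, the number of trainable parameters stays small, which is what makes the adaptation fast and feasible with limited annotations.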

Bio: Ximeng Sun is a PhD student in Computer Science at Boston University (since Spring 2019), supervised by Prof. Kate Saenko. Ximeng's research interests are in deep learning and computer vision, in particular efficient learning, vision-language models, and multi-task learning. Ximeng has been fortunate to intern with several top research labs: at Meta AI in 2022, collaborating with Xide Xia, Pengchuan Zhang, and Peizhao Zhang; at Google Cloud in Summer 2021, working closely with Clayton Mellina, Xiao Bian, and Kihyuk Sohn; and at IBM Research in the summers of 2019 and 2020, working alongside Rogerio Feris and Rameswar Panda.

Join the Seminar

Faculty Host: