Machine Learning and Friends Lunch (Online)

Thursday, 03/25/2021 11:45am to 1:15pm

Virtual via Zoom

Title: Video Understanding with Modern Language Models

Abstract: Humans understand the world by processing signals from both vision and language. Similarly, we believe that language understanding can be beneficial for developing better video understanding systems. In this talk, I will present several of our proposed video understanding frameworks that incorporate models from the language domain. First, I will introduce TimeSformer, the first convolution-free architecture for video modeling built exclusively with self-attention. It achieves the best reported numbers on major action recognition benchmarks, and it is also more efficient than state-of-the-art 3D CNNs. Afterwards, I will present COBE, a new large-scale framework for learning contextualized object representations in settings involving human-object interactions. Our approach exploits automatically transcribed speech narrations from instructional YouTube videos, and it does not require manual annotations. Lastly, I will introduce Vx2Text, a multi-modal framework for video-based text generation, which outperforms the state of the art on three video-based text generation tasks: captioning, question answering, and dialoguing.

Bio: Gedas Bertasius is a postdoctoral researcher at Facebook AI working on computer vision and machine learning problems. His current research focuses on video understanding, first-person vision, and multi-modal deep learning. He received his bachelor's degree in Computer Science from Dartmouth College and a Ph.D. in Computer Science from the University of Pennsylvania. His recent work was nominated for the CVPR 2020 best paper award.

To obtain the Zoom link for this event, please see the event announcements from MLFL on the college email lists or contact Kalpesh Krishna.

Host: MLFL
