
Understanding the Dynamic Visual World: from Motion to Semantics

Friday, 01/31/2020 1:00pm to 3:00pm
A311 LGRC
PhD Dissertation Proposal Defense
Speaker: Huaizu Jiang

Abstract:

Human beings have the remarkable capability to learn from limited data, with partial or little annotation, in sharp contrast to computational vision models that rely on large-scale, manually labeled data. Reliance on strongly supervised models with manually labeled data inherently prohibits us from modeling the dynamic visual world, as manual annotations are tedious, expensive, and not scalable, especially if we would like to solve multiple scene understanding tasks at the same time. Even worse, in some cases manual annotation is completely infeasible, such as labeling the motion vector of each pixel (i.e., optical flow), since humans cannot reliably produce these types of labels. Motion in real-world videos, arising from camera motion, independently moving objects, and scene geometry, carries abundant information that reveals the structure and complexity of our dynamic visual world. In this thesis, we investigate how to use the motion information contained in unlabeled or partially labeled videos to reduce reliance on manual annotations.

Understanding the motion contained in a video enables us to perceive the dynamic visual world in a novel manner. In the first part, we present an approach, called SuperSloMo, which synthesizes slow-motion videos from footage captured with an ordinary camera. Converting a plain video into a slow-motion version lets us see memorable moments in our lives that are otherwise hard to see clearly with the naked eye: a difficult skateboard trick, a dog catching a ball, etc. Such a technique also has broad applications, such as generating smooth view transitions on a head-mounted virtual reality (VR) device, compressing videos, and synthesizing videos with motion blur.
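To make the idea concrete, below is a minimal sketch of flow-based frame interpolation, the principle behind slow-motion synthesis: given two consecutive frames and a precomputed optical flow field between them, intermediate frames are approximated by warping and blending the inputs under a linear-motion assumption. This is an illustrative simplification, not the SuperSloMo model itself; the flow estimator and the helper names (backward_warp, interpolate) are assumptions made for the sketch.

```python
# Minimal sketch of flow-based frame interpolation (an illustration, not the
# SuperSloMo network): frames are (H, W, C) float arrays and `flow_0to1` is an
# (H, W, 2) array of per-pixel (dx, dy) displacements from frame 0 to frame 1,
# produced by some external optical flow estimator.
import numpy as np

def backward_warp(frame, flow):
    """Bilinearly sample `frame` at locations offset by `flow`."""
    h, w = frame.shape[:2]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    x_src = np.clip(xs + flow[..., 0], 0, w - 1)   # where each output pixel comes from
    y_src = np.clip(ys + flow[..., 1], 0, h - 1)
    x0, y0 = np.floor(x_src).astype(int), np.floor(y_src).astype(int)
    x1, y1 = np.minimum(x0 + 1, w - 1), np.minimum(y0 + 1, h - 1)
    wx, wy = (x_src - x0)[..., None], (y_src - y0)[..., None]
    top = frame[y0, x0] * (1 - wx) + frame[y0, x1] * wx
    bot = frame[y1, x0] * (1 - wx) + frame[y1, x1] * wx
    return top * (1 - wy) + bot * wy

def interpolate(frame0, frame1, flow_0to1, t=0.5):
    """Approximate the frame at time t in (0, 1) assuming linear motion."""
    # Under linear motion, the flow from time t back to frame 0 is roughly -t
    # times the full flow, and the flow from time t to frame 1 is (1 - t) times it.
    warped0 = backward_warp(frame0, -t * flow_0to1)
    warped1 = backward_warp(frame1, (1 - t) * flow_0to1)
    return (1 - t) * warped0 + t * warped1
```

The actual SuperSloMo model additionally refines the intermediate flow with a second network and predicts visibility maps to handle occlusions, details this sketch leaves out.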

Humans' remarkable capability to learn from sparse data may stem in part from our ability to interpret the dynamic visual world holistically. When interpreting a scene, people simultaneously infer many properties such as depth, motion, location, and the semantic categories of different elements. Solving one task is often helpful in solving others; for example, a car's motion is likely to be rigid while a person's movement is non-rigid. Thus, motion can provide a cue for object identification, or conversely, knowledge of the object can help us interpret the motion. In the second part of this thesis, we present a self-supervised approach that learns visual representations from relative scene depth recovered from the motion fields of unlabeled videos; these representations are helpful for downstream tasks, including semantic image segmentation and object detection. We also present a semi-supervised approach that jointly solves multiple tasks, including optical flow estimation, stereo disparity estimation, occlusion detection, and semantic segmentation, leading to more efficient and accurate scene understanding.
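As a rough illustration of the self-supervised recipe described above (the class and function names here are hypothetical placeholders, not the thesis code): relative depth recovered from motion serves as a free pretext target for pretraining a feature backbone, which can then be fine-tuned on a labeled task such as semantic segmentation.

```python
# Rough illustration of the self-supervised recipe (hypothetical placeholder
# code, not the thesis implementation): pretrain a backbone to regress relative
# depth recovered from motion, then reuse its features for a labeled task.
import torch
import torch.nn as nn

class Backbone(nn.Module):
    """Tiny stand-in for a convolutional feature extractor."""
    def __init__(self, feat=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(),
        )
    def forward(self, x):
        return self.net(x)

def pretrain_on_relative_depth(backbone, frames, rel_depth, steps=100, lr=1e-3):
    """Pretext task: predict relative depth per pixel. `frames` is a
    (B, 3, H, W) tensor; `rel_depth` is a (B, 1, H, W) tensor derived from
    the motion field of unlabeled video, so no human labels are needed."""
    head = nn.Conv2d(32, 1, 1)  # depth regression head (discarded after pretraining)
    opt = torch.optim.Adam(list(backbone.parameters()) + list(head.parameters()), lr=lr)
    for _ in range(steps):
        loss = nn.functional.l1_loss(head(backbone(frames)), rel_depth)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return backbone  # features transfer to segmentation/detection fine-tuning
```

Fine-tuning would then attach a task-specific head (e.g., per-pixel class scores for segmentation) to the pretrained backbone and train on the smaller labeled set.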

Advisor: Erik Learned-Miller