PhD Thesis Defense: Fabien Delattre, Ego-Motion Estimation, Video Synchronization, and Self-Balancing
Speaker:
Fabien Delattre
Abstract:
Motion is a fundamental part of the visual world. For animals, the ability to perceive motion is essential for survival, whether to escape predators, locate prey, or navigate dynamic environments. Computer vision algorithms, likewise, stand to benefit from perceived motion, which provides a rich source of information for understanding the world and guiding action. This thesis examines three complementary roles of motion in computer vision: estimating the ego-motion of the observer, leveraging scene motion for downstream applications, and using motion cues directly for control.
First, we introduce a method to estimate the rotation of a handheld camera in the presence of many moving objects. Because this setting is not well addressed by existing datasets, we provide a new dataset and benchmark composed of videos recorded in crowded city scenes. Our method can be viewed as a generalization of the Hough transform over the rotation space. In short, optical flow vectors at distant points provide consistent evidence for the correct rotation, while flow vectors influenced by translation, scene geometry, moving objects, and noise do not produce a consistent estimate. By accumulating evidence across the rotation space, we recover the rotation with the strongest support. Unlike the commonly used RANSAC algorithm, the runtime of our method is independent of the proportion of moving objects, which provides a significant speedup when the outlier rate is high.
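To make the voting idea concrete, here is a minimal sketch of Hough-style rotation estimation from optical flow. It assumes normalized image coordinates, the standard instantaneous (Longuet-Higgins) rotational flow model, and a brute-force grid of candidate rotations; the function names, grid, and inlier tolerance are illustrative choices, not the parameterization or accumulation scheme used in the thesis.

```python
import numpy as np

def rotational_flow(pts, omega):
    # Instantaneous flow induced at normalized image points pts (N, 2) by a
    # small pure rotation omega = (wx, wy, wz); signs follow the usual
    # Longuet-Higgins convention and depend on the camera coordinate frame.
    x, y = pts[:, 0], pts[:, 1]
    wx, wy, wz = omega
    u = x * y * wx - (1.0 + x ** 2) * wy + y * wz
    v = (1.0 + y ** 2) * wx - x * y * wy - x * wz
    return np.stack([u, v], axis=1)

def hough_rotation(pts, flow, candidates, tol=2e-3):
    # Each candidate rotation accumulates one vote per flow vector it explains
    # to within tol. Flow from moving objects, translation, scene geometry, or
    # noise scatters its votes, so the peak still marks the camera rotation,
    # and the cost is a flat O(#points * #candidates) whatever the outlier rate.
    best_omega, best_votes = None, -1
    for omega in candidates:
        residuals = np.linalg.norm(flow - rotational_flow(pts, omega), axis=1)
        votes = int((residuals < tol).sum())
        if votes > best_votes:
            best_omega, best_votes = omega, votes
    return best_omega, best_votes

# Tiny synthetic check: vote over a coarse grid of small per-frame rotations
# (radians) on flow generated from a known rotation plus heavy outliers.
rng = np.random.default_rng(0)
axis = np.linspace(-0.05, 0.05, 21)
candidates = np.stack(np.meshgrid(axis, axis, axis), -1).reshape(-1, 3)
pts = rng.uniform(-1.0, 1.0, size=(500, 2))
true_omega = np.array([0.01, -0.02, 0.005])
flow = rotational_flow(pts, true_omega)
flow[:300] += rng.normal(0, 0.05, size=(300, 2))  # 60% outlier flow vectors
omega_hat, support = hough_rotation(pts, flow, candidates)
```

Note how this contrasts with RANSAC: the loop visits every candidate exactly once rather than sampling hypotheses until an all-inlier set is found, which is why the runtime does not grow with the fraction of outliers.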
Second, we propose a method to temporally align multiple videos recorded in the same scene. When videos are captured from different viewpoints without precise synchronization, aligning them after the fact can be difficult. The metadata or audio needed to do so may be missing or unreliable. Human motion, however, provides a strong cue for identifying matching time points across videos through pose and movement. In this part, we leverage view-invariant human pose features to synchronize videos. Unlike previous human pose-based alignment methods, our approach can align videos containing multiple people without tracking or re-identification across views. We achieve this by aggregating pose information from multiple people into a single frame descriptor. This also enables an efficient O(n log n) search for the optimal alignment. This simple but effective strategy leads to major and consistent improvements over existing human-based and visual-feature-based temporal alignment methods.
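The two key ingredients, an order-invariant per-frame descriptor and an O(n log n) offset search, can be sketched as follows. This is a hypothetical illustration: mean pooling and FFT cross-correlation are one natural way to realize the properties the abstract describes, not necessarily the aggregation or search used in the thesis, and in practice one would mean-center the descriptor sequences before correlating.

```python
import numpy as np

def frame_descriptor(pose_embeddings, dim):
    # Pool the view-invariant pose embeddings of everyone visible in a frame
    # into one order-invariant vector (mean pooling here, as an assumption);
    # no tracking or re-identification across frames or views is needed.
    if len(pose_embeddings) == 0:
        return np.zeros(dim, dtype=np.float32)
    return np.asarray(pose_embeddings, dtype=np.float32).mean(axis=0)

def best_offset(seq_a, seq_b):
    # FFT cross-correlation of two descriptor sequences (T, dim), summed over
    # descriptor dimensions; this makes the offset search O(n log n) instead
    # of the O(n^2) cost of sliding one sequence over the other.
    n = seq_a.shape[0] + seq_b.shape[0]      # zero-pad to avoid wraparound
    A = np.fft.rfft(seq_a, n=n, axis=0)
    B = np.fft.rfft(seq_b, n=n, axis=0)
    corr = np.fft.irfft(A * np.conj(B), n=n, axis=0).sum(axis=1)
    lags = np.arange(n)
    lags[lags > n // 2] -= n                 # map indices to signed lags
    k = lags[np.argmax(corr)]
    return k  # seq_a[t] best matches seq_b[t - k]
```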
Finally, we study control directly from visual motion through the task of self-balancing. Rather than estimating ego-motion as an intermediate quantity and then using that estimate for control, we ask whether motion-based visual observations can support closed-loop control directly. To do so, we introduce a new 3D visual self-balancing environment derived from the inverted pendulum, in which a camera is rigidly mounted to the pole and the agent has no access to the ground-truth state. We use image correspondences as the observation modality, making motion explicit while reducing dependence on raw appearance. In this setting, we show that a controller trained from visual correspondences can learn to stabilize the system, and we analyze how performance depends on observation rate, latency, and access to the previous control input.
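As a concrete illustration of the observation modality, here is a hypothetical sketch of how matched keypoints between two consecutive frames could be packed into a fixed-size policy input. The sampling scheme, the (x, y, dx, dy) encoding, and the handling of the previous action are assumptions for illustration, not the environment's actual interface.

```python
import numpy as np

def correspondence_observation(kpts_prev, kpts_curr, k=32, rng=None):
    # Pack k matched keypoints between consecutive frames (each an (N, 2)
    # array of matched positions) into a fixed-size vector of (x, y, dx, dy)
    # rows: where each point is and how it moved. Raw appearance never
    # enters the observation, only motion.
    rng = rng or np.random.default_rng()
    n = min(k, len(kpts_prev))
    idx = rng.choice(len(kpts_prev), size=n, replace=False)
    obs = np.zeros((k, 4), dtype=np.float32)
    obs[:n, :2] = kpts_prev[idx]
    obs[:n, 2:] = kpts_curr[idx] - kpts_prev[idx]
    return obs.ravel()

def policy_input(obs, prev_action):
    # The ablations mentioned in the abstract include access to the previous
    # control input; concatenating it to the observation is one simple way
    # to supply it to the controller.
    return np.concatenate([obs, np.atleast_1d(prev_action).astype(np.float32)])
```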
Advisor:
Erik Learned-Miller