Understanding of Visual Domains via the Lens of Natural Language

Tuesday, 05/26/2020 3:00pm to 5:00pm
Zoom Meeting
PhD Dissertation Proposal Defense
Speaker: Chenyun Wu

Zoom meeting: https://umass-amherst.zoom.us/u/a9aiAaTk6


The joint understanding of vision and language is essential for an intelligent system to perceive, act, and communicate with humans in a wide range of applications. For example, a robot can interact with a human and help them navigate a scene, or an automatic system can edit the visual content of images or videos in response to natural language commands. In this thesis, we aim to improve our understanding of visual domains through the lens of natural language descriptions. We specifically look into four visual domains: (1) images of categories within a fine-grained taxonomy, (2) images of textures, which describe local patterns, (3) objects and stuff regions in natural images, and (4) moments in videos (as future work). We demonstrate that aligning visual representations with language enables applications such as image retrieval and editing, as well as fine-grained classification with naturally interpretable models.

While the representations vary across domains, we address common challenges that arise when combining vision and language. In one line of work, we investigate how one can discover domain-specific language by asking humans to describe differences between visual instances across categories. We then show that systems trained to describe differences between images yield a more interpretable basis for classification and other tasks. In another line of work, we design a system that allows image segmentation using natural language descriptions. Unlike in standard benchmarks for object detection or semantic segmentation, a key challenge here is the long-tailed distribution of concepts, which leaves little training data for the vast majority of them. We design a modular framework that integrates object detection, semantic segmentation, and contextual reasoning with language, and that leads to better performance. We also contribute a large-scale dataset for this task. Our ongoing work investigates the effectiveness of language and vision models on the domain of textures, which despite their ubiquity have not been sufficiently studied in the literature. Textures are diverse, yet due to their local nature, one can systematically vary them to create realistic synthetic examples, which allows us to investigate how disentangled visual representations are. A natural-language-based understanding of textures also allows us to describe attributes used by deep networks for fine-grained classification, where texture plays a key role. Future work will investigate how language can be used for temporal localization in videos.

Advisor: Subhransu Maji