Speaker

Oindrila Saha

Abstract

Enabling machines to perform visual reasoning has been a long-standing challenge. While humans can understand and recognize concepts with just a few examples, deep learning models require large-scale datasets to learn effectively. The cost of obtaining precise annotations at scale becomes prohibitive for tasks that demand expert knowledge.

These tasks, which fall under the umbrella term of fine-grained reasoning, remain a significant challenge for machine learning systems. Fine-grained reasoning encompasses a variety of problems that require complex and detailed understanding, ranging from object-level tasks, such as classifying images of visually similar bird species, and spatially fine-grained problems, such as identifying individual parts within objects, to scene-level tasks, such as composing images with multiple distinct objects. To advance progress in this domain, methods that minimize reliance on costly supervision while maintaining or enhancing fine-grained reasoning are essential.

This thesis advances fine-grained reasoning under limited human supervision through several complementary approaches. Since methods for learning representations from unlabeled data can be trained in either a generative or a discriminative manner, we first explore which of these representations offers richer features for downstream discriminative tasks. Substantial recent progress in unsupervised representation learning along both directions, driven by the rise of large neural networks, highlights the need for a new, systematic comparison. We evaluate these strategies for few-shot part segmentation, identifying strengths and trade-offs with respect to performance, robustness, and computational demands. We then introduce a novel method to discover and contrast object parts within images, enhancing both classification and segmentation accuracy. Next, we show how integrating coarse annotation modalities, such as keypoint or foreground-background labels, can improve dense part segmentation beyond what is achievable using only limited dense annotations. Subsequently, we explore natural language as a powerful source of fine-grained information, leveraging large language models to generate weakly supervised text descriptions that adapt vision-language representations, leading to better classification on fine-grained tasks. Furthermore, we explore methods that exploit joint structure across the image and text modalities to improve classification without any ground-truth labels. Finally, we focus on the scene-level fine-grained task of compositional image generation, where we tackle the challenge of creating synthetic data at scale with the help of both generative and discriminative models.

Advisor

Subhransu Maji