Speaker:

Ashish Singh

Abstract:

As modern computer vision systems become increasingly pervasive in our daily lives, we expect them to perform robustly and reliably in real-world deployments. This is challenging because most methods are developed under a closed-world setting, in which situations not encountered during model training are not expected at test time. In practice, deployed systems routinely face open-world conditions—previously unseen object classes, events, and scene configurations—under which closed-world assumptions break down, leading to overconfident errors. Bridging this gap requires augmenting existing pipelines with mechanisms that detect and characterize novelties as they arise.

In this thesis, I address this challenge across several computer vision problems by developing methods that adapt standard models to the open world using side information—auxiliary, label-agnostic cues often available without manual supervision. The central idea is to leverage the structure already present in data so that models can detect and characterize novel samples rather than misclassify them.

First, for video analysis, I show that we can accurately detect and characterize novel events by augmenting a nearest-neighbor approach with location-conditioned motion and appearance attributes. Concretely, for a scene type (e.g., traffic), I first learn general, interpretable attribute encoders that capture appearance and motion. For a new target scene, I keep these encoders fixed and build a lightweight, location-dependent exemplar model of nominal behavior from the attribute embeddings. At inference time, deviations from this exemplar model are flagged as novel and reported with interpretable rationales. This design adapts readily to the open world: rather than training a new model from scratch for each target scene, we only need to compute the exemplar set for that scene.
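To make the exemplar idea concrete, here is a minimal sketch (not the thesis's actual implementation; the location keys, embedding dimensions, and function names are hypothetical): nominal attribute embeddings are stored per spatial location, and a test embedding is scored by its distance to the nearest stored exemplar, with large distances indicating novelty.

```python
import numpy as np

def build_exemplar_model(embeddings_by_location):
    """Store nominal attribute embeddings per spatial location (the exemplar set)."""
    return {loc: np.asarray(embs, dtype=float)
            for loc, embs in embeddings_by_location.items()}

def novelty_score(model, loc, query_emb):
    """Distance to the nearest nominal exemplar at this location; larger = more novel."""
    exemplars = model[loc]
    dists = np.linalg.norm(exemplars - query_emb, axis=1)
    return float(dists.min())

# Toy usage: nominal motion/appearance embeddings observed at one location.
model = build_exemplar_model({(0, 0): [[0.0, 1.0], [0.1, 0.9]]})
nominal = novelty_score(model, (0, 0), np.array([0.05, 0.95]))  # small score
novel = novelty_score(model, (0, 0), np.array([5.0, -3.0]))     # large score
```

Note that "training" on a new scene reduces to populating the dictionary, which is what makes the design cheap to adapt.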

Next, I show that we can improve open-world object localization by leveraging non-objectness cues mined in an unsupervised manner. Specifically, I first identify representative non-object regions in training images using a greedy nearest-neighbor approach. This compact codebook of background exemplars then serves as a set of high-precision negatives for training the detector. I show that our approach improves recall for localizing objects from unseen classes, without any manual supervision.
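One plausible reading of the greedy nearest-neighbor mining step is a coverage-style selection over region features: repeatedly add the region whose nearest already-chosen exemplar is farthest away, so a small codebook spans the background variation. The sketch below is an illustrative assumption (the function name, feature format, seeding choice, and farthest-point criterion are all hypothetical), not the thesis's method.

```python
import numpy as np

def greedy_codebook(features, k):
    """Greedily select k representative feature vectors.

    At each step, add the candidate whose nearest-neighbor distance to the
    already-chosen set is largest, so the codebook compactly covers the
    candidate background regions.
    """
    feats = np.asarray(features, dtype=float)
    chosen = [0]  # seed with the first region (arbitrary choice)
    for _ in range(k - 1):
        # Nearest-neighbor distance from every candidate to the chosen set.
        d = np.min(np.linalg.norm(feats[:, None] - feats[chosen][None], axis=2), axis=1)
        chosen.append(int(d.argmax()))
    return feats[chosen]

# Toy usage: 2-D region features forming two clusters; pick 2 exemplars.
regions = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
codebook = greedy_codebook(regions, 2)  # one exemplar from each cluster
```

The selected exemplars would then be treated as negatives when training the detector.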

Finally, I propose to address novelty detection when a hierarchical taxonomy is available. Here, the goal is not only to detect novel samples at test time but also to characterize them using the taxonomy’s structure, yielding more informative, human-interpretable outputs.

Advisor:

Erik Learned-Miller