
Improving Face Clustering in Videos

Monday, 12/02/2019 1:30pm to 3:30pm
A311 LGRC
Ph.D. Thesis Defense
Speaker: SouYoung Jin

Human faces not only pose a challenging recognition problem for computer vision, but are also an important source of information about identity, intent, and state of mind. These properties make the analysis of faces important not just as an algorithmic challenge, but as a gateway to developing computer vision methods that can better follow the intent and goals of human beings. In this thesis, we are interested in face clustering in videos: given a raw video, with no captions or annotations, we want to group all detected faces by identity. We address three problems in the area of face clustering and propose approaches to tackle them.

Existing linkage-based face-clustering systems are sensitive to false connections between two different people. We introduce a new similarity measure that yields very few false connections while maintaining moderate recall. We further introduce a novel clustering method, Erdős-Rényi clustering, based on an observation from random graph theory: large clusters can be fully connected by joining just a small fraction of their node pairs. Our method achieves state-of-the-art results on multiple video datasets as well as on standard face databases.
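The random-graph observation above can be illustrated with a minimal sketch (the function names and threshold here are hypothetical, not the thesis implementation): faces are joined only when a conservative, high-precision verifier says two faces match, and transitive closure via union-find lets large clusters form even though most within-cluster pairs are never directly linked.

```python
# Illustrative sketch: linkage clustering with a conservative pairwise
# verifier. `linkage_cluster` and `thresh` are hypothetical names.

def linkage_cluster(sim, thresh):
    """Group items whose pairwise similarity exceeds `thresh`,
    taking the transitive closure with union-find."""
    n = len(sim)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Join only high-confidence pairs; low per-pair recall is tolerable,
    # since a large cluster needs only a small fraction of its edges
    # to become fully connected.
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= thresh:
                union(i, j)

    roots, labels = {}, []
    for i in range(n):
        labels.append(roots.setdefault(find(i), len(roots)))
    return labels
```

Even if only a small fraction of the within-identity pairs pass the threshold, the resulting graph component is typically still connected, which is the random-graph property the clustering method exploits.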

What happens when faces are not clear enough for direct recognition, due to small scale, occlusion, or extreme pose? We observe that when humans are uncertain about the identity of two faces, we use clothing or other contextual cues, e.g., specific objects or textures, to infer identity. Motivated by this observation, we propose the Face-Background Network (FB-Net), which takes as input not only the faces but also the entire scene to improve face clustering. To enable the network to learn background features that are informative about identity, we introduce a new dataset that contains face identities in the context of consistent scenes. We show that FB-Net outperforms the state-of-the-art method, which uses face-level features alone, on the task of video face clustering.
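FB-Net's architecture is not detailed in this abstract; as a rough sketch of the underlying idea, one could fuse a face embedding with a scene embedding and compare the fused vectors. The mixing weight `alpha` below is a hypothetical fixed parameter, whereas the real model learns the combination end to end.

```python
import math

def fuse(face_emb, scene_emb, alpha=0.7):
    """Concatenate weighted face and scene embeddings.
    `alpha` is a hypothetical mixing weight; FB-Net learns the fusion."""
    return ([alpha * x for x in face_emb] +
            [(1 - alpha) * x for x in scene_emb])

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)
```

The point of the fusion: when two blurry faces have low face-only similarity but appear against the same consistent background, the scene term can raise their combined similarity enough to cluster them together.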

The performance of face clustering depends on a good face detector, but improving a face detector requires expensive labeling of faces. In this work, we propose an approach that reduces the mistakes of an existing face detector by using many hours of freely available unlabeled videos from the web. Building on the observation that false positives and false negatives are often isolated in time, we demonstrate a method for mining hard examples automatically using temporal continuity in videos. Specifically, we analyze the output of a trained detector on video sequences and mine detections that are isolated in time, which are likely to be hard examples. Our experiments show that re-training detectors on these automatically mined examples often significantly improves performance. We present experiments with multiple architectures and multiple datasets, including face detection, pedestrian detection, and other object categories.
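The temporal-continuity heuristic can be sketched as follows. This is a deliberately simplified, hypothetical version: the real method must associate detections spatially across frames, not just read per-frame flags. A detection that fires in one frame but in neither neighbor is a candidate false positive (a hard negative for re-training); a one-frame gap inside a run of detections is a candidate false negative (a hard positive).

```python
def mine_hard_examples(detected):
    """`detected[t]` is True if the detector fired in frame t of a track.
    Returns (isolated_hits, isolated_misses): frame indices whose
    detection status disagrees with both temporal neighbors."""
    hits, misses = [], []
    for t in range(1, len(detected) - 1):
        prev, cur, nxt = detected[t - 1], detected[t], detected[t + 1]
        if cur and not prev and not nxt:
            hits.append(t)     # likely false positive -> hard negative
        elif not cur and prev and nxt:
            misses.append(t)   # likely false negative -> hard positive
    return hits, misses
```

The mined frames can then be added to the training set with the temporally consistent label, and the detector re-trained, with no manual annotation required.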

Advisor: Erik Learned-Miller