Higher-Order Representations for Visual Recognition

Thursday, 12/05/2019 2:00pm to 4:00pm
PhD Thesis Defense
Speaker: Tsung-Yu Lin

Orderless feature aggregation with nonlinear encodings, such as the Fisher Vector representation, has been shown to be effective with hand-crafted local image features in various image recognition tasks. The encoding captures higher-order statistics of a set of feature activations; however, the feature descriptors are not learned to optimize the end task. In this thesis, we present simple and effective encoding models, called Bilinear Convolutional Neural Networks (B-CNNs), that capture the correlations between the activations of feature descriptors derived from CNNs. The models belong to the class of orderless texture representations but, unlike prior work, can be trained in an end-to-end manner. They outperformed the previous state of the art on fine-grained and texture recognition tasks.
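As a minimal sketch of the pooling step described above: bilinear pooling aggregates the outer products of local descriptors across spatial locations, discarding their order. The function below is illustrative (names and the choice of NumPy are assumptions, not the thesis implementation); the signed square root and l2 normalization are the standard post-processing steps used with such representations.

```python
import numpy as np

def bilinear_pool(feat):
    """Bilinear (second-order) pooling of a CNN feature map.

    feat: array of shape (H, W, D) -- a D-dimensional descriptor at
    each spatial location. Returns a D*D vector of orderless
    second-order statistics (correlations between feature channels)."""
    H, W, D = feat.shape
    X = feat.reshape(H * W, D)            # one descriptor per location
    pooled = X.T @ X / (H * W)            # average of outer products
    v = pooled.reshape(-1)                # flatten to a D*D vector
    v = np.sign(v) * np.sqrt(np.abs(v))   # signed square-root scaling
    v = v / (np.linalg.norm(v) + 1e-12)   # l2 normalization
    return v
```

Because the locations are summed over, the representation is invariant to spatial permutations of the descriptors, which is what makes it a texture (orderless) representation.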

To understand these models, we visualize the convolutional filters and the classifiers of the fine-tuned networks. Visualizing the top-activating patches for the learned CNN filters demonstrates that the models capture highly localized attributes. At the classifier level, we visualize the invariances of these models by inverting the representations and producing preimages that reveal the properties the models capture for a given category.

Finally, we study techniques for rescaling the importance of individual features during aggregation to enhance the discriminative power of the representations. Spectral normalization scales the spectrum of the covariance matrix obtained after bilinear pooling and offers a significant improvement; however, the operation is not computationally efficient on modern GPUs. We present an iteration-based approximation of the matrix square root, along with its gradients, to speed up the computation, and study its effect when fine-tuning deep networks. An alternative approach based on democratic aggregation achieves a comparable improvement; because the aggregation can be approximated in a low-dimensional embedding, it scales well to higher-dimensional features. We show that the two approaches are closely related and discuss the trade-off between performance and efficiency.
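One standard iteration-based approximation of the matrix square root, of the kind referred to above, is the coupled Newton-Schulz iteration: it uses only matrix multiplications, so it maps well onto GPUs, unlike an eigendecomposition. The sketch below (NumPy, illustrative names, not the thesis code) computes the square root of a symmetric positive-definite matrix after pre-scaling it so the iteration converges.

```python
import numpy as np

def newton_schulz_sqrt(A, num_iters=20):
    """Approximate the square root of an SPD matrix A via coupled
    Newton-Schulz iterations. Uses only matrix multiplies, so it is
    GPU-friendly and differentiable, unlike an eigendecomposition."""
    n = A.shape[0]
    norm = np.linalg.norm(A)       # Frobenius norm for pre-scaling
    Y = A / norm                   # eigenvalues now lie in (0, 1]
    Z = np.eye(n)
    I = np.eye(n)
    for _ in range(num_iters):
        T = 0.5 * (3.0 * I - Z @ Y)
        Y = Y @ T                  # Y converges to sqrt(A / norm)
        Z = T @ Z                  # Z converges to inverse sqrt(A / norm)
    return np.sqrt(norm) * Y       # undo the pre-scaling
```

Since each step is composed of matrix products, gradients can be obtained by ordinary backpropagation through the iterations, which is what makes the approximation usable during end-to-end fine-tuning.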

Advisor: Subhransu Maji