Distinguishing subtle differences in attributes is valuable, yet learning to make visual comparisons remains nontrivial. Not only is the number of possible comparisons quadratic in the number of training images, but also access to images adequately spanning the space of fine-grained visual differences is limited. We propose to overcome the sparsity of supervision problem via synthetically generated images. Building on a state-of-the-art image generation engine, we sample pairs of training images exhibiting slight modifications of individual attributes. Augmenting real training image pairs with these examples, we then train attribute ranking models to predict the relative strength of an attribute in novel pairs of real images. Our results on datasets of faces and fashion images show the great promise of bootstrapping imperfect image generators to counteract sample sparsity for ranking.
Fine-Grained Visual Comparisons: Given two images and a visual attribute, determine which image exhibits more of the attribute. The "fine-grained" case refers to scenarios where the two images are highly similar with respect to the target attribute.
Fine-grained analysis of images often entails making visual comparisons. For example, given two products in a fashion catalog, a shopper may judge which shoe appears more sporty. Given two sunset photos from a vacation, a user may determine which one to keep or delete. In these and many other such cases, we are interested in inferring how a pair of images compares in terms of a particular property, or "attribute". Importantly, the distinctions of interest are often quite subtle.
While existing research focuses on developing better ranking models trained on pairs of labeled images, we approach the problem orthogonally, from the data perspective. We hypothesize that fine-grained learning is fundamentally limited by the sparsity of supervision. This sparsity stems from two factors:
Problem Statement: To densify the attribute space using synthetic image pairs to improve supervision for fine-grained learning.
We propose to use synthetic image pairs to overcome the sparsity of supervision problem when learning to compare images. The main idea is to synthesize plausible images exhibiting variations along a given attribute from a generative model, thereby recovering samples in regions of the attribute space that are underrepresented among the real training images. After (optionally) verifying the comparative labels with human annotators, we train a discriminative ranking model using the synthetic training pairs in conjunction with real image pairs. The resulting model predicts attribute comparisons between novel pairs of real images.
Important: Our approach is model independent. We modify only the composition of the training data and nothing else.
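The data-side nature of the approach can be sketched in a few lines. The example below is a minimal illustration, not the pipeline used in this work: it assumes image features are already extracted as vectors, uses a plain linear ranker trained with a pairwise hinge loss as a stand-in for the actual ranking models, and the names `compose_training_set` and `train_linear_ranker` are hypothetical. Only the composition of the training pairs changes; the ranker itself is untouched.

```python
import numpy as np

def compose_training_set(real_pairs, synthetic_pairs, verified=None):
    """Merge real and synthetic pairs (hypothetical helper).

    Each pair is (x_more, x_less): feature vectors of the image with
    more / less of the attribute. `verified` is an optional list of
    booleans standing in for human verification of synthetic labels.
    """
    if verified is not None:
        synthetic_pairs = [p for p, ok in zip(synthetic_pairs, verified) if ok]
    return list(real_pairs) + list(synthetic_pairs)

def train_linear_ranker(pairs, dim, epochs=200, lr=0.1, reg=1e-3):
    """Fit w so that w @ x_more > w @ x_less for the training pairs.

    Subgradient descent on: reg/2 * ||w||^2 + mean(max(0, 1 - w @ d)),
    where d = x_more - x_less. A simple stand-in ranker (assumption).
    """
    w = np.zeros(dim)
    diffs = np.array([x_more - x_less for x_more, x_less in pairs])
    for _ in range(epochs):
        margins = diffs @ w
        violated = diffs[margins < 1.0]  # pairs not yet ranked with margin 1
        grad = reg * w - violated.sum(axis=0) / len(diffs)
        w -= lr * grad
    return w
```

Swapping `train_linear_ranker` for any other pairwise ranking model leaves `compose_training_set` unchanged, which is the sense in which the idea is model independent.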
The key to improving coverage in the attribute space is the ability to generate images that differ subtly with respect to the given attribute while all other attributes are held constant. In other words, we want to walk semantically in the high-level attribute space.
For this task, we adopt an existing attribute-conditioned image generation engine called Attribute2Image [Yan et al. 16]. Given a set of attributes and some latent factors, which together form a synthetic identity (red boxes), we can generate an entire spectrum of images exhibiting subtle differences. Using these generated spectra, we form image pairs through inter- and intra-identity sampling.
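The two sampling modes can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the generator is replaced by a stand-in where each "image" is just an `(identity, strength)` tuple, intra-identity pairs are taken from neighbouring positions in a spectrum, inter-identity pairs are drawn at random, and labels come directly from the conditioning strength (in practice they may be verified by annotators). The exact sampling scheme used in this work may differ.

```python
import random

def make_spectra(num_identities, strengths):
    """Stand-in for the image generator: one spectrum per synthetic identity.

    Each 'image' is represented only by (identity, attribute_strength).
    """
    return {i: [(i, s) for s in strengths] for i in range(num_identities)}

def label(a, b):
    """+1 if image a has more of the attribute than image b, else -1."""
    return 1 if a[1] > b[1] else -1

def intra_identity_pairs(spectra):
    """Pair neighbouring images within the same identity's spectrum."""
    pairs = []
    for images in spectra.values():
        for a, b in zip(images, images[1:]):
            pairs.append((a, b, label(a, b)))
    return pairs

def inter_identity_pairs(spectra, num_pairs, rng):
    """Pair images drawn from two different identities.

    Assumes at least two identities and two distinct strengths.
    """
    ids = list(spectra)
    pairs = []
    while len(pairs) < num_pairs:
        i, j = rng.sample(ids, 2)  # two distinct identities
        a, b = rng.choice(spectra[i]), rng.choice(spectra[j])
        if a[1] != b[1]:  # skip ties, which carry no ordering
            pairs.append((a, b, label(a, b)))
    return pairs
```

Intra-identity pairs isolate the attribute change, since everything else about the identity is fixed, while inter-identity pairs expose the ranker to the attribute varying alongside other appearance factors.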
Our idea can be seen as semantic "jittering" of the data to augment real image training sets with nearby variations. The systematic perturbation of images through label-preserving transforms like mirroring/scaling is now common practice in training deep networks for classification. Whereas such low-level image manipulations are performed independently of the semantic content of the training instance, the variations introduced by our approach are high-level changes that affect the very meaning of the image, e.g., facial shape changes as the expression changes. In other words, our jitter has a semantic basis rather than a purely geometric/photometric basis.
In keeping with our objective of fine-grained learning, we expand upon the existing UT-Zappos50K shoe dataset by (1) crowdsourcing a new lexicon of fine-grained attributes, and (2) collecting over 3 times more pairwise relative labels for each attribute. We use these data as the real shoe pairs in our analysis. Please visit our dataset page for more information.
As mentioned above, our approach is model independent. In this work, we experiment with two state-of-the-art ranking models from the attributes literature: RankSVM with local learning [Yu & Grauman 14] and DeepCNN with spatial transformer [Singh & Lee 16]. The figure above illustrates the different types of input compositions that we experiment with.
Observation: Our approach of densifying the training data using synthetic image pairs outperforms all baselines on all attributes from two domains: shoes and faces. Even with 2x the real data, the state-of-the-art models fail to predict fine-grained differences as well as when our synthetic data are added, demonstrating the importance of having a dense set of training data (results from the deep model shown above). Overall, our gains are significant, considering they are achieved without any changes to the underlying ranking models, the features, or the experimental setup.
The following are the data used in this project, including the synthetic images and their pairwise labels. Refer to the UT-Zappos50K dataset page for the comprehensive data on the real shoe images. Please contact us directly for pre-trained models.
A. Yu and K. Grauman. "Semantic Jitter: Dense Supervision for Visual Comparisons via Synthetic Images". In ICCV, 2017. [bibtex]
@InProceedings{semjitter,
author = {A. Yu and K. Grauman},
title = {Semantic Jitter: Dense Supervision for Visual Comparisons via Synthetic Images},
booktitle = {International Conference on Computer Vision (ICCV)},
month = {Oct},
year = {2017}
}