Keywords to Visual Categories: Multiple-Instance Learning for Weakly Supervised Object Categorization

Sudheendra Vijayanarasimhan and Kristen Grauman
Department of Computer Sciences,
University of Texas at Austin


Introduction

Figure: Example images from a popular labeled dataset for the category 'face' (left) compared against example results from a Google image search for the same category (right).

Carefully prepared labeled datasets are useful for learning visual category models, but they are expensive to obtain. On the other hand, images on the Web indexed using keywords are available in large quantities, but do not necessarily portray the query term and are much more variable than typical benchmark training sets. How can we learn reliable category models from images returned by keyword search?

Algorithm Overview

Given just the names of the categories, we directly obtain discriminative models by casting the problem as multiple-instance visual category learning.

Multiple-instance learning (MIL) considers a particular form of weak supervision where the classifier is given positive bags, which are sets of instances containing at least one positive example, and negative bags, which are sets containing no positive examples.
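To make the bag semantics concrete, here is a minimal sketch (for illustration only, not the actual classifier used in our method) of how a bag-level prediction follows from instance-level scores: a bag is predicted positive if any one of its instances scores positive.

    # Minimal sketch of multiple-instance bag semantics (illustrative only).
    # A "bag" is a list of instance feature vectors; a positive bag is assumed
    # to contain at least one true positive instance.

    def predict_bag(instance_scorer, bag):
        """A bag is predicted positive if any of its instances scores positive."""
        return max(instance_scorer(x) for x in bag) > 0

    # Toy instance scorer: score = sum of feature values minus a threshold.
    toy_scorer = lambda x: sum(x) - 1.0
    positive_bag = [[0.1, 0.2], [0.9, 0.8]]   # contains one "good" instance
    negative_bag = [[0.1, 0.2], [0.3, 0.1]]   # contains no positive instances

    print(predict_bag(toy_scorer, positive_bag))   # True
    print(predict_bag(toy_scorer, negative_bag))   # False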

A single search for a keyword of interest yields a collection of images within which (we assume) at least one image depicts that object, thus comprising a positive bag. This enables us to pose the problem of learning classifiers from web searches as a multiple-instance learning problem.

On average, 30% of the images returned by Google for a small set of categories contain a "good" view of the class of interest, 20% are of "ok" quality, and 50% are completely unrelated "junk" (as judged in [Fergus et al. 2005]). By explicitly accounting for the fact that there could be as few as one good example in the set of returned images, we are able to cope with the large amount of noise in the training data.
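As an illustration of how search returns map onto bags, the sketch below treats each (engine, language) query for a keyword as one noisy positive bag. The downloader and feature extractor here are stubs standing in for a real image-search API and a local-feature pipeline, and the exact bag-formation scheme in our method may partition the returns differently.

    # Illustrative sketch: turning keyword-search returns into MIL bags.

    def image_search_stub(keyword, engine, language, n=5):
        """Stand-in for a real image-search API; returns fake image identifiers."""
        return ["%s-%s-%s-%d" % (engine, language, keyword, i) for i in range(n)]

    def extract_features_stub(image_id):
        """Stand-in for local-feature extraction; returns a dummy descriptor."""
        return [float(len(image_id) % 7), 1.0]

    def build_positive_bags(keyword, engines, languages):
        """One noisy positive bag per (engine, language) query for the keyword."""
        bags = []
        for engine in engines:
            for language in languages:
                images = image_search_stub(keyword, engine, language)
                bags.append([extract_features_stub(img) for img in images])
        return bags

    bags = build_positive_bags("face", ["google", "yahoo", "msn"], ["en", "fr", "de"])
    print(len(bags), "positive bags,", len(bags[0]), "instances each")   # 9 bags, 5 each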


(a) Given a category name, our method automatically collects noisy 'positive bags' of instances via keyword-based image search on multiple search engines in multiple languages, and learns a multiple-instance learning (MIL) classifier.


(b) From the initial MIL solution, the classifier improves itself by iteratively updating the representation of the training bags.

The main steps of the algorithm are summarized in panels (a) and (b) above.
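One plausible instantiation of the iterative step in (b) is sketched below: train a classifier on a bag-level representation, re-score the instances in each positive bag, and shift every bag's representation toward the instances the current classifier finds most category-like. The particular update rule is an assumption made for illustration rather than the exact rule we use; train_classifier and score_instance stand in for any standard supervised learner (e.g., an SVM) and its decision function.

    # Rough sketch of an iterative MIL refinement loop (update rule is illustrative).
    import numpy as np

    def bag_representation(bag, weights):
        """Weighted average of the instances in a bag (one possible choice)."""
        w = np.asarray(weights, dtype=float)
        w = w / w.sum()
        return (np.asarray(bag, dtype=float) * w[:, None]).sum(axis=0)

    def iterative_mil(pos_bags, neg_bags, train_classifier, score_instance, n_iters=5):
        """Alternate between training on bag representations and re-weighting instances."""
        pos_weights = [np.ones(len(b)) for b in pos_bags]               # start uniform
        neg_reps = [np.asarray(b, dtype=float).mean(axis=0) for b in neg_bags]
        clf = None
        for _ in range(n_iters):
            pos_reps = [bag_representation(b, w) for b, w in zip(pos_bags, pos_weights)]
            X = np.vstack(pos_reps + neg_reps)
            y = np.array([1] * len(pos_reps) + [0] * len(neg_reps))
            clf = train_classifier(X, y)
            # Re-weight instances in each positive bag by classifier confidence, so the
            # bag representation shifts toward likely true positives on the next round.
            pos_weights = [
                np.array([max(score_instance(clf, x), 1e-6) for x in b]) for b in pos_bags
            ]
        return clf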

Results

We evaluate our method on two kinds of tasks: categorizing novel examples, and re-ranking keyword-search returns according to predicted relevance, using the publicly available Google dataset ([Fergus et al. 2005]) and the Animals dataset ([Berg et al. 2005]). We download training examples for each category name from Google, Yahoo, and MSN image search in five languages, represent images as bags of local SIFT features extracted with four interest operators (Harris-Affine, DoG, edges, Kadir & Brady), and use an RBF kernel.
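The sketch below shows one common way to compare images described by sets of local descriptors: quantize the descriptors against a visual vocabulary into a histogram, then apply an RBF kernel to the histograms. The random vocabulary and the normalization here are illustrative stand-ins; the exact set representation and kernel parameters in our experiments may differ.

    # Sketch: bag-of-visual-words histograms over local descriptors + RBF kernel.
    import numpy as np

    def bow_histogram(descriptors, vocabulary):
        """Assign each local descriptor to its nearest visual word and count."""
        d = np.asarray(descriptors)            # (num_features, dim)
        v = np.asarray(vocabulary)             # (num_words, dim)
        dists = ((d[:, None, :] - v[None, :, :]) ** 2).sum(axis=2)
        words = dists.argmin(axis=1)
        hist = np.bincount(words, minlength=len(v)).astype(float)
        return hist / max(hist.sum(), 1.0)

    def rbf_kernel(h1, h2, gamma=1.0):
        return np.exp(-gamma * ((h1 - h2) ** 2).sum())

    # Toy example with random "SIFT-like" descriptors and a random vocabulary.
    rng = np.random.default_rng(0)
    vocab = rng.normal(size=(50, 128))
    img_a = rng.normal(size=(200, 128))
    img_b = rng.normal(size=(150, 128))
    print(rbf_kernel(bow_histogram(img_a, vocab), bow_histogram(img_b, vocab)))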

Figure: Results on categorizing novel examples.

(a) Comparison of the AUROC for our unsupervised approach and the state-of-the-art supervised approaches on the Caltech-7 dataset.


(b) Comparison of our approach against the unsupervised approach of [Fergus et al. 2005].

Our results improve on the state of the art in unsupervised recognition, and are even competitive with fully supervised techniques.

Figure: Results on re-ranking search returns according to predicted relevance.

(a) Precision at 100-image recall on the Google dataset ([Fergus et al. 2005]).


(b) Precision at 100-image recall on the Animals dataset ([Berg et al. 2005]).

Using image content alone, our approach provides accurate re-ranking results, and for some classes improves precision more than methods employing both text and image features.
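For reference, the precision-at-k measure reported in these plots can be computed as in the short sketch below; the relevance labels would come from the datasets' ground-truth annotations.

    # Precision at k for a re-ranked image list: of the top k images returned by
    # the ranker, what fraction are truly relevant to the keyword?

    def precision_at_k(ranked_relevance, k=100):
        """ranked_relevance: list of 0/1 ground-truth labels in ranked order."""
        top = ranked_relevance[:k]
        return sum(top) / float(len(top)) if top else 0.0

    # Toy example: 60 relevant images among the top 100.
    print(precision_at_k([1] * 60 + [0] * 40 + [1] * 20, k=100))   # 0.6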

In contrast to previous approaches to learning from web search returns ([Fergus et al. 2005], [Berg et al. 2005], [Schroff et al.]), we use image features alone and directly obtain discriminative classifiers by explicitly accounting for the noise in the search returns.

Semantic Robot Vision Challenge

The Semantic Robot Vision Challenge (SRVC) is an annual competition in which fully autonomous robots receive a text list of objects they are to find. The robots use the web to automatically gather image examples of those objects and learn visual models, which are then used to identify the objects seen through the robots' cameras.

We entered our multiple-instance learning approach in the 2008 edition of the competition. The text list contained about 20 objects, an even mix of specific objects, such as the DVD "300", and object categories, such as "fax machine".

We automatically downloaded web images from multiple search engines and used our approach to build classifiers for all 20 objects, setting the parameters automatically. To localize objects in the test images, we used a sliding-window approach and returned the bounding box that received the largest response from the classifiers.
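A rough sketch of this localization step is given below; the window sizes, stride, and toy scoring function are illustrative placeholders rather than the settings used in the competition.

    # Sketch of sliding-window localization: score every window with the learned
    # classifier and return the highest-scoring bounding box.

    def best_window(image_width, image_height, score_window,
                    window_sizes=((64, 64), (128, 128), (256, 256)), stride=16):
        best_score, best_box = float("-inf"), None
        for w, h in window_sizes:
            for x in range(0, image_width - w + 1, stride):
                for y in range(0, image_height - h + 1, stride):
                    s = score_window(x, y, w, h)
                    if s > best_score:
                        best_score, best_box = s, (x, y, w, h)
        return best_box, best_score

    # Toy scorer that prefers windows near the image center.
    def toy_score(x, y, w, h, cx=320, cy=240):
        return -abs(x + w / 2 - cx) - abs(y + h / 2 - cy)

    print(best_window(640, 480, toy_score))

The results on the challenge follow.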

Figure: Results on the SRVC challenge. The "spam" example was found within the allotted 30 minutes, while the rest were found when the algorithm was allowed to complete its run (37 minutes).

Below are results from the practice round, which the organizers ran under the same settings as the final round but with a different set of objects.

Figure: Results on the practice round of the SRVC challenge.

Finally, these are the results from the qualification round.

Figure: Results on the qualification round of the SRVC challenge.

In addition to specific objects, which have little variation in their appearance, our method was able to find a large number of generic object categories such as "electric iron", "vacuum cleaner", "calculator", "brown pen", and "apple", despite the noisy data used to train the classifiers.

Publication

Keywords to Visual Categories: Multiple-Instance Learning for Weakly Supervised Object Categorization,
Sudheendra Vijayanarasimhan and Kristen Grauman, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
[paper] [poster] [code, readme]


Sudheendra 2009-03-21