Large-Scale Live Active Learning: Training Object Detectors with Crawled Data and Crowds

Sudheendra Vijayanarasimhan and Kristen Grauman
Department of Computer Sciences,
University of Texas at Austin


Problem

Active learning and crowd-sourced labeling are promising ways to efficiently build up training sets for object recognition, but thus far such techniques have been tested only in artificially controlled settings.

Specifically, prior work draws from a fixed, pre-collected pool of unlabeled data and assumes clean labels from a single reliable oracle, sidestepping the realities of crawled data and crowd annotators.

Goal

Our goal in this work is to take crowd-sourced active annotation out of the "sandbox" and automate object detector training.

Given just the name of a category, we present an approach for "live learning" of a detector for that category.

Image livelearning5

Rather than fill the data pool with some canned dataset, the system itself gathers potentially relevant images via keyword search on Flickr.

Throughout the procedure we do not intervene with what goes into the system's data pool, nor with the quality of annotations from the hundreds of online annotators.

Challenges

Large-scale active selection: each round of active learning must score an enormous pool of candidate windows drawn from crawled images, so exhaustively evaluating the classifier on every candidate is intractable.

Image largescale5



Algorithm Overview

We introduce a novel part-based detector amenable to linear classifiers, and show how to identify its most uncertain instances in sub-linear time with our recently proposed hashing-based solution.

Linear classification: Image smpmodel
Candidate window generation: Image jwin
Large-scale active selection: Image hashing
Annotation collection: Image mturk
  • We design a part-based object representation such that a simple linear classifier will be adequate for robust detection.

  • Given
    • a root window $r$,
    • multiple part windows $p_1, \ldots, p_P$ that overlap the root,
    • and context windows $c_1, \ldots, c_C$
    surrounding an object, we concatenate max-pooled responses from a sparse coding of the features within each window (see the sketch below).
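
Concretely, the descriptor is the concatenation $[\phi(r), \phi(p_1), \ldots, \phi(p_P), \phi(c_1), \ldots, \phi(c_C)]$, where each $\phi(\cdot)$ max-pools the sparse codes of the local features inside one window. Below is a minimal sketch of just this pooling-and-concatenation step; the codes, window counts, and dimensions are toy placeholders, not our actual feature pipeline.

    import numpy as np

    def max_pool_codes(codes):
        """codes: (n_local_features, K) sparse-code matrix for one window."""
        return codes.max(axis=0)          # one K-dim pooled vector per window

    def object_descriptor(root, parts, contexts):
        """Concatenate pooled codes of the root, part, and context windows."""
        windows = [root] + list(parts) + list(contexts)
        return np.concatenate([max_pool_codes(c) for c in windows])

    # Toy usage with random stand-in "sparse codes" (K = 8 codewords):
    rng = np.random.default_rng(0)
    root = rng.random((50, 8))                           # root window r
    parts = [rng.random((20, 8)) for _ in range(3)]      # p_1 ... p_P, P = 3
    contexts = [rng.random((30, 8)) for _ in range(2)]   # c_1 ... c_C, C = 2
    x = object_descriptor(root, parts, contexts)         # length (1+3+2)*8 = 48

The resulting vector feeds directly into a linear SVM, which is why a simple linear classifier suffices.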

  • We use a grid-based variant of Hough-like projections to generate candidate object windows from unlabeled images
  • We divide each training image window into an $N \times M$ grid and record triplets (visual word, grid location, bounding box)
  • We rank triplets by how frequently they occur at a particular grid location and take the 3000 top-ranked boxes from each unlabeled image (see the sketch below)
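
What follows is a minimal sketch of the jumping-window bookkeeping, under two simplifying assumptions: boxes are stored and returned as-is rather than projected relative to the matched feature's position, and the raw triplet frequency from training is used directly as the ranking score.

    from collections import Counter, defaultdict

    def build_jump_table(train_triplets):
        """train_triplets: iterable of (visual_word, grid_cell, box) tuples
        harvested from labeled training windows."""
        table, votes = defaultdict(list), Counter()
        for word, cell, box in train_triplets:
            table[(word, cell)].append(box)
            votes[(word, cell)] += 1
        return table, votes

    def candidate_windows(test_pairs, table, votes, k=3000):
        """test_pairs: (visual_word, grid_cell) occurrences found in one
        unlabeled image; returns the boxes of the k most reliable triplets."""
        scored = [(votes[key], box)
                  for key in test_pairs
                  for box in table.get(key, [])]
        scored.sort(key=lambda t: -t[0])   # frequent triplets "jump" first
        return [box for _, box in scored[:k]]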
  • We initialize the online active learning system with a linear SVM trained with a small number of labeled examples.

  • We hash all generated unlabeled windows into a hash table using our hyperplane hash function.

  • During active selection, we hash the linear detector itself directly to the bin containing the most useful examples, i.e., those nearest the hyperplane (see the sketch below)
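
A minimal sketch of the two-bit hyperplane hash (dimensions, seed, and stand-in data are illustrative): database points get two signed random projections, while the query hyperplane negates its second bit, so points nearly perpendicular to the detector's normal, i.e., the windows closest to the decision boundary, collide with the query with the highest probability.

    import numpy as np

    rng = np.random.default_rng(0)

    def make_hyperplane_hash(dim):
        u, v = rng.standard_normal(dim), rng.standard_normal(dim)
        def h_point(a):                  # hash for a database point a
            return (np.sign(u @ a), np.sign(v @ a))
        def h_query(w):                  # hash for a query hyperplane normal w
            return (np.sign(u @ w), -np.sign(v @ w))
        return h_point, h_query

    # Hash a stand-in pool of unlabeled windows, then probe with the SVM's w:
    h_point, h_query = make_hyperplane_hash(dim=128)
    pool = rng.standard_normal((10000, 128))
    table = {}
    for x in pool:
        table.setdefault(h_point(x), []).append(x)
    w = rng.standard_normal(128)               # stand-in linear detector
    uncertain = table.get(h_query(w), [])      # bucket of uncertain windows

In practice several independent hash bits and multiple tables are combined; a single two-bit table is shown here only to keep the example short.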
  • We post the selected images on Mechanical Turk.

  • We provide annotators with multiple options in the labeling interface to avoid incorrect boxes.

  • We post the same image to multiple (5-10) annotators and take their consensus (see the sketch below).
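
For illustration only, and not necessarily the exact consensus rule we use: a coordinate-wise median of the submitted boxes is one simple way to stay robust to a minority of careless annotations.

    import numpy as np

    def consensus_box(boxes):
        """boxes: list of (x1, y1, x2, y2) from different annotators."""
        return tuple(np.median(np.asarray(boxes, dtype=float), axis=0))

    # Five annotators, one careless; the median ignores the outlier:
    boxes = [(10, 12, 98, 140), (11, 10, 101, 138), (9, 13, 97, 142),
             (12, 11, 100, 139), (60, 70, 200, 260)]
    print(consensus_box(boxes))   # (11.0, 12.0, 100.0, 140.0)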



Summary: Live Learning

Image live-summary

The main loop (see the sketch below) consists of:
  • using the current classifier to generate candidate jumping windows,
  • storing all candidates in a hash table,
  • querying the hash table with the hyperplane classifier,
  • giving the actively selected examples to online annotators,
  • taking their responses as new ground-truth labeled data,
  • and updating the classifier.
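
An illustrative skeleton of one round of this loop; every argument is a callable standing in for a component sketched above, so this captures the control flow rather than our actual implementation.

    def live_learning_round(w, pool, h_point, h_query, ask_annotators, retrain):
        """One round: hash the candidate windows, probe with the detector w,
        crowd-label the probed bucket, and retrain the linear SVM."""
        table = {}
        for x in pool:                         # store all candidates
            table.setdefault(h_point(x), []).append(x)
        selected = table.get(h_query(w), [])   # actively selected examples
        labels = ask_annotators(selected)      # consensus crowd labels
        return retrain(w, selected, labels)    # updated classifier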



Results

We evaluate our approach both on the benchmark PASCAL VOC 2007 dataset and by running the live learning process on Flickr.

State-of-the-Art Comparison

  Method        classif.   parts  feats     cands  aero. bicyc. bird  boat  bottl  bus   car   cat   chair  cow   dinin.  dog   horse  motor.  person  potte.  sheep  sofa  train  tvmon.  Mean
  Ours          linear     yes    single    jump   48.4  48.3   14.1  13.6  15.3   43.9  49.0  30.7  11.6   30.3  13.3    21.8  43.6   45.0    18.2    11.1    28.8   33.0  47.7   43.0    30.5
  LSVM+HOG [2]  nonlinear  yes    single    slide  32.8  56.8   2.5   16.8  28.5   39.7  51.6  21.3  17.9   18.5  25.9    8.8   49.2   41.2    36.8    14.6    16.2   24.4  39.2   39.1    29.1
  SP+MKL [3]    nonlinear  no     multiple  jump   37.6  47.8   15.3  15.3  21.9   50.7  50.6  30.0  17.3   33.0  22.5    21.5  51.2   45.5    23.3    12.4    23.9   28.5  45.3   48.5    32.1

Using only a linear classifier, which is far faster to train, our results are competitive with the state of the art (better on 6 of the 20 classes).

Active Learning on PASCAL

Image all-20train-ap1
[2] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object Detection with Discriminatively Trained Part-Based Models. TPAMI, 2009.

[3] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman. Multiple Kernels for Object Detection. In ICCV, 2009.


Active selection outperforms the passive baseline, and we come close to state-of-the-art results using only one third of the training data.

Live Learning on Flickr

We ran the live learning process for 6 of the most challenging PASCAL categories, automatically downloading example images via keyword search on Flickr. The system obtained ground truth for the actively selected images automatically by posting them to Mechanical Turk.

Image all-flickrtest-ap-wind

We obtain dramatic improvements for most categories, and active selection outperforms the passive alternative on 4 of the 6 categories.

Comparison to Previous Best

  Method         aeroplane  bird   boat   cat   dog    sheep  sofa  train
  Ours           48.4       15.8*  18.9*  30.7  25.3*  28.8   33.0  47.7
  Previous best  37.6       15.3   16.8   30.0  21.5   23.9   28.5  45.3

  * uses extra Flickr data automatically obtained by our system.

Our best results on the PASCAL VOC 2007 detection test set. Our method outperforms the previous state of the art on 8 of 20 categories.

Example Detections

True positives: Image truepos
False positives: Image falsepos

Computation Time


  Method          Active selection  Training  Detection (per image)
  Ours + active   10 mins           5 mins    150 secs
  Ours + passive  0 mins            5 mins    150 secs
  LSVM [2]        3 hours           4 hours   2 secs
  SP+MKL [3]      93 hours          > 2 days  67 secs

Run-time comparison of each stage of our detector against the passive baseline and other state-of-the-art detectors. Our detection time is mostly spent pooling the sparse codes. Active selection times for [2, 3] are estimates assuming a linear scan over the pool. The efficiency with which our approach selects useful images and retrains the classifier is what makes live learning practical.

Conclusions

Our contributions are a part-based detector amenable to linear classification, jumping-window candidate generation, sub-linear active selection via hyperplane hashing, and automatic annotation collection on Mechanical Turk. Tying all these parts together, we demonstrate an effective end-to-end system for online learning of object detectors.

Publication

Large-Scale Live Active Learning: Training Object Detectors with Crawled Data and Crowds,
S. Vijayanarasimhan and K. Grauman, in CVPR 2011
[paper]

Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning,
P. Jain, S. Vijayanarasimhan and K. Grauman, in NIPS 2010
[paper, supplementary]