Large-Scale Live Active Learning: Training Object Detectors with Crawled Data and Crowds

Sudheendra Vijayanarasimhan and Kristen Grauman
Department of Computer Sciences,
University of Texas at Austin


Problem

Active learning and crowd-sourced labeling are promising ways to efficiently build up training sets for object recognition, but thus far such techniques have been tested only in artificially controlled settings.

Specifically, prior work draws from a fixed, pre-collected pool of unlabeled data and assumes clean labels from a single reliable oracle, sidestepping the realities of crawled data and crowd annotators.

Goal

Our goal in this work is to take crowd-sourced active annotation out of the "sandbox" and automate object detector training.

Given just the name of a category, we present an approach for "live learning" of a detector for that category.

Image livelearning5

Rather than fill the data pool with some canned dataset, the system itself gathers potentially relevant images via keyword search on Flickr.

Throughout the procedure we do not intervene with what goes into the system's data pool, nor with the quality of annotations from the hundreds of online annotators.

Challenges

Large-scale active selection: each round of active learning must score an enormous pool of candidate windows drawn from crawled images, so exhaustively evaluating the classifier on every candidate is intractable.

Image largescale5



Algorithm Overview

We introduce a novel part-based detector amenable to linear classifiers, and show how to identify its most uncertain instances in sub-linear time with our recently proposed hashing-based solution.

Linear classification: Image smpmodel
Candidate window generation: Image jwin
Large-scale active selection: Image hashing
Annotation collection: Image mturk
  • We design a part-based object representation such that a simple linear classifier will be adequate for robust detection.

  • Given
    • a root window $r$,
    • multiple part windows $p_1, \ldots, p_P$ that overlap the root,
    • and context windows $c_1, \ldots, c_C$
    surrounding an object, we concatenate max-pooled responses from a sparse coding of the features within each window (see the sketch below).
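
Concretely, the descriptor is the concatenation $[\phi(r), \phi(p_1), \ldots, \phi(p_P), \phi(c_1), \ldots, \phi(c_C)]$, where each $\phi(\cdot)$ max-pools the sparse codes of the local features inside one window. Below is a minimal sketch of just this pooling-and-concatenation step; the codes, window counts, and dimensions are toy placeholders, not our actual feature pipeline.

    import numpy as np

    def max_pool_codes(codes):
        """codes: (n_local_features, K) sparse-code matrix for one window."""
        return codes.max(axis=0)          # one K-dim pooled vector per window

    def object_descriptor(root, parts, contexts):
        """Concatenate pooled codes of the root, part, and context windows."""
        windows = [root] + list(parts) + list(contexts)
        return np.concatenate([max_pool_codes(c) for c in windows])

    # Toy usage with random stand-in "sparse codes" (K = 8 codewords):
    rng = np.random.default_rng(0)
    root = rng.random((50, 8))                           # root window r
    parts = [rng.random((20, 8)) for _ in range(3)]      # p_1 ... p_P, P = 3
    contexts = [rng.random((30, 8)) for _ in range(2)]   # c_1 ... c_C, C = 2
    x = object_descriptor(root, parts, contexts)         # length (1+3+2)*8 = 48

The resulting vector feeds directly into a linear SVM, which is why a simple linear classifier suffices.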

  • We use a grid-based variant of Hough-like projections to generate candidate object windows from unlabeled images
  • We divide each training image window into an $N \times M$ grid and record triplets (visual word, grid location, bounding box)
  • We rank triplets by how frequently they occur at a particular grid location and take the 3000 top-ranked boxes from each unlabeled image (see the sketch below)
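
What follows is a minimal sketch of the jumping-window bookkeeping, under two simplifying assumptions: boxes are stored and returned as-is rather than projected relative to the matched feature's position, and the raw triplet frequency from training is used directly as the ranking score.

    from collections import Counter, defaultdict

    def build_jump_table(train_triplets):
        """train_triplets: iterable of (visual_word, grid_cell, box) tuples
        harvested from labeled training windows."""
        table, votes = defaultdict(list), Counter()
        for word, cell, box in train_triplets:
            table[(word, cell)].append(box)
            votes[(word, cell)] += 1
        return table, votes

    def candidate_windows(test_pairs, table, votes, k=3000):
        """test_pairs: (visual_word, grid_cell) occurrences found in one
        unlabeled image; returns the boxes of the k most reliable triplets."""
        scored = [(votes[key], box)
                  for key in test_pairs
                  for box in table.get(key, [])]
        scored.sort(key=lambda t: -t[0])   # frequent triplets "jump" first
        return [box for _, box in scored[:k]]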
  • We initialize the online active learning system with a linear SVM trained with a small number of labeled examples.

  • We hash all generated unlabeled windows into a hash table using our hyperplane hash function.

  • During active selection, we hash the linear detector itself directly to the bin containing the most useful examples, i.e., those nearest the hyperplane (see the sketch below)
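
A minimal sketch of the two-bit hyperplane hash (dimensions, seed, and stand-in data are illustrative): database points get two signed random projections, while the query hyperplane negates its second bit, so points nearly perpendicular to the detector's normal, i.e., the windows closest to the decision boundary, collide with the query with the highest probability.

    import numpy as np

    rng = np.random.default_rng(0)

    def make_hyperplane_hash(dim):
        u, v = rng.standard_normal(dim), rng.standard_normal(dim)
        def h_point(a):                  # hash for a database point a
            return (np.sign(u @ a), np.sign(v @ a))
        def h_query(w):                  # hash for a query hyperplane normal w
            return (np.sign(u @ w), -np.sign(v @ w))
        return h_point, h_query

    # Hash a stand-in pool of unlabeled windows, then probe with the SVM's w:
    h_point, h_query = make_hyperplane_hash(dim=128)
    pool = rng.standard_normal((10000, 128))
    table = {}
    for x in pool:
        table.setdefault(h_point(x), []).append(x)
    w = rng.standard_normal(128)               # stand-in linear detector
    uncertain = table.get(h_query(w), [])      # bucket of uncertain windows

In practice several independent hash bits and multiple tables are combined; a single two-bit table is shown here only to keep the example short.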
  • We post the selected images on Mechanical Turk.

  • We provide annotators with multiple options in the labeling interface to avoid incorrect boxes.

  • We post the same image to multiple (5-10) annotators and take their consensus (see the sketch below).
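
For illustration only, and not necessarily the exact consensus rule we use: a coordinate-wise median of the submitted boxes is one simple way to stay robust to a minority of careless annotations.

    import numpy as np

    def consensus_box(boxes):
        """boxes: list of (x1, y1, x2, y2) from different annotators."""
        return tuple(np.median(np.asarray(boxes, dtype=float), axis=0))

    # Five annotators, one careless; the median ignores the outlier:
    boxes = [(10, 12, 98, 140), (11, 10, 101, 138), (9, 13, 97, 142),
             (12, 11, 100, 139), (60, 70, 200, 260)]
    print(consensus_box(boxes))   # (11.0, 12.0, 100.0, 140.0)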



Summary: Live Learning

Image live-summary

The main loop (see the sketch below) consists of:
  • using the current classifier to generate candidate jumping windows,
  • storing all candidates in a hash table,
  • querying the hash table with the hyperplane classifier,
  • giving the actively selected examples to online annotators,
  • taking their responses as new ground-truth labeled data,
  • and updating the classifier.
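
An illustrative skeleton of one round of this loop; every argument is a callable standing in for a component sketched above, so this captures the control flow rather than our actual implementation.

    def live_learning_round(w, pool, h_point, h_query, ask_annotators, retrain):
        """One round: hash the candidate windows, probe with the detector w,
        crowd-label the probed bucket, and retrain the linear SVM."""
        table = {}
        for x in pool:                         # store all candidates
            table.setdefault(h_point(x), []).append(x)
        selected = table.get(h_query(w), [])   # actively selected examples
        labels = ask_annotators(selected)      # consensus crowd labels
        return retrain(w, selected, labels)    # updated classifier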



Results

We evaluate our approach both on the benchmark PASCAL VOC 2007 dataset and by running the live learning process on Flickr.

State-of-the-Art Comparison

  Method        classif.   parts  feats     cands  aero. bicyc. bird  boat  bottl  bus   car   cat   chair  cow   dinin.  dog   horse  motor.  person  potte.  sheep  sofa  train  tvmon.  Mean
  Ours          linear     yes    single    jump   48.4  48.3   14.1  13.6  15.3   43.9  49.0  30.7  11.6   30.3  13.3    21.8  43.6   45.0    18.2    11.1    28.8   33.0  47.7   43.0    30.5
  LSVM+HOG [2]  nonlinear  yes    single    slide  32.8  56.8   2.5   16.8  28.5   39.7  51.6  21.3  17.9   18.5  25.9    8.8   49.2   41.2    36.8    14.6    16.2   24.4  39.2   39.1    29.1
  SP+MKL [3]    nonlinear  no     multiple  jump   37.6  47.8   15.3  15.3  21.9   50.7  50.6  30.0  17.3   33.0  22.5    21.5  51.2   45.5    23.3    12.4    23.9   28.5  45.3   48.5    32.1

Using only a linear classifier, which is far faster to train, our results are competitive with the state of the art (better on 6 of the 20 classes).

Active Learning on PASCAL

Image all-20train-ap1
[2] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan. Object Detection with Discriminatively Trained Part-Based Models. TPAMI, 2009.

[3] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman. Multiple Kernels for Object Detection. In ICCV, 2009.


Active selection outperforms the passive baseline, and we come close to state-of-the-art results using only one third of the training data.

Live Learning on Flickr

We ran the live learning process for 6 of the most challenging PASCAL categories, automatically downloading example images via keyword search on Flickr. The system obtained ground truth for the actively selected images automatically by posting them to Mechanical Turk.

Image all-flickrtest-ap-wind

We obtain dramatic improvements for most categories, and active selection outperforms the passive alternative on 4 of the 6 categories.

Comparison to Previous Best

  Method         aeroplane  bird   boat   cat   dog    sheep  sofa  train
  Ours           48.4       15.8*  18.9*  30.7  25.3*  28.8   33.0  47.7
  Previous best  37.6       15.3   16.8   30.0  21.5   23.9   28.5  45.3

  * uses extra Flickr data automatically obtained by our system.

Our best results on the PASCAL VOC 2007 detection test set. Our method outperforms the previous state of the art on 8 of 20 categories.

Example Detections

True positives: Image truepos
False positives: Image falsepos

Computation Time


  Method          Active selection  Training  Detection (per image)
  Ours + active   10 mins           5 mins    150 secs
  Ours + passive  0 mins            5 mins    150 secs
  LSVM [2]        3 hours           4 hours   2 secs
  SP+MKL [3]      93 hours          > 2 days  67 secs

Run-time comparison of each stage of our detector against the passive baseline and other state-of-the-art detectors. Our detection time is mostly spent pooling the sparse codes. Active selection times for [2, 3] are estimates assuming a linear scan over the pool. The efficiency with which our approach selects useful images and retrains the classifier is what makes live learning practical.

Conclusions

Our contributions are a part-based detector amenable to linear classification, jumping-window candidate generation, sub-linear active selection via hyperplane hashing, and automatic annotation collection on Mechanical Turk. Tying all these parts together, we demonstrate an effective end-to-end system for online learning of object detectors.

Publication

Large-Scale Live Active Learning: Training Object Detectors with Crawled Data and Crowds,
S. Vijayanarasimhan and K. Grauman, in CVPR 2011
[paper]

Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning,
P. Jain, S. Vijayanarasimhan and K. Grauman, in NIPS 2010
[paper, supplementary]