Efficient Region Search For Object Detection

Sudheendra Vijayanarasimhan and Kristen Grauman
Department of Computer Sciences,
University of Texas at Austin


Problem

Recent work in object detection offers various ways to avoid sliding windows and improve localization efficiency. However, they assume that the object fits well inside rectangular or coarse polygonal regions. This is rather restrictive because

(a) A rectangle is imprecise    
(b) Extra features within a window may actually mislead the detector at test time, causing it to miss the object entirely
Image cfig2     Image cfig3 Image cfig4
Image with local features mapped to a linear cat classifier's responses. Dark red dots denote negatively weighted features; light green dots denote positively weighted features.

Goal

A region-based detector that sums classifier responses within connected subregions can more accurately detect non-boxy objects.

Image cfig5

Our main contribution is to show how to obtain this best-scoring region efficiently.

Given a test image,

The goal is to identify the subset of connected regions (nodes) whose summed scores are maximal.

Algorithm Overview

Background: Linear SVM with Bag of Words

Using a linear kernel SVM applied to a bag-of-features representation the classifier response for a region can be rewritten as:

$\displaystyle f(R) = \beta_v + \sum_{j=1}^K w_v^j h_v^j(R) =  \beta_v + \sum_{i=1}^{N} w_v^{c_i},$ (1)

Thus, the score of a region is the sum of its $ N$ features' word weights.

For detection, the goal is to determine the region within a novel image $ I$ that maximizes the score:

$\displaystyle R^\ast = \argmax_{R \in I}  f(R).$

This search problem can be reduced to finding the maximum-weight connected subgraph problem (MWCS) as follows:

Efficient Region Search (ERS)

Training

Results

We use three datasets: the PASCAL VOC 2007, the ETHZ Shapes, and the PASCAL VOC 2008 for evaluation.

Example Detections: ERS vs. ESS

Features Image 001762-feat Image 001890-feat Image 004658-feat Image 005200-feat Image 003310-dog-feat Image 006902-dog-feat Image 007515-dog-feat
ESS Image 001762-ess Image 001890-ess Image 004658-ess Image 005200-ess Image 003310-dog-ess Image 006902-dog-ess Image 007515-dog-ess
 ERS Image 001762-ers Image 001890-ers Image 004658-ers Image 005200-ers Image 003310-dog-ers Image 006902-dog-ers Image 007515-dog-ers
ERS-C Image 001762-ersc Image 001890-ersc Image 004658-ersc Image 005200-ersc Image 003310-dog-ersc Image 006902-dog-ersc Image 007515-dog-ersc

First row shows images with the sign of the point feature scores ( sign$ (w_v^{c_i})$) superimposed: red dots denote negatively weighted features, green dots denote positive features (best viewed in color). Remaining rows show detections returned by ESS and ERS. Last row illustrates how ERS-C can exclude spurious background regions.

Both methods seek the region that will accumulate the most green points while avoiding including excessive red ones. However, since ESS is restricted to finding the max scoring rectangle, it often over/underestimates the object's extent. Our method provides precise arbitrarily shaped detections.

Shape Features: ETHZ

We applied our ERS method using either the point features (blue curves), or the shape features (green curves). Using shape features, we see dramatic gains for almost all objects, since the ETHZ objects have parts well-described by their shape (e.g., mug handle, bottle neck).

Figure: Detection accuracy (top row) and example detections by our method (bottom row) on the ETHZ categories.
Image Applelogos-ersc-ess2 Image Bottles-ersc-ess2 Image Giraffes-ersc-ess2 Image Mugs-ersc-ess2 Image Swans-ersc-ess2        
Image tatoo-segs Image store-segs Image spiral-segs Image apple-store-segs Image spots-segs Image store3-segs Image stilllife-segs            

State-of-Art: PASCAL 2008

We also compare our approach against the method of [Nowozin et al 2009], which provides an approximate algorithm for using a global connectivity potential within a standard CRF.

  aerop. bicyc. bird boat bottle bus car cat chair cow dinin. dog horse motor. person potte. sheep sofa train tvmon. mean
ERS 0.324 0.109 0.268 0.262 0.121 0.405 0.244 0.389 0.120 0.324 0.300 0.288 0.280 0.337 0.257 0.119 0.394 0.224 0.453 0.259 0.274
ERS-C 0.325 0.058 0.257 0.262 0.104 0.405 0.240 0.399 0.097 0.319 0.300 0.249 0.261 0.280 0.249 0.107 0.404 0.210 0.445 0.272 0.262
CRF [Nowozin et al. 2009] 0.380 0.091 0.202 0.275 0.115 0.391 0.185 0.311 0.121 0.236 0.269 0.244 0.209 0.268 0.194 0.075 0.249 0.200 0.393 0.152 0.228

Our ERS approach outperforms the baseline on 17 out of 20 categories, and improves the mean overlap accuracy significantly ($ 0.274$ vs. $ 0.228$) which indicates the value of obtaining the optimal solution with our approach as opposed to the approximate inference procedure.

Computation Time

Our approach is very efficient in practice for all the datasets tested, showing the suitability of this PCST reduction for our problem setting. On average it converges in $ 0.29$ seconds, which is similar to ESS. The longest time taken over any single test image was $ 5.8$ secs.

Image pascal-ess-ers-times Image ess-ers-times  
Our search times on both PASCAL and ETHZ are similar to ESS's, and both are orders of magnitude faster than sliding window search. (Note time is on log scale.)

Conclusions

Publication

Efficient Region Search for Object Detection,
S. Vijayanarasimhan and K. Grauman, in CVPR 2011
[papercode, groundtruth for cat/dog]