Efficient Region Search For Object Detection

Sudheendra Vijayanarasimhan and Kristen Grauman
Department of Computer Sciences,
University of Texas at Austin

Problem

Recent work in object detection offers various ways to avoid sliding windows and improve localization efficiency. However, they assume that the object fits well inside rectangular or coarse polygonal regions. This is rather restrictive because

(a) A rectangle is imprecise

(b) Extra features within a window may actually mislead the detector at test time, causing it to miss the object entirely

Image with local features mapped to a linear cat classifier's responses. Dark red dots denote negatively weighted features; light green dots denote positively weighted features.

Goal

A region-based detector that sums classifier responses within connected subregions can more accurately detect non-boxy objects.

Our main contribution is to show how to obtain this best-scoring region efficiently.

Given a test image,

we first divide it into segments and construct a region-graph: the nodes are segments, and the edges link any two adjacent segments.
we then compute a score for each component region that indicates how much it could contribute to a possible detection of the object of interest.

The goal is to identify the subset of connected regions (nodes) whose summed scores are maximal.

Algorithm Overview

Background: Linear SVM with Bag of Words

Using a linear kernel SVM applied to a bag-of-features representation the classifier response for a region can be rewritten as:

$\displaystyle f(R) = \beta_v + \sum_{j=1}^K w_v^j h_v^j(R) = \beta_v + \sum_{i=1}^{N} w_v^{c_i},$

(1)

$h_v^j \rightarrow$ denotes the histogram count of visual word ,
$w_v^j \rightarrow$ the linear SVM weight,
$c_i \rightarrow$ is the index of the visual word that feature $\bm{v}_i$ maps to, $c_i \in [1,K]$ .

Thus, the score of a region is the sum of its features' word weights.

For detection, the goal is to determine the region within a novel image that maximizes the score:

$\displaystyle R^\ast = \argmax_{R \in I} f(R).$

This search problem can be reduced to finding the maximum-weight connected subgraph problem (MWCS) as follows:

Efficient Region Search (ERS)

Given a test image, we construct a region-graph , where is a set of vertices obtained from a superpixel segmentation and are the edges defined between any two superpixels that share a boundary. Now a candidate region is any subset of connected nodes in this graph, or in other words, a connected subgraph.
Each vertex $\mathrm{v} \in V$ has an associated weight, $\omega(\mathrm{v})$ , which represents its contribution to the classifier score. We obtain a vertex's weight from the superpixel's histogram obtained using:
- Point features: For this representation, each descriptor $\bm{v}_i$ is local point feature (SURF), the weight assigned to a superpixel vertex is the sum of the word-weights for all local features located within that superpixel: $\omega(\mathrm{v}) = \sum_{\bm{x}_i \in \mathrm{v}} w_v^{c_i}$ .
- Shape features: Alternatively, each superpixel is mapped to a single shape descriptor by describing each region with the histogram of oriented responses of a contour and edge detector.
The objective is now to identify the maximum-weight connected subgraph in the region graph, defined as the MWCS problem. The MWCS problem can be transformed into an instance of the prize-collecting Steiner tree problem (PCST), which is a well-known problem occurring frequently in operations research and networking, and various optimal and approximate solutions exist.
We also introduce edge weights between pairs of adjacent superpixels based on the strength of their intervening contour where the best-scoring distribution of contours are based on the sum of spatially distributed scores (the edge weights), and the internal contour properties are class-specific. This can be easily accommodated into the PCST instance. We call this approach ERS-C.
Once our region-graph is mapped to the PCST problem, we use the mathematical programming approach of [Ljubic et al. 2006] to identify the maximum weight connected subgraph. We use this algorithm because it provides solutions that are provably optimal, and we show it is very efficient in practice for most region-graphs of reasonable size (hundreds of superpixels).

Training

The visual word histogram weights can be learned directly using a linear SVM.
The contour histogram weights, , are learned in the structured SVM learning framework using the cutting plane algorithm with zero-one loss, where we define the loss to be one for regions with an intersection score of less than $50\%$ with the ground truth.

Results

We use three datasets: the PASCAL VOC 2007, the ETHZ Shapes, and the PASCAL VOC 2008 for evaluation.

Example Detections: ERS vs. ESS

Features
ESS
ERS
ERS-C

First row shows images with the sign of the point feature scores ( sign $(w_v^{c_i})$ ) superimposed: red dots denote negatively weighted features, green dots denote positive features (best viewed in color). Remaining rows show detections returned by ESS and ERS. Last row illustrates how ERS-C can exclude spurious background regions.

Both methods seek the region that will accumulate the most green points while avoiding including excessive red ones. However, since ESS is restricted to finding the max scoring rectangle, it often over/underestimates the object's extent. Our method provides precise arbitrarily shaped detections.

Shape Features: ETHZ

We applied our ERS method using either the point features (blue curves), or the shape features (green curves). Using shape features, we see dramatic gains for almost all objects, since the ETHZ objects have parts well-described by their shape (e.g., mug handle, bottle neck).

Figure: Detection accuracy (top row) and example detections by our method (bottom row) on the ETHZ categories.

State-of-Art: PASCAL 2008

We also compare our approach against the method of [Nowozin et al 2009], which provides an approximate algorithm for using a global connectivity potential within a standard CRF.

	aerop.	bicyc.	bird	boat	bottle	bus	car	cat	chair	cow	dinin.	dog	horse	motor.	person	potte.	sheep	sofa	train	tvmon.	mean
ERS	0.324	0.109	0.268	0.262	0.121	0.405	0.244	0.389	0.120	0.324	0.300	0.288	0.280	0.337	0.257	0.119	0.394	0.224	0.453	0.259	0.274
ERS-C	0.325	0.058	0.257	0.262	0.104	0.405	0.240	0.399	0.097	0.319	0.300	0.249	0.261	0.280	0.249	0.107	0.404	0.210	0.445	0.272	0.262
CRF [Nowozin et al. 2009]	0.380	0.091	0.202	0.275	0.115	0.391	0.185	0.311	0.121	0.236	0.269	0.244	0.209	0.268	0.194	0.075	0.249	0.200	0.393	0.152	0.228

Our ERS approach outperforms the baseline on 17 out of 20 categories, and improves the mean overlap accuracy significantly ( vs. ) which indicates the value of obtaining the optimal solution with our approach as opposed to the approximate inference procedure.

Computation Time

Our approach is very efficient in practice for all the datasets tested, showing the suitability of this PCST reduction for our problem setting. On average it converges in seconds, which is similar to ESS. The longest time taken over any single test image was secs.

Our search times on both PASCAL and ETHZ are similar to ESS's, and both are orders of magnitude faster than sliding window search. (Note time is on log scale.)

Conclusions

We introduced an efficient branch-and-cut method for region-based detection, and with three challenging datasets we demonstrated its advantages over both existing branch-and-bound methods that are limited to searching rectangles and a CRF model.
Our approach is the first that can efficiently identify the subregion that maximizes the additive detector's scoring function.
In future work we will examine the alternate classifiers accepted by our model.

Publication

Efficient Region Search for Object Detection,
S. Vijayanarasimhan and K. Grauman, in CVPR 2011
[paper, code, groundtruth for cat/dog]