Collect-Cut: Segmentation with Top-Down Cues Discovered in Multi-Object Images
Yong Jae Lee and Kristen Grauman
Summary
We present a method to segment a collection of unlabeled images while exploiting automatically discovered appearance patterns shared between them. Given an unlabeled pool of multi-object images, we first detect any visual clusters present among their sub-regions, where inter-region similarity is measured according to both appearance and contextual layout. Then, using each initial segment as a seed, we solve a graph cuts problem to refine its boundary, enforcing preferences to include nearby regions that agree with an ensemble of representative regions discovered for that cluster and to exclude regions that resemble familiar objects. Through extensive experiments, we show that the segmentations computed jointly on the collection agree more closely with true object boundaries than those produced by either a bottom-up baseline or a graph cuts foreground segmentation that can access cues from only a single image.
System Overview
We use graph cuts to minimize an energy function that encodes top-down object cues, entropy-based background cues, and neighborhood smoothness constraints. A node in the graph corresponds to a superpixel, and an edge between two nodes carries the cost of a cut between the two superpixels. In this example, suppose the familiar object categories are building and road.
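The labeling itself reduces to a binary s-t min-cut over the superpixel graph. Below is a minimal sketch of that step, assuming the PyMaxflow library as the solver; the paper does not prescribe an implementation, and all names here (collect_cut_label, unary_fg, and so on) are illustrative.

import maxflow  # PyMaxflow; an assumed choice of max-flow solver
import numpy as np

def collect_cut_label(unary_fg, unary_bg, edges, edge_weights):
    # unary_fg[i]: cost of labeling superpixel i as the object
    # unary_bg[i]: cost of labeling superpixel i as background
    # edges: (i, j) pairs of adjacent superpixels
    # edge_weights: smoothness cost paid when i and j take different labels
    n = len(unary_fg)
    g = maxflow.Graph[float]()
    nodes = g.add_nodes(n)
    for i in range(n):
        # Terminal edges encode the data terms; per PyMaxflow's convention,
        # the source side (segment 0) is taken here as foreground.
        g.add_tedge(nodes[i], unary_bg[i], unary_fg[i])
    for (i, j), w in zip(edges, edge_weights):
        # Pairwise edges encode the smoothness term (symmetric capacities).
        g.add_edge(nodes[i], nodes[j], w, w)
    g.maxflow()
    return np.array([g.get_segment(nodes[i]) == 0 for i in range(n)])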
(a) A set of k clusters of regions. We employ our algorithm for "context-aware" visual category discovery (the Object-Graphs method listed below) to map an unlabeled collection of images to a set of clusters.
(b) An initial region from the pool generated by multiple segmentations.
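Such a pool can be produced by running a bottom-up segmenter several times with varied parameters. Here is an illustrative sketch using scikit-image's Felzenszwalb segmenter; the choice of segmenter and the parameter grid are our assumptions, not the paper's.

import numpy as np
from skimage.segmentation import felzenszwalb  # an assumed segmenter

def region_pool(image, scales=(50, 100, 200, 400)):
    # Each run at a different scale yields a different over-segmentation;
    # every resulting segment becomes one candidate region in the pool.
    masks = []
    for scale in scales:
        labels = felzenszwalb(image, scale=scale, sigma=0.8, min_size=50)
        masks.extend(labels == l for l in np.unique(labels))
    return masks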
(c) Ensemble cluster exemplars, which we use to encode top-down cues. For each cluster, we extract r representative region exemplars to serve as its top-down model of appearance. Though individually the ensemble's regions may fall short of covering an entire object, as a group they represent the variable appearances that arise among instances within a category. When refining a region's boundaries, the idea is to treat resemblance to any one of the representative ensemble regions as support for the object of interest.
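A minimal sketch of that cue, assuming regions are described by L1-normalized histograms compared with the chi-squared distance; the histogram features and the exponential mapping are illustrative assumptions.

import numpy as np

def chi2(h1, h2, eps=1e-10):
    # Chi-squared distance between two L1-normalized histograms.
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def object_support(region_hist, exemplar_hists):
    # Resemblance to ANY one exemplar counts as support, so we score
    # the region by its single best match within the ensemble.
    best = min(chi2(region_hist, e) for e in exemplar_hists)
    return np.exp(-best)  # in (0, 1]; near 1 when some exemplar matches well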
(d) Background exemplars and an entropy map to encode a background preference for familiar objects. Darker regions are more "known", i.e., more likely to be background.
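A sketch of how such a "knownness" map could be computed, assuming each superpixel carries a posterior over the familiar categories; the classifier producing these posteriors is outside the sketch.

import numpy as np

def knownness(posteriors, eps=1e-10):
    # posteriors: (n_superpixels, n_familiar_classes), rows summing to 1.
    # A confident (low-entropy) prediction marks a familiar, "known"
    # region; we return 1 at zero entropy and 0 at maximum entropy.
    p = np.clip(posteriors, eps, 1.0)
    entropy = -np.sum(p * np.log(p), axis=1)
    return 1.0 - entropy / np.log(p.shape[1])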
(e) Soft boundary map produced by the gPb detector. Our smoothness term favors assigning the same label to neighboring superpixels that have similar color and texture and a low probability of an intervening contour.
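One plausible form of such a pairwise weight; the feature distance, the exponential kernel, and the way the gPb probability enters are all chosen for illustration.

import numpy as np

def smoothness_weight(feat_i, feat_j, pb_ij, beta=1.0):
    # feat_*: color/texture descriptors of the two superpixels;
    # pb_ij: gPb probability of a contour along their shared boundary.
    # A large weight is expensive to cut, so similar-looking neighbors
    # with a weak intervening contour prefer the same label.
    similarity = np.exp(-beta * np.sum((feat_i - feat_j) ** 2))
    return similarity * (1.0 - pb_ij)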
(f) Our final refined segmentation for the region under consideration. Note that a single-image graph cuts segmentation using the initial seed region as foreground and the remaining regions as background would likely have oversegmented the car, since the top half of the car has a different appearance from the seed region.
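Tying the sketches above together, a hypothetical end-to-end pass might look as follows; it reuses the functions defined earlier, and the trade-off weight of 0.5 is an arbitrary illustrative value, not the paper's setting.

import numpy as np

# Assumed inputs: hists/feats per superpixel, exemplar_hists for the seed's
# cluster, posteriors over familiar classes, pb contour probabilities, and
# the superpixel adjacency list `edges`.
unary_fg = np.array([1.0 - object_support(h, exemplar_hists) for h in hists])
unary_bg = 1.0 - knownness(posteriors)  # "known" regions are cheap background
weights = [0.5 * smoothness_weight(feats[i], feats[j], pb[i, j])
           for (i, j) in edges]
fg_mask = collect_cut_label(unary_fg, unary_bg, edges, weights)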
Results
We tested our method on two datasets: MSRC-v2 and MSRC-v0.
Quantitative Results
The figure above shows segmentation overlap scores for both datasets, when tested (a) with the context of familiar objects or (b) without. Higher values are better; a score of 1 would mean 100% pixel-for-pixel agreement with the ground truth object segmentation. By collectively segmenting the images, our method's results (right box-plots) align substantially better with the true object boundaries, compared to both the initial bottom-up multiple segmentations (left box-plots) and a graph cuts baseline that can use cues from only a single image at a time (middle box-plots).
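For reference, a sketch of the overlap score, assuming it is the standard intersection-over-union between a predicted segment and the ground-truth object mask.

import numpy as np

def overlap(pred_mask, gt_mask):
    # 1.0 means pixel-for-pixel agreement with the ground truth.
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return inter / union if union else 0.0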
The figure above shows the impact of collective segmentation on discovery accuracy, as evaluated by the F-measure (higher values are better). For discovery, we plug in both (a) our context-aware clustering algorithm and (b) an appearance-only discovery method. In both cases, using our Collect-Cut algorithm to refine the original bottom-up segments yields more accurate grouping.
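The F-measure is the harmonic mean of precision and recall; how precision and recall are computed for discovered clusters follows the paper's protocol, which this small helper does not reproduce.

def f_measure(precision, recall):
    # Harmonic mean of precision and recall; defined as 0 at (0, 0).
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)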
Qualitative Results
Qualitative comparison: our results vs. the best corresponding segment available in the pool of multiple segmentations. The first 8 columns are examples where our method performs well, extracting the true object boundaries much more closely than the bottom-up segmentation can. The last 2 columns show failure cases for our method. It does not perform as well on images in which the multiple objects have very similar color/texture, or when the ensembles are too noisy.
Examples of high-quality multi-object segmentation results. We aggregate our method's refined object regions into a single image-level segmentation.
Publication
Collect-Cut: Segmentation with Top-Down Cues Discovered in Multi-Object Images [pdf] [supp] [data]
Yong Jae Lee and Kristen Grauman
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, June 2010.
Related Paper:
Object-Graphs for Context-Aware Category Discovery [pdf] [project page]
Yong Jae Lee and Kristen Grauman
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, June 2010.