Key-Segments for Video
Object Segmentation
Yong Jae Lee, Jaechul Kim, and Kristen Grauman
Summary
We present an approach to discover and segment
foreground object(s) in video. Given an unannotated
video sequence, the method first identifies object-like
regions in any frame according to both static and dynamic
cues. We then compute a series of binary partitions
among those candidate "key-segments" to discover hypothesis
groups with persistent appearance and motion. Finally,
using each ranked hypothesis in turn, we estimate a
pixel-level object labeling across all frames, where (a) the
foreground likelihood depends on both the hypothesis’s
appearance and a novel localization prior based on
partial shape matching, and (b) the background likelihood
depends on cues pulled from the key-segments’ (possibly
diverse) surroundings observed across the sequence.
Compared to existing methods, our approach automatically
focuses on the persistent foreground regions of interest while
resisting oversegmentation. We apply our method to
challenging benchmark videos, and show results that are
competitive with or better than the state of the art.
Approach
Our goal is to discover object-like
key-segments in an unlabeled video, and learn appearance and
shape models from them to automatically segment the foreground
objects. The main steps of our approach are shown below.
Algorithm
Overview. See ordered steps (a) through (e) below.
(a) To find
“object-like” regions among the proposals, we look for
regions that have (1) appearance cues typical to objects in
general, and (2) differences in motion patterns relative to
their surroundings. We define a score S(r) = A(r) + M(r) for a
region r, combining its static intra-frame appearance score A(r)
with its dynamic inter-frame motion score M(r).
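As a concrete illustration, here is a minimal sketch of such a score in Python, assuming each candidate region is a binary mask and dense optical flow is available for its frame; the appearance term A(r) is passed in from a hypothetical static objectness measure, and the motion term is a simple proxy for the paper's inter-frame cue:

    import numpy as np

    def motion_score(mask, flow):
        # M(r): how much the region's mean optical flow differs from the
        # mean flow of its surroundings (a simple proxy for the dynamic cue).
        fg = flow[mask].mean(axis=0)    # mean (dx, dy) inside the region
        bg = flow[~mask].mean(axis=0)   # mean (dx, dy) outside the region
        return np.linalg.norm(fg - bg)

    def object_like_score(appearance_score, mask, flow):
        # S(r) = A(r) + M(r); appearance_score stands in for A(r).
        return appearance_score + motion_score(mask, flow)

    # Toy usage: a small region moving right against a static background.
    flow = np.zeros((10, 10, 2)); mask = np.zeros((10, 10), bool)
    mask[4:6, 4:6] = True; flow[mask] = [3.0, 0.0]
    print(object_like_score(0.7, mask, flow))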
(b) Given the scored regions, we next
identify groups of key-segments that may represent a
foreground object in the video. We perform a form of
spectral clustering with the highest scored regions as input
to produce multiple binary inlier/outlier partitions of the
data. Each cluster (inlier set) is a hypothesis h of a foreground
object’s key-segments. We automatically rank the
clusters by the average object-like score S(r) of their member
regions. If that ranking is successful, the clusters
among the highest ranks will correspond to the primary
foreground object(s), since they are likely to contain
frequently appearing object-like regions.
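For concreteness, a simplified sketch of the partitioning and ranking, assuming a precomputed appearance/motion affinity matrix over the top-scored regions; here each leading eigenvector of the normalized graph Laplacian is thresholded to yield one binary inlier/outlier split, which approximates (but is not identical to) the paper's partitioning procedure:

    import numpy as np

    def rank_hypotheses(affinity, scores, n_partitions=5):
        # affinity: NxN region similarity matrix; scores: object-like S(r).
        d = affinity.sum(axis=1)
        lap = np.eye(len(d)) - affinity / np.sqrt(np.outer(d, d))
        _, vecs = np.linalg.eigh(lap)   # eigenvalues in ascending order
        hypotheses = []
        for v in vecs[:, 1:1 + n_partitions].T:  # skip the trivial eigenvector
            for side in (v > 0, v < 0):          # eigenvector sign is arbitrary
                inliers = np.flatnonzero(side)
                if inliers.size:
                    hypotheses.append((scores[inliers].mean(), inliers))
        hypotheses.sort(key=lambda h: -h[0])     # highest mean S(r) first
        return hypotheses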
(c) We build
foreground and background color Gaussian Mixture Models and
extract a set of shape exemplars for each hypothesis.
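A sketch of this step using scikit-learn's GaussianMixture; the color space and component count are assumptions rather than the paper's settings:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fit_color_models(frames, masks, n_components=5):
        # frames: list of HxWx3 arrays; masks: matching binary key-segment masks.
        fg = np.vstack([f[m] for f, m in zip(frames, masks)])   # fg pixels
        bg = np.vstack([f[~m] for f, m in zip(frames, masks)])  # bg pixels
        gmm_fg = GaussianMixture(n_components).fit(fg)
        gmm_bg = GaussianMixture(n_components).fit(bg)
        return gmm_fg, gmm_bg

gmm_fg.score_samples(pixels) then provides per-pixel foreground log-likelihoods for the data term of the MRF in step (d).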
(d) We next devise a
space-time Markov Random Field (MRF) that uses these
models to guide a pixel-wise segmentation for the entire
video. We compute per-pixel foreground/background color
likelihoods using the GMMs, and estimate foreground location
priors with the shape
exemplars. The main idea is to use the key-segments
detected across the sequence, projecting their shapes into
other frames via local shape matching. The spatial
extent of that projected shape then serves as a location
and scale prior in which we prefer to label pixels as
foreground. Since there are multiple key-segments and many
possible local shape matches, the resulting projected shapes
are aggregated, essentially “voting” for the
location/scale likelihoods.
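A minimal sketch of the voting step, assuming the local shape matching has already projected a set of key-segment masks (optionally with match-quality weights) into the current frame; the partial shape matching itself is outside this sketch:

    import numpy as np

    def shape_location_prior(frame_shape, projected_masks, weights=None):
        # Each projected mask "votes" for the pixels it covers; the
        # normalized vote map serves as a foreground location/scale prior.
        weights = weights if weights is not None else [1.0] * len(projected_masks)
        votes = np.zeros(frame_shape, dtype=float)
        for mask, w in zip(projected_masks, weights):
            votes += w * mask
        return votes / max(votes.max(), 1e-8)    # normalize to [0, 1]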
(e) We compute
the foreground object segmentation by minimizing the energy
function for the space-time MRF using graph cuts.
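A per-frame sketch of this minimization using the PyMaxflow package as a stand-in graph-cut solver; the paper's full space-time MRF also links temporally adjacent frames, which is omitted here for brevity:

    import numpy as np
    import maxflow  # the PyMaxflow package

    def segment_frame(fg_cost, bg_cost, smoothness=1.0):
        # fg_cost/bg_cost: HxW unary terms, e.g. negative log-likelihoods
        # combining the GMM color models with the shape location prior.
        g = maxflow.Graph[float]()
        nodes = g.add_grid_nodes(fg_cost.shape)
        g.add_grid_edges(nodes, smoothness)         # 4-connected pairwise term
        g.add_grid_tedges(nodes, fg_cost, bg_cost)  # terminal (data) edges
        g.maxflow()
        return g.get_grid_segments(nodes)           # True marks foreground pixels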
Results
Publication