Active Frame Selection for Label Propagation in Videos

Sudheendra Vijayanarasimhan and Kristen Grauman
University of Texas at Austin
[pdf] [code] [data]


1 Abstract

Manually segmenting and labeling objects in video sequences is quite tedious, yet such annotations are valuable for learning-based approaches to object and activity recognition. While automatic label propagation can help, existing methods simply propagate annotations from arbitrarily selected frames (e.g., the first one) and so may fail to best leverage the human effort invested. We define an active frame selection problem: select $ k$ frames for manual labeling, such that automatic pixel-level label propagation can proceed with minimal expected error. We propose a solution that directly ties a joint frame selection criterion to the predicted errors of a flow-based random field propagation model. It selects the set of $ k$ frames that together minimize the total mislabeling risk over the entire sequence. We derive an efficient dynamic programming solution to optimize the criterion. Further, we show how to automatically determine how many total frames $ k$ should be labeled in order to minimize the total manual effort spent labeling and correcting propagation errors. We demonstrate our method's clear advantages over several baselines, saving hours of human effort per video.

Figure 1: Goal: actively select $ k$ video frames for a human to label, so as to ensure minimal expected error across the entire sequence after automatic propagation.
\includegraphics[width=0.9\linewidth]{figs/concept.eps}

2 Approach

Our approach explicitly models "trackability" and uses it in the selection criterion to choose the $ k$ most informative frames. Actively selecting these frames minimizes the manual effort needed to label the entire video sequence. Once the selected frames are labeled by a human annotator, the labels are propagated to the remaining frames using optical-flow and appearance-based models.

Figure 2: Video Label Propagation
\includegraphics[width=0.7\linewidth]{figs/flow_labels3.eps} \includegraphics[width=0.9\linewidth]{figs/propagation2.eps}
(a) Relationship between flow and label transfer ($ \bm{w}$ is the optical flow field). (b) Schematic for the propagation process to label frame $ t$. Dark blue frames denote the selected frames. Dotted arrows denote optical flow or MRF-based propagation from adjacent frames.


Pixel Flow Propagation Method

The basic propagation method uses dense optical flow to track every pixel in both the forward and backward directions until it reaches the closest labeled frame on either side. Each pixel then takes its label from whichever labeled frame (left or right) has the smaller expected propagation error.
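As a concrete sketch of this transfer rule, under assumed inputs (`track_to` holds flow-composed correspondences and `err` the per-pixel expected-error maps defined below; both names are illustrative, not from the paper):

```python
import numpy as np

def propagate_frame(t, l_t, r_t, labels, track_to, err):
    """Label frame t from its nearest labeled neighbors l_t (left) and r_t (right).

    labels[j]   : (H, W) human-annotated label map of selected frame j
    track_to[j] : (H, W, 2) (row, col) position each pixel of frame t reaches
                  in frame j, obtained by composing dense flow frame-by-frame
    err[j]      : (H, W) expected propagation error P(p, j, t)
    """
    H, W = err[l_t].shape
    out = np.empty((H, W), dtype=labels[l_t].dtype)
    use_left = err[l_t] <= err[r_t]            # side with smaller expected error wins
    for side, mask in ((l_t, use_left), (r_t, ~use_left)):
        ys, xs = np.nonzero(mask)
        ty = np.clip(np.round(track_to[side][ys, xs, 0]).astype(int), 0, H - 1)
        tx = np.clip(np.round(track_to[side][ys, xs, 1]).astype(int), 0, W - 1)
        out[ys, xs] = labels[side][ty, tx]     # read the label at the tracked position
    return out
```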


Pixel Flow + MRF Propagation Method

A space-time Markov Random Field (MRF) enhances the basic flow-based label propagation method. Using motion-based data terms and the usual MRF smoothness constraints, it infers label maps that are smooth in both space and time. The object appearance model is learned from the labeled frames.
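The exact potentials are specified in the paper; schematically, such a model minimizes a space-time energy of the general form

$\displaystyle E(\bm{y}) = \sum_{\bm{p}} \phi_{\bm{p}}(y_{\bm{p}}) + \lambda_s \sum_{(\bm{p},\bm{q}) \in \mathcal{N}_s} \psi(y_{\bm{p}}, y_{\bm{q}}) + \lambda_t \sum_{(\bm{p},\bm{p}^\prime) \in \mathcal{N}_t} \psi(y_{\bm{p}}, y_{\bm{p}^\prime}),$

where $\bm{y}$ is the label map, the data term $\phi$ scores each label using the flow-propagated annotations and the learned appearance model, $\mathcal{N}_s$ are spatial neighbors, $\mathcal{N}_t$ are flow-linked temporal neighbors, and $\lambda_s, \lambda_t$ weight the smoothness terms. This generic form is for illustration only, not a verbatim reproduction of the paper's energy.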

Estimating Expected Label Propagation Error

We explicitly model the probability that a pixel is mistracked, by taking into account the errors which occur due to boundaries, occlusions, and when pixels change in appearance, or enter/leave the frame. We define the probability that pixel $ \bm{p}$ in frame $ t$ will be mislabeled if we were to obtain its label from frame $ t+1$ as:

$\displaystyle \mathrm{P}(\bm{p}, t+1, t) = 1 - \exp(-d(\bm{p},t)), \quad \mbox{where}$

$\displaystyle d(\bm{p},t) = \beta \left(d_{app}(\bm{p},t) + d_{mot}(\bm{p},t) + d_{occ}(\bm{p},t) + d_{out}(\bm{p},t)\right) \quad \mbox{($\beta$ is a scaling parameter)}$

The component distances reflect the expected forms of tracking error and are defined using the pixel appearance ($ \bm{f}$) and the optical flow field ($ \bm{w}$). Specifically,

$\displaystyle d_{app}(\bm{p},t) = \Vert \bm{f}_t(\bm{p}) - \bm{f}_{t+1}(\bm{p}^\prime)\Vert$    (Color Difference)

$\displaystyle d_{mot}(\bm{p},t) = \Vert\bm{w}_t(\bm{p}) - \bm{w}_{t+1}(\bm{p}^\prime)\Vert$    (Flow Difference)

$\displaystyle d_{occ}(\bm{p},t) = \frac{\Vert\bm{w}_t(\bm{p}) + \hat{\bm{w}}_{t+1}(\bm{p}^\prime)\Vert}{\Vert\bm{w}_t(\bm{p})\Vert+\Vert\hat{\bm{w}}_{t+1}(\bm{p}^\prime)\Vert}$    (Occlusion Detection)

$\displaystyle d_{out}(\bm{p},t) = \left\{ \begin{array}{rl} 0 &   \mbox{if $\bm{p}^\prime$ lies within the frame} \\ R &   \mbox{if $\bm{p}^\prime$ has left the frame} \end{array} \right.$    (Pixels leaving the frame)

Here $\bm{p}^\prime = \bm{p} + \bm{w}_t(\bm{p})$ is the position pixel $\bm{p}$ reaches in frame $t+1$, $\hat{\bm{w}}_{t+1}$ is the backward flow field of frame $t+1$, and $R$ is a constant penalty.
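A hedged NumPy sketch of the one-step model, combining the four distances into $\mathrm{P}(\bm{p}, t+1, t)$ (flow arrays are assumed in $(x,y)$ order, and the $\beta$ and $R$ defaults are placeholders, not the paper's settings):

```python
import numpy as np

def mislabel_prob(f_t, f_t1, w_t, w_t1, w_hat_t1, beta=1.0, R=10.0, eps=1e-8):
    """P(p, t+1, t): probability each pixel of frame t is mislabeled if it
    takes its label from frame t+1.

    f_t, f_t1 : (H, W, 3) appearance (e.g., color) of frames t and t+1
    w_t, w_t1 : (H, W, 2) forward flow of frames t and t+1, (x, y) order
    w_hat_t1  : (H, W, 2) backward flow of frame t+1 (toward frame t)
    """
    H, W = f_t.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W]
    # p' = p + w_t(p): position each pixel reaches in frame t+1.
    px, py = xs + w_t[..., 0], ys + w_t[..., 1]
    inside = (px >= 0) & (px < W) & (py >= 0) & (py < H)
    pxc = np.clip(np.round(px).astype(int), 0, W - 1)
    pyc = np.clip(np.round(py).astype(int), 0, H - 1)

    w_bwd = w_hat_t1[pyc, pxc]
    d_app = np.linalg.norm(f_t - f_t1[pyc, pxc], axis=-1)          # color difference
    d_mot = np.linalg.norm(w_t - w_t1[pyc, pxc], axis=-1)          # flow difference
    d_occ = (np.linalg.norm(w_t + w_bwd, axis=-1)                  # forward/backward
             / (np.linalg.norm(w_t, axis=-1)                       # flow consistency
                + np.linalg.norm(w_bwd, axis=-1) + eps))
    d_out = np.where(inside, 0.0, R)                               # left the frame

    d = beta * (d_app + d_mot + d_occ + d_out)
    return 1.0 - np.exp(-d)
```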

When labeled frame $r_t$ is more than one frame away from the current frame $t$ (where $r_t$ and $l_t$ denote the closest labeled frames to the right and left of $t$, respectively), the error is computed recursively (and analogously for $l_t$):

$\displaystyle P(\bm{p}, t+j, t) = P(\bm{p},t+j-1,t) + (1 -P(\bm{p}, t+j-1, t)) P(\bm{p}, t+j, t+j-1)$    
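Since a pixel stays correctly labeled only if every hop along its track succeeds, the recursion is equivalent to $1 - \prod_j (1 - P_j)$ over the one-step probabilities. A minimal sketch, under the simplifying assumption that the one-step maps have already been warped into frame $t$'s pixel grid:

```python
def chain_mislabel_prob(step_probs):
    """Compose one-step mislabeling probabilities along a track:
    P(p, t+j, t) = P(p, t+j-1, t) + (1 - P(p, t+j-1, t)) * P(p, t+j, t+j-1).

    step_probs: list of (H, W) arrays, step_probs[j] = P(p, t+j+1, t+j),
    assumed pre-warped into frame t's grid so plain arrays can be chained.
    """
    P = step_probs[0]
    for P_step in step_probs[1:]:
        # A pixel is correct after this hop only if it was correct before
        # AND the current hop succeeds.
        P = P + (1.0 - P) * P_step
    return P
```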

Active Frame Selection


Selection Criterion

To get a well-segmented video, there are two sources of manual effort cost:
  1. Cost of fully labeling a frame from scratch, denoted $ C_{\ell}$
  2. Cost of correcting errors by the automatic propagation, denoted $ C_c$.

We now define an optimization problem for the best set of frames from which to propagate. Our aim is to choose subset $ S^\ast$ of frames to minimize the total expected effort:

$\displaystyle S^\ast, k^\ast = \argmin_{S \subset F, k} \; k\, C_{\ell} + \mathbb{E}(S)\, C_c, \qquad \mbox{where} \qquad \mathbb{E}(S) = \sum_{t=1}^N \sum_{\bm{p} \in t} \min_{j \in \{l_t, r_t\}} \mathrm{P}(\bm{p}, j, t).$

Since choosing the source frame per pixel would add a factor of height $\times$ width to the computation time, we instead make the choice per frame. The cost can then be rewritten in terms of an $ N\times N$ matrix $ C$, where $ C(j,t) = \sum_{\bm{p} \in t} \mathrm{P}(\bm{p}, j,t)$:

$\displaystyle \mathbb{E}(S) = \sum_{t=1}^N \min_{j \in \{l_t, r_t\}} C(j, t).$    
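Given $C$, evaluating $\mathbb{E}(S)$ for any candidate set is a one-pass lookup. A small sketch (0-based indices; it assumes $C(t,t) = 0$, so selected frames contribute no error):

```python
import numpy as np

def expected_error(S, C):
    """E(S) = sum over frames t of min(C[l_t, t], C[r_t, t]), where l_t / r_t
    are the nearest selected frames at or before / after t."""
    N = C.shape[0]
    S = sorted(S)
    total = 0.0
    for t in range(N):
        left = max((j for j in S if j <= t), default=None)    # l_t
        right = min((j for j in S if j >= t), default=None)   # r_t
        total += min(C[j, t] for j in (left, right) if j is not None)
    return total
```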


Dynamic Programming Solution

Let $ T(i, b, n)$ be the optimal value of $ \mathbb{E}(\cdot)$ for selecting $ b$ frames from the first $ n$ frames, where $ i$ denotes the index of the $ b$-th selected frame.
Figure 3: Dynamic Programming based solution for active frame selection problem.
\includegraphics[width=0.7\linewidth, height=0.25\linewidth]{figs/cases0.eps} \includegraphics[width=0.4\linewidth, height=0.25\linewidth]{figs/cases1.eps} \includegraphics[width=0.4\linewidth, height=0.25\linewidth]{figs/cases2.eps} \includegraphics[width=0.4\linewidth, height=0.25\linewidth]{figs/cases3.eps}
(a) Sketch to illustrate the index notation (b) Case 1: 1-way $ \rightarrow$ end (c) Case 2: 1-way $ \rightarrow$ beginning (d) Case 3: Both ways

    
Case 1: 1-way $ \rightarrow$ end. $ n > i$

$\displaystyle T(i, b, n) = T(i, b, n-1) + C(i, n).$

    
Case 2: 1-way $ \rightarrow$ beginning. $ b = 1$ and $ n =i$

$\displaystyle T(i, b, n) = \sum_{j=1}^{i-1} C(i, j).$

    
Case 3: Both ways. $ b > 1$ and $ n =i$

$\displaystyle T(i, b, n) = \min_{j=b-1}^{i-1} \left[\, T(j, b-1, n-1) - \sum_{m=j+1}^{i-1} C(j,m) + \sum_{m=j+1}^{i-1} \min \left(C(j,m),C(i,m)\right) \right].$

Here $T(j, b-1, n-1)$ already charges frames $j+1,\dots,i-1$ the one-way cost of propagating from $j$ (via Case 1); the last two terms replace that charge with the cheaper of propagating from $j$ or from the new selection $i$.

Once $ T$ is computed, we obtain the optimal value for a given $ k$ (where $ i$ starts at $ k$ since the minimum selected index for $ k$ total frames is $ k$) as:

$\displaystyle \mathbb{E}(S^\ast) = \min_{i\in \{k,\dots,N\}} T(i, k, N)$    
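The following Python sketch transcribes the recurrence. It uses the fact, implied by Case 1, that $T(i,b,n) = T(i,b,i) + \sum_{m=i+1}^{n} C(i,m)$, which lets us track only $T(i,b,i)$ and fold Case 3 into a single minimum; this simplification and the backtracking bookkeeping for recovering $S^\ast$ are our own additions:

```python
import numpy as np

def select_frames(C, k):
    """DP for active frame selection (0-based frame indices).

    Tracks T2[i, b] := T(i, b, i).  Substituting Case 1 into Case 3 gives
        T2[i, b] = min_j  T2[j, b-1] + sum_{m=j+1}^{i-1} min(C[j, m], C[i, m]).

    C : (N, N) risk matrix, C[j, t] = expected mislabeled pixels in frame t
        when propagating from frame j (so C[t, t] = 0).
    Returns (E(S*), sorted indices of the k selected frames).
    Kept deliberately simple; the inner sums can be cached for speed.
    """
    N = C.shape[0]
    T2 = np.full((N, k + 1), np.inf)
    back = np.zeros((N, k + 1), dtype=int)

    for i in range(N):                       # Case 2 (b = 1): frames left of i
        T2[i, 1] = C[i, :i].sum()
    for b in range(2, k + 1):
        for i in range(b - 1, N):            # Case 3: frames between j and i
            for j in range(b - 2, i):
                between = np.minimum(C[j, j + 1:i], C[i, j + 1:i]).sum()
                val = T2[j, b - 1] + between
                if val < T2[i, b]:
                    T2[i, b], back[i, b] = val, j

    # E(S*) = min_i T(i, k, N): add the one-way tail cost past frame i (Case 1).
    totals = T2[:, k] + np.array([C[i, i + 1:].sum() for i in range(N)])
    i = int(np.argmin(totals))
    best, S = totals[i], [i]
    for b in range(k, 1, -1):                # backtrack through stored argmins
        i = back[i, b]
        S.append(i)
    return best, sorted(S)
```

Sweeping $k$ and keeping the value that minimizes $k\, C_{\ell} + \mathbb{E}(S^\ast_k)\, C_c$ gives the automatically selected $k^\ast$ used by the DP2-MRF variant below.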

3 Results

Datasets:
We use four publicly available datasets: (1) Camseq01: 101 frames of a driving scene captured from a moving camera. (2) Camvid_seq05: the first 3000 frames of the 0005VD sequence, depicting a driving scene. (3) Labelme_8126: 167 frames depicting a traffic signal. (4) Segtrack: 6 videos with moving objects.

Baselines:
We compare our results with the following baselines:

  1. Uniform-f: samples $ k$ frames uniformly starting with the first frame, and propagates labels in the forward direction only using our pixel flow method.
  2. Uniform: samples $ k$ frames uniformly and transfers labels in both directions. Each frame obtains its labels from the closest labeled frame.
  3. Keyframe: selects $ k$ representative frames by performing $ k$-way spectral clustering on global Gist features extracted from each frame, then requests a label for the frame in each cluster with the highest intra-cluster affinity.

Our Approach:
We evaluate three variants of our approach:

  1. DP-PF: selects $ k$ frames using our dynamic programming (DP) algorithm and propagates labels using our pixel flow approach.
  2. DP-MRF: selects $ k$ frames using our DP algorithm and propagates using our MRF-based formulation.
  3. DP2-MRF: automatically selects the number of frames and their indices by minimizing total annotation cost.

Error Prediction Model:

Figure 4: Comparison of ground truth label propagation error with the error predicted by our model ($ C$) for Labelme. Our error predictions follow the actual errors fairly closely.
\includegraphics[width=0.5\linewidth]{figs/labelme-predrisk.eps} \includegraphics[width=0.5\linewidth]{figs/labelme-actrisk.eps}

Fixed size selections:


Table 1: Results on the Labelme, Camseq, and Camvid datasets. Values are the average number of incorrect pixels per frame (in hundreds of pixels; the standard metric in prior work), for our method and the three baselines at varying values of $ k$. In all cases our active approach outperforms the baselines and yields significant savings in human annotation time (last row).
[Table: average pixel errors for each method and each $k$; the last row reports the time saved (in minutes) over the best baseline.]



Table 2: Results on the Segtrack dataset. Values denote pixel errors when selecting $k=5$ frames for annotation.
[Table: per-sequence pixel errors for the six Segtrack videos; the last row reports the time saved (in seconds) over the best baseline.]


Figure 5: Each method's accuracy plotted for all frames, for two values of $ k$ per sequence. Accuracy values are sorted from high to low per method for clarity. Our DP approaches (darker blue curves) achieve higher accuracy, especially on frames far from the labeled frames. Our method reduces annotation effort more than the baselines and can also predict the optimal number of frames to label, $ k^{*}$.
\includegraphics[width=0.5\linewidth]{figs/labelme-sortedacc-7.eps} \includegraphics[width=0.5\linewidth]{figs/labelme-sortedacc-15.eps} \includegraphics[width=0.5\linewidth]{figs/camseq01-sortedacc-7.eps}
\includegraphics[width=0.5\linewidth]{figs/camseq01-sortedacc-15.eps} \includegraphics[width=0.5\linewidth]{figs/camvid-sortedacc-15.eps} \includegraphics[width=0.5\linewidth]{figs/camvid-sortedacc-30.eps}

Our active frame selection approach outperforms the baselines in most cases and, assuming the cost of correcting errors is proportional to the number of mislabeled pixels, saves hours of manual effort per video.

Minimizing total annotation cost:

Figure 6: Total human annotation time required to label each sequence, as a function of the number of frames selected by each method. Darker lines are ours. This result clearly shows the trade-off between the two types of effort: initial frame labeling and correcting the algorithm's errors. For all methods the total annotation time has a sweet spot where the combined cost is minimized, and our method automatically predicts a number of frames to label that is close to the actual minimum.
\includegraphics[width=0.5\linewidth]{figs/labelme-totalcost.eps} \includegraphics[width=0.5\linewidth]{figs/camseq01-totalcost.eps}

Sample Frame Selections:

Figure 7: Frames selected by our approach for $ k = 7$ on the Camseq01 sequence. Our approach selects non-uniformly spaced frames that together capture, at high resolution, most of the objects appearing in the video (the two cars, bicyclists, pedestrians).
\includegraphics[width=0.3\linewidth]{figs/selections7-1.eps} \includegraphics[width=0.3\linewidth]{figs/selections7-2.eps} \includegraphics[width=0.3\linewidth]{figs/selections7-3.eps} \includegraphics[width=0.3\linewidth]{figs/selections7-4.eps} \includegraphics[width=0.3\linewidth]{figs/selections7-5.eps} \includegraphics[width=0.3\linewidth]{figs/selections7-6.eps} \includegraphics[width=0.3\linewidth]{figs/selections7-7.eps}

Conclusions

We introduced an active frame selection approach that chooses the $ k$ video frames whose manual annotations, once propagated, minimize the total expected labeling effort for the entire sequence. By tying the selection criterion directly to the predicted errors of a flow-based propagation model and optimizing it with dynamic programming, the method outperforms uniform and keyframe baselines on four datasets and can save hours of human effort per video.

Publication

Active Frame Selection for Label Propagation in Videos.
S. Vijayanarasimhan and K. Grauman. In ECCV, 2012.
[pdf] [code] [data]


This research is supported in part by the ONR Young Investigator Program, grant #N00014-12-1-0754.