Efficient Activity Detection with Max-Subgraph Search

Chao-Yeh Chen and Kristen Grauman
The University of Texas at Austin

We propose an efficient approach that unifies activity categorization with space-time localization. The main idea is to pose activity detection as a maximum-weight connected subgraph problem over a learned space-time graph constructed on the test sequence. We show this permits an efficient branch-and-cut solution for the best-scoring—and possibly non-cubically shaped—portion of the video for a given activity classifier. The upshot is a fast method that can evaluate a broader space of candidates than was previously practical, which we find often leads to more accurate detection. We demonstrate the proposed algorithm on three datasets, and show its speed and accuracy advantages over multiple existing search strategies.


Problem: how to detect human activity in continuous video?

Status quo approaches:

- Expensive: sliding window search.


- Restricted shapes: only allow cuboid-shaped detections.


- Lacks context: detection by tracking humans alone.



Our Idea:
Pose activity detection as a maximum-weight connected subgraph problem over a learned space-time graph.


Highlights:

- Obtains the result of an exhaustive sliding window search in much less time.
- Widens the search scope to “non-cubic” detection volumes.
- Incorporates top-down knowledge of interactions between people and objects.


Approach

 

Overview


 

(1) Define weighted nodes

Background: classifier training for feature weights:
- Activity detection = determine the subvolume S* in a video sequence Q that maximizes the classifier score:

  $S^* = \operatorname{argmax}_{S \subseteq Q} f(S)$

- Learn a linear SVM from training data; the scoring function then has the form:

  $f(h) = \sum_i \alpha_i \, h_i^{\top} h + b$

  where h is the histogram of quantized features (BoF), $(\alpha_i, b)$ are the learned weights/bias, and i indexes the training examples.

- Define $h^{(j)}$ = the jth bin count of histogram h; the jth word is then associated with the weight

  $w_j = \sum_i \alpha_i \, h_i^{(j)}$, for j = 1, …, K,

  where K is the dimension of the histogram h.

- Thus the classifier response for subvolume S is:

  $f(S) = \sum_{x_i \in S} w_{c_i}$

  where $c_i \in [1, K]$ is the word of feature $x_i$: the score is simply the sum of the weights of the features inside S (the bias b is constant, so it does not affect the maximization).

 Define weighted nodes:

Divide the space-time volume into frame-level or space-time nodes, and compute each node's weight from the features inside it (see the sketch below).
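A minimal sketch of the node-weight computation, assuming a trained linear SVM: the support-vector coefficients yield one weight per visual word, and a node's weight sums the weights of the quantized features falling inside it (the helper names are illustrative, and word ids are 0-based here):

```python
import numpy as np

def word_weights(alphas, sv_histograms):
    """w_j = sum_i alpha_i * h_i^(j): one weight per visual word."""
    return np.asarray(alphas, float) @ np.asarray(sv_histograms, float)

def node_weight(feature_words, w):
    """Weight of one node: sum of the weights of the words inside it."""
    return float(sum(w[c] for c in feature_words))

# Toy example: 2 support vectors over a K = 4 word vocabulary.
w = word_weights([0.5, -1.0], [[1, 0, 2, 0],
                               [0, 3, 0, 1]])
print(w)                          # [ 0.5 -3.   1.  -1. ]
print(node_weight([0, 2, 2], w))  # 0.5 + 1.0 + 1.0 = 2.5
```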

(2) Link nodes


Two different linking strategies (see the sketch after this list):

1. Adjacent neighbors only, for frame-level nodes (T-Subgraph) or space-time nodes (ST-Subgraph).

2. First two temporal neighbors for frame-level nodes (T-Jump-Subgraph), which lets a detection skip over noisy frames.
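A minimal sketch of both strategies for frame-level nodes, assuming nodes indexed 0..T-1 in temporal order (the helper is hypothetical, for illustration only):

```python
def link_nodes(num_frames, jump=False):
    """Edge list for T-Subgraph (jump=False) or T-Jump-Subgraph
    (jump=True); each edge joins node indices that may be connected."""
    edges = []
    for t in range(num_frames - 1):
        edges.append((t, t + 1))           # adjacent neighbor
        if jump and t + 2 < num_frames:
            edges.append((t, t + 2))       # skip one frame (T-Jump)
    return edges

print(link_nodes(5))             # [(0, 1), (1, 2), (2, 3), (3, 4)]
print(link_nodes(5, jump=True))  # adds (0, 2), (1, 3), (2, 4)
```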

 

(3) Search for the maximum-weight subgraph


- Transform the max-weight connected subgraph problem into a prize-collecting Steiner tree problem.

- Solve it efficiently with the branch-and-cut method of [Ljubic et al. 2006].

 

For example, if the chain contains adjacent nodes of weight 4 and 2, with a weight-5 node separated from them by a negative node, the T-Subgraph solution scores (4+2), while the T-Jump-Subgraph solution can also include the weight-5 node to score (4+2+5).
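In the special case of a temporal chain with only adjacent links (T-Subgraph), the maximum-weight connected subgraph is just the maximum-sum contiguous run of node weights, which a simple linear scan recovers; a minimal sketch of that special case (the branch-and-cut solver handles the general graphs above):

```python
def max_weight_interval(weights):
    """Kadane-style scan: best contiguous run of frame-node weights.
    Returns (total_weight, start_frame, end_frame), inclusive."""
    best, best_lo, best_hi = float("-inf"), 0, 0
    cur, cur_lo = 0.0, 0
    for i, w in enumerate(weights):
        if cur <= 0:               # a fresh run beats extending a
            cur, cur_lo = w, i     # non-positive prefix
        else:
            cur += w
        if cur > best:
            best, best_lo, best_hi = cur, cur_lo, i
    return best, best_lo, best_hi

# The toy chain from the example above: nodes 2-3 give the 4+2 run.
print(max_weight_interval([1, -5, 4, 2, -3]))   # (6, 2, 3)
```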

(4) Back-project the selected nodes to obtain the detection result.
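Back-projection itself is mechanical; a minimal sketch for frame-level nodes, assuming each selected node carries its (first_frame, last_frame) extent (an illustrative helper, not the authors' code):

```python
def back_project(selected_nodes):
    """Map selected frame-level nodes to the detected frames.
    With T-Jump links the selection may skip frames, so we return
    the explicit frame list rather than a single interval."""
    frames = set()
    for t0, t1 in selected_nodes:          # inclusive frame extents
        frames.update(range(t0, t1 + 1))
    return sorted(frames)

print(back_project([(10, 12), (15, 16)]))  # [10, 11, 12, 15, 16]
```

For space-time nodes the same idea applies per frame, yielding a bounding box that can change over time.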


 

 

Localized Space-Time Features

 

- Low-level feature: histograms of oriented gradients (HoG) + histograms of optical flow (HoF), extracted at interest points or densely sampled (see the sketch below).
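As a rough illustration of the flow half of the descriptor, the sketch below bins dense optical-flow orientations into a magnitude-weighted histogram, assuming OpenCV and NumPy; it is illustrative only, not the exact HoF implementation used in the paper:

```python
import cv2
import numpy as np

def hof_descriptor(prev_gray, next_gray, bins=8):
    """Magnitude-weighted histogram of dense optical-flow orientations."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hist, _ = np.histogram(ang, bins=bins, range=(0, 2 * np.pi),
                           weights=mag)
    return hist / (hist.sum() + 1e-8)      # L1-normalize

# Toy demo: a bright square shifted 2 px to the right between frames.
a = np.zeros((64, 64), np.uint8); a[20:40, 20:40] = 255
b = np.zeros((64, 64), np.uint8); b[20:40, 22:42] = 255
print(hof_descriptor(a, b).round(2))
```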
 

 

- High-level feature: three steps to build the descriptor (see the sketch after this list)

a. Detect objects and people in each frame using object/pose detectors. Cluster poses into N (≈10) person types.

b. For each detection, build a semantic descriptor based on its spatial/temporal neighbors.

c. Quantize the semantic descriptor into words via a random forest.
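A minimal sketch of step (c), assuming scikit-learn: RandomTreesEmbedding partitions descriptor space with randomized trees, and each leaf index serves as a discrete word (an illustration of the idea, not the authors' exact quantizer):

```python
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding

rng = np.random.default_rng(0)
descriptors = rng.normal(size=(500, 16))     # toy semantic descriptors

# One shallow tree for clarity; with several trees, a descriptor's
# word would be the tuple of leaf indices it reaches.
forest = RandomTreesEmbedding(n_estimators=1, max_depth=5,
                              random_state=0).fit(descriptors)
words = forest.apply(descriptors)[:, 0]      # leaf index per descriptor
print(len(np.unique(words)), "distinct words")
```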


 

Results

Properties of three datasets

Evaluation metrics

 - mean overlap accuracy: average over test sequences of $\frac{|S^* \cap G|}{|S^* \cup G|}$, where G is the ground-truth subvolume (see the sketch below).

 - detection time: seconds
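A minimal sketch of the overlap computation for a single temporal detection, assuming inclusive frame intervals:

```python
def temporal_overlap(det, gt):
    """Intersection-over-union of two inclusive frame intervals."""
    (d0, d1), (g0, g1) = det, gt
    inter = max(0, min(d1, g1) - max(d0, g0) + 1)
    union = (d1 - d0 + 1) + (g1 - g0 + 1) - inter
    return inter / union

print(temporal_overlap((10, 19), (15, 24)))   # 5 / 15 ≈ 0.33
```

Mean overlap accuracy then averages this score over the test set.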

Temporal detection on UCF Concatenated/Hollywood uncropped/MSR Action


- Our T-Jump gives top accuracy on UCF and Hollywood, showing the advantage of ignoring noisy features.
- Our ST-Subgraph is most accurate on MSR since it can isolate the nodes that participate in the action.


- Our T-Subgraph is orders of magnitude faster than T-Sliding.

Temporal/Space-time detection on MSR Actions


- Our ST-Subgraph is also most accurate in terms of space-time overlap accuracy.

- Top row: yellow boxes → the space-time detections from our ST-Subgraph. Note the detection changes with time. Red boxes: ground-truth annotation.

- Bottom row: green boxes → the top four detections from ST-Cube-Subvolume. Its top three detections are trapped in a local maximum caused by the person's small motions.

Trade-offs in results

- Increased search scope boosts accuracy, though it costs more computation.

- The flexibility of the proposed method allows the best speed/accuracy trade-off.

Subgraph search with high-level features

- Qualitative detection results from our high-level features. Note that the red boxes are the space-time detection output, not annotations or human-detection results.

- Our high-level features obtain higher accuracy in 5 out of 8 categories on the uncropped Hollywood dataset.


Conclusion

- Reduced computation time for detection vs. sliding window search.

- Flexible node structure offers more robust detection in noisy backgrounds.

- High-level descriptor shows promise for complex activities by incorporating semantic relationships between humans and objects in video.


Demo video


Code


Poster