Efficient Activity Detection with Max-Subgraph Search

Chao-Yeh Chen and Kristen Grauman
The University of Texas at Austin

We propose an efficient approach that unifies activity categorization with space-time localization. The main idea is to pose activity detection as a maximum-weight connected subgraph problem over a learned space-time graph constructed on the test sequence. We show this permits an efficient branch-and-cut solution for the best-scoring (and possibly non-cubically shaped) portion of the video for a given activity classifier. The upshot is a fast method that can evaluate a broader space of candidates than was previously practical, which we find often leads to more accurate detection. We demonstrate the proposed algorithm on three datasets, and show its speed and accuracy advantages over multiple existing search strategies.

Problem: how to detect human activity in continuous video?

Status quo approaches:

- Expensive: sliding window search.

- Restricted shapes: only cuboid-shaped detections are allowed.

- Lacks context: detection relies on tracking humans.

Our Idea:
Pose activity detection as a maximum-weight connected subgraph problem over a learned space-time graph.

Highlights:
- Obtain the result of an exhaustive sliding window search in much less time.
- Widen search scope to “non-cubic” detection volumes.
- Incorporate top-down knowledge of interactions of people/objects.

Approach

Overview

[Figure: overview of the approach]

(1) Define weighted nodes

Background: classifier training for feature weights:
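- Activity detection = determine the subvolume $S^*$ in a video sequence $Q$ that maximizes the classifier score $f$:

$$S^* = \arg\max_{S \subseteq Q} f(S)$$

- Learn a linear SVM from training data; the scoring function then has the form:

$$f(S) = \beta + \sum_i \alpha_i \langle h(S), h(S_i) \rangle$$

where $h(S)$ is the histogram of quantized features (BoF) inside $S$, $(\alpha_i, \beta)$ are the learned weights/bias, and $i$ indexes the training examples.

- We define $h^j(S_i)$ = $j$-th bin count for histogram $h(S_i)$; the $j$-th word is associated with a weight

$$w^j = \sum_i \alpha_i h^j(S_i)$$

for $j = 1, \dots, K$, where $K$ is the dimension of histogram $h$.

- Thus the classifier response for subvolume $S$ is:

$$f(S) = \beta + \sum_{j=1}^{K} w^j h^j(S) = \beta + \sum_{i \in S} w^{c_i}$$

where the last sum runs over the features $i$ inside $S$ and $c_i \in [1, K]$ is the word index of feature $i$. The response is therefore the bias plus the sum of the weights of the features inside the subvolume $S$.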
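
The decomposition of the linear SVM response into per-feature weights can be checked numerically. The following is a minimal sketch, not the authors' code: the training histograms, `alpha`, `beta`, and the feature word indices are all made-up values, and word indices are 0-based here rather than in [1, K].

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5                                        # vocabulary size (histogram dimension)
# Hypothetical training histograms h(S_i), one row per training example.
train_hists = rng.integers(0, 4, size=(3, K)).astype(float)
alpha = np.array([0.5, -1.0, 0.25])          # hypothetical learned weights
beta = -0.1                                  # hypothetical bias

# Per-word weight: w^j = sum_i alpha_i * h^j(S_i)
w = alpha @ train_hists

def score_from_histogram(h):
    """Score as bias plus weighted inner products with the training histograms."""
    return beta + float(alpha @ (train_hists @ h))

def score_from_features(word_ids):
    """Score as bias plus the sum of per-word weights of the features in S."""
    return beta + float(w[word_ids].sum())

# Hypothetical subvolume S containing four features with these word indices.
word_ids = np.array([0, 2, 2, 4])
h = np.bincount(word_ids, minlength=K).astype(float)

# Both forms give the same classifier response.
same = abs(score_from_histogram(h) - score_from_features(word_ids)) < 1e-9
```

Because the score is additive over features, the score of any candidate subvolume can be obtained by summing precomputed weights, which is what makes the graph formulation possible.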

Define weighted nodes:

Divide space-time volume into frame-level/space-time nodes. Compute the weight of nodes from the features inside them.
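
As an illustration of the node weighting, the sketch below bins features into frame-level nodes and sums their per-word weights. It is a sketch under assumptions: the function name `node_weights`, the fixed node length `frames_per_node`, and all input values are hypothetical. Space-time nodes would additionally bin each feature by its spatial position.

```python
import numpy as np

def node_weights(frames, word_ids, w, n_frames, frames_per_node):
    """Sum the per-word weights of the features that fall into each frame-level node."""
    n_nodes = -(-n_frames // frames_per_node)           # ceiling division
    node_of_feature = np.asarray(frames) // frames_per_node
    weights = np.zeros(n_nodes)
    # np.add.at accumulates correctly when several features share a node.
    np.add.at(weights, node_of_feature, np.asarray(w)[np.asarray(word_ids)])
    return weights

# Hypothetical per-word weights and five features (frame index, word index).
w = np.array([1.0, -2.0, 0.5])
frames = [0, 1, 5, 6, 11]
word_ids = [0, 1, 2, 0, 1]

weights = node_weights(frames, word_ids, w, n_frames=12, frames_per_node=4)
# Node 0 holds frames 0-3, node 1 holds frames 4-7, node 2 holds frames 8-11.
```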

(2) Link nodes

Two different link strategies:

1. Link each node only to its immediate neighbors, for frame-level nodes (T-Subgraph) or space-time nodes (ST-Subgraph).

2. Link each frame-level node to its first two neighbors (T-Jump-Subgraph), so that a connected subgraph can skip over a node.
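
The two link strategies can be sketched as edge lists. This is an illustrative sketch only: the function names are hypothetical, and the 6-connected grid used for space-time nodes is an assumption about what "neighbors" means in space-time.

```python
def temporal_edges(n_nodes, max_jump=1):
    """Edges between frame-level nodes at most `max_jump` steps apart.

    max_jump=1 gives a T-Subgraph-style chain;
    max_jump=2 gives T-Jump-Subgraph-style links to the first two neighbors.
    """
    return [(i, i + d)
            for i in range(n_nodes)
            for d in range(1, max_jump + 1)
            if i + d < n_nodes]

def space_time_edges(nt, ny, nx):
    """Edges of a grid of space-time nodes (t, y, x), linking adjacent cells.

    Assumes 6-connectivity: one step along t, y, or x.
    """
    edges = []
    for t in range(nt):
        for y in range(ny):
            for x in range(nx):
                for dt, dy, dx in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
                    t2, y2, x2 = t + dt, y + dy, x + dx
                    if t2 < nt and y2 < ny and x2 < nx:
                        edges.append(((t, y, x), (t2, y2, x2)))
    return edges

t_edges = temporal_edges(4)                 # chain over four frame-level nodes
t_jump_edges = temporal_edges(4, max_jump=2)
st_edges = space_time_edges(2, 2, 2)
```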

(3) Search for the maximum-weight graph

[Figure: example graph with weighted nodes]

- Transform max-weight subgraph problem into a prize-collecting Steiner tree problem.

- Solve it efficiently with the branch-and-cut method from [Ljubic et al. 2006].
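
The transformation in the first bullet can be made concrete with a constant-shift reduction. This is a sketch under the assumption that such a reduction is used; the page does not spell out the details, and the symbols $w_{\min}$, $p$, and $c$ are introduced here only for illustration. Assume at least one node weight is negative, and set

$$w_{\min} = \min_{v} w(v) < 0, \qquad p(v) = w(v) - w_{\min} \ge 0, \qquad c(e) = -w_{\min} > 0 .$$

For any tree $T = (V_T, E_T)$, which has $|E_T| = |V_T| - 1$ edges,

$$\sum_{v \in V_T} p(v) - \sum_{e \in E_T} c(e) = \sum_{v \in V_T} w(v) - |V_T|\, w_{\min} + (|V_T| - 1)\, w_{\min} = \sum_{v \in V_T} w(v) - w_{\min} .$$

Every connected subgraph has a spanning tree over the same nodes, so maximizing node prizes minus edge costs over trees selects the same nodes as maximizing total node weight over connected subgraphs. The two objectives differ only by the constant $-w_{\min}$.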

For example, the T-Subgraph solution is (4+2), while the T-Jump-Subgraph solution is (4+2+5).
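
To make the objective concrete, the sketch below finds the maximum-weight connected subgraph by brute-force enumeration. It is not the branch-and-cut method of [Ljubic et al. 2006] and is only practical for a handful of nodes. The node weights are hypothetical, chosen so that the two link strategies give sums of 4+2 and 4+2+5.

```python
from itertools import combinations

def is_connected(nodes, edges):
    """Check via depth-first search that `nodes` induce a connected subgraph."""
    nodes = set(nodes)
    adj = {v: set() for v in nodes}
    for a, b in edges:
        if a in nodes and b in nodes:
            adj[a].add(b)
            adj[b].add(a)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for u in adj[stack.pop()] - seen:
            seen.add(u)
            stack.append(u)
    return seen == nodes

def max_weight_connected_subgraph(weights, edges):
    """Exhaustively search all node subsets; exponential in the number of nodes."""
    best_nodes, best_score = (), float("-inf")
    for size in range(1, len(weights) + 1):
        for nodes in combinations(range(len(weights)), size):
            score = sum(weights[v] for v in nodes)
            if score > best_score and is_connected(nodes, edges):
                best_nodes, best_score = nodes, score
    return best_nodes, best_score

weights = [4, 2, -8, 5]                       # hypothetical node weights
t_edges = [(0, 1), (1, 2), (2, 3)]            # neighbors only
t_jump_edges = t_edges + [(0, 2), (1, 3)]     # first two neighbors

t_result = max_weight_connected_subgraph(weights, t_edges)
t_jump_result = max_weight_connected_subgraph(weights, t_jump_edges)
```

With neighbor-only links, the node with weight 5 can only be reached through the node with weight -8, so it is left out. The jump links let the subgraph bypass the negative node.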

(4) Back-project the selected nodes to obtain the detection result.
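
For frame-level nodes, back-projection amounts to mapping the selected node indices to frame ranges. The sketch below is a minimal illustration under assumptions: the function name `back_project` and the fixed `frames_per_node` are hypothetical. For space-time nodes the same idea would yield a set of spatial cells per time step instead of frame ranges.

```python
def back_project(selected_nodes, frames_per_node):
    """Map selected frame-level node indices to inclusive (start, end) frame ranges.

    Adjacent nodes are merged into one range; skipped nodes leave a gap.
    """
    segments = []
    for node in sorted(selected_nodes):
        start = node * frames_per_node
        end = start + frames_per_node - 1
        if segments and segments[-1][1] + 1 == start:
            segments[-1] = (segments[-1][0], end)   # extend the previous range
        else:
            segments.append((start, end))
    return segments

# Hypothetical selection: nodes 0, 1 and 3, with node 2 skipped.
segments = back_project([0, 1, 3], frames_per_node=4)
```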

Localized Space-Time Features

- Low-level features: histograms of oriented gradients (HoG) and histograms of optical flow (HoF), extracted at interest points or densely sampled.

- High-level feature: three steps to formulate a descriptor

a. Detect objects and people in each frame using object/pose detectors. Cluster poses into N (~10) person types.

b. For each detection, build a semantic descriptor based on its spatial/temporal neighbors.

c. Quantize the semantic descriptor into words via random forest.

Results

Properties of three datasets

Evaluation metrics
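- mean overlap accuracy: $\frac{|D \cap G|}{|D \cup G|}$ between the detection $D$ and the ground truth $G$, averaged over test sequences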

- detection time: measured in seconds
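
The overlap metric can be sketched for temporal detection as follows, assuming overlap is measured as intersection over union of the detected and ground-truth frame sets. The function names and the example intervals are hypothetical.

```python
def temporal_overlap(detected, truth):
    """Intersection over union of two collections of frame indices."""
    detected, truth = set(detected), set(truth)
    union = detected | truth
    return len(detected & truth) / len(union) if union else 0.0

def mean_overlap(pairs):
    """Average the overlap over (detection, ground truth) pairs."""
    return sum(temporal_overlap(d, g) for d, g in pairs) / len(pairs)

# Two hypothetical test sequences: one partial overlap, one exact match.
pairs = [
    (range(0, 8), range(4, 12)),   # 4 shared frames out of 12 in the union
    (range(0, 4), range(0, 4)),    # identical intervals
]
score = mean_overlap(pairs)
```

For space-time detection the same ratio would be computed over space-time volumes rather than frame sets.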

Temporal detection on UCF Concatenated/Hollywood uncropped/MSR Action

- Our T-Jump gives top accuracy for UCF and Hollywood, showing the advantage of ignoring noisy features.
- Our ST-Subgraph is most accurate in MSR since it can isolate those nodes that participate in the action.

- Our T-Subgraph is orders of magnitude faster than T-Sliding.

Temporal/Space-time detection on MSR Actions

[Figure: detection results on MSR Actions]

- Our ST-Subgraph is also most accurate in terms of space-time overlap accuracy.

- Top row: yellow boxes show the space-time detection from our ST-Subgraph. Note that the detection changes over time. Red boxes show the ground-truth annotation.

- Bottom row: green boxes show the top four detections from ST-Cube-Subvolume. The top three detections are trapped in a local maximum caused by small human motion.

Trade-offs in results

- Increased search scope boosts accuracy, but at a higher computational cost.

- The flexibility of the proposed method allows choosing the best speed/accuracy trade-off.

Subgraph search with high-level features

[Figure: qualitative detection results with high-level features]
- Qualitative detection results with our high-level features. Note that the red boxes are the space-time detection output, not annotations or human detection results.

- Our high-level features obtain higher accuracy in 5 out of 8 categories on the Hollywood uncropped dataset.

Conclusion

- Reduced computation time for detection vs. sliding window search.

- Flexible node structure offers more robust detection in noisy backgrounds.

- High-level descriptor shows promise for complex activities by incorporating semantic relationships between humans and objects in video.

Demo video

Code

Poster