Highlight detection has the potential to significantly ease video browsing, but existing methods often suffer from expensive supervision requirements, where human viewers must manually identify highlights in training videos. We propose a scalable unsupervised solution that exploits video duration as an implicit supervision signal. Our key insight is that video segments from shorter user-generated videos are more likely to be highlights than those from longer videos, since users tend to be more selective about the content when capturing shorter videos. Leveraging this insight, we introduce a novel ranking framework that prefers segments from shorter videos, while properly accounting for the inherent noise in the (unlabeled) training data. We use it to train a highlight detector with 10M hashtagged Instagram videos. In experiments on two challenging public video highlight detection benchmarks, our method substantially improves the state-of-the-art for unsupervised highlight detection.
Please check out our paper!
We introduce a novel framework for domain-specific highlight detection that addresses both of these shortcomings. Our key insight is that user-generated videos, such as those uploaded to Instagram or YouTube, carry a latent supervision signal relevant for highlight detection: their duration.
Video frames from three shorter user-generated video clips (top row) and one longer user-generated video (bottom row). Although all recordings capture the same event (surfing), video segments from shorter user-generated videos are more likely to be highlights than those from longer videos, since users tend to be more selective about their content. The height of the red curve indicates the highlight score over time. We leverage this natural phenomenon as a free latent supervision signal in large-scale Web video.
We hypothesize that users tend to be more selective about the content in the shorter videos they upload, whereas their longer videos may mix good and less interesting content. We therefore use video duration as a supervision signal. In particular, we propose to learn a scoring function that ranks segments from shorter videos higher than segments from longer videos. Since longer videos could also contain highlight moments, we devise the ranking model to handle noisy ranking data effectively.
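To make the objective concrete, here is a minimal NumPy sketch of a pairwise ranking loss in this spirit. It prefers segments from shorter videos via a hinge loss and, as a simple stand-in for the paper's noise handling, averages only the easiest fraction of pairs (discarding the highest-loss pairs, which are more likely to involve a highlight hiding in a long video). The function name, the margin, and the keep-fraction heuristic are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def duration_ranking_loss(scores_short, scores_long, margin=1.0, keep_frac=0.8):
    """Hinge ranking loss preferring segments from shorter videos.

    scores_short: predicted highlight scores for segments from short videos
    scores_long:  predicted highlight scores for segments from long videos

    Noise handling (illustrative heuristic, not the paper's exact model):
    since long videos may also contain highlights, only the keep_frac
    fraction of pairs with the *lowest* loss is averaged, dropping the
    hardest -- and likely mislabeled -- pairs.
    """
    # All (short, long) pairs: loss = max(0, margin - (s_short - s_long))
    diffs = scores_short[:, None] - scores_long[None, :]
    losses = np.maximum(0.0, margin - diffs).ravel()
    k = max(1, int(keep_frac * losses.size))
    kept = np.sort(losses)[:k]  # keep the easiest pairs, drop likely noise
    return kept.mean()
```

For example, if all short-video segments already score at least `margin` above all long-video segments, the loss is zero; pairs that violate the margin contribute linearly unless they fall in the dropped fraction.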
B. Xiong, Y. Kalantidis, D. Ghadiyaram and K. Grauman "Less is More: Learning Highlight Detection from Video Duration". In CVPR, 2019. [bibtex]
@InProceedings{highlights-cvpr2019,
author = {B. Xiong, Y. Kalantidis, D. Ghadiyaram and K. Grauman},
title = {Less is More: Learning Highlight Detection from Video Duration},
booktitle = {CVPR},
month = {June},
year = {2019}
}