Highlight detection has the potential to significantly ease video browsing, but existing methods often suffer from expensive supervision requirements, where human viewers must manually identify highlights in training videos. We propose a scalable unsupervised solution that exploits video duration as an implicit supervision signal. Our key insight is that video segments from shorter user-generated videos are more likely to be highlights than those from longer videos, since users tend to be more selective about the content when capturing shorter videos. Leveraging this insight, we introduce a novel ranking framework that prefers segments from shorter videos, while properly accounting for the inherent noise in the (unlabeled) training data. We use it to train a highlight detector with 10M hashtagged Instagram videos. In experiments on two challenging public video highlight detection benchmarks, our method substantially improves the state-of-the-art for unsupervised highlight detection.
Please check out our paper!
We introduce a novel framework for domain-specific highlight detection that addresses both of these shortcomings. Our key insight is that user-generated videos, such as those uploaded to Instagram or YouTube, carry a latent supervision signal relevant for highlight detection: their duration.
Video frames from three shorter user-generated video clips (top row) and one longer user-generated video (bottom row). Although all recordings capture the same event (surfing), video segments from shorter user-generated videos are more likely to be highlights than those from longer videos, since users tend to be more selective about their content. The height of the red curve indicates the highlight score over time. We leverage this natural phenomenon as a free latent supervision signal in large-scale Web video.
We hypothesize that users tend to be more selective about the content in the shorter videos they upload, whereas their longer videos may mix good and less interesting content. We therefore use video duration as a supervision signal. In particular, we propose to learn a scoring function that ranks segments from shorter videos higher than segments from longer videos. Since longer videos could also contain highlight moments, we devise the ranking model to handle noisy ranking data effectively.
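To make the objective concrete, here is a minimal NumPy sketch of a pairwise ranking loss in this spirit. It prefers segments from shorter videos via a hinge loss and, as a simple stand-in for the paper's noise handling, averages only the easiest fraction of pairs (discarding the highest-loss pairs, which are more likely to involve a highlight hiding in a long video). The function name, the margin, and the keep-fraction heuristic are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def duration_ranking_loss(scores_short, scores_long, margin=1.0, keep_frac=0.8):
    """Hinge ranking loss preferring segments from shorter videos.

    scores_short: predicted highlight scores for segments from short videos
    scores_long:  predicted highlight scores for segments from long videos

    Noise handling (illustrative heuristic, not the paper's exact model):
    since long videos may also contain highlights, only the keep_frac
    fraction of pairs with the *lowest* loss is averaged, dropping the
    hardest -- and likely mislabeled -- pairs.
    """
    # All (short, long) pairs: loss = max(0, margin - (s_short - s_long))
    diffs = scores_short[:, None] - scores_long[None, :]
    losses = np.maximum(0.0, margin - diffs).ravel()
    k = max(1, int(keep_frac * losses.size))
    kept = np.sort(losses)[:k]  # keep the easiest pairs, drop likely noise
    return kept.mean()
```

For example, if all short-video segments already score at least `margin` above all long-video segments, the loss is zero; pairs that violate the margin contribute linearly unless they fall in the dropped fraction.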
B. Xiong, Y. Kalantidis, D. Ghadiyaram and K. Grauman "Less is More: Learning Highlight Detection from Video Duration". In CVPR, 2019. [bibtex]
@InProceedings{highlights-cvpr2019,
author = {B. Xiong, Y. Kalantidis, D. Ghadiyaram and K. Grauman},
title = {Less is More: Learning Highlight Detection from Video Duration},
booktitle = {CVPR},
month = {June},
year = {2019}
}