Discovering Important People and Objects for Egocentric Video Summarization
Yong Jae Lee, Joydeep Ghosh, and Kristen Grauman
We present a video summarization approach for egocentric or "wearable" camera data. Given hours of video, the proposed method produces a compact storyboard summary of the camera wearer's day. In contrast to traditional keyframe selection techniques, the resulting summary focuses on the most important objects and people with which the camera wearer interacts. To accomplish this, we develop region cues indicative of high-level saliency in egocentric video---such as the nearness to hands, gaze, and frequency of occurrence---and learn a regressor to predict the relative importance of any new region based on these cues. Using these predictions and a simple form of temporal event detection, our method selects frames for the storyboard that reflect the key object-driven happenings. Critically, the approach is neither camera-wearer-specific nor object-specific; that means the learned importance metric need not be trained for a given user or context, and it can predict the importance of objects and people that have never been seen previously. Our results with 17 hours of egocentric data show the method's promise relative to existing techniques for saliency and summarization.
Our goal is to create a storyboard
summary of a person’s day that is driven by the important
people and objects. We define importance in the scope of
egocentric video: important
things are those with which the camera wearer interacts.
There are four main steps to our approach: (1) using novel egocentric saliency cues to train a category-independent regression model that predicts how likely an image region is to belong to an important person or object; (2) partitioning the video into temporal events; then, for each event, (3) scoring each region’s importance using the regressor; and (4) selecting representative keyframes for the storyboard based on the predicted important people and objects.
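The four steps above can be sketched end to end as follows. Everything here is illustrative: toy frames and regions, a fixed linear score standing in for the learned regressor, and fixed-length chunks standing in for the appearance-based event detection described below.

```python
def predict_importance(region):
    # Stand-in for the learned regressor: a fixed linear score over two cues.
    return 0.5 * region["hand_nearness"] + 0.5 * region["frequency"]

def segment_into_events(frames, event_len=3):
    # Stand-in for appearance-based event segmentation: fixed-length chunks.
    return [frames[i:i + event_len] for i in range(0, len(frames), event_len)]

def summarize(frames):
    storyboard = []
    for event in segment_into_events(frames):              # step (2)
        scored = [(predict_importance(r), idx)             # steps (1) and (3)
                  for idx, frame in enumerate(event)
                  for r in frame["regions"]]
        _, best_idx = max(scored)                          # step (4): keep the
        storyboard.append(event[best_idx]["name"])         # highest-scoring frame
    return storyboard
```

The real system scores segmentation-derived regions and selects keyframes per discovered person/object, but the control flow follows this shape.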
Egocentric video data
We use the Looxcie wearable camera,
which captures video at 15 fps and 320 × 480 resolution.
We collected 10 videos, each 3-5 hours in length.
Four subjects wore the camera for us: one undergraduate
student, two graduate students, and one office worker. The
videos capture a variety of activities such as eating,
shopping, attending a lecture, driving, and cooking.
Annotating important
regions in training video
In order to learn meaningful egocentric properties without overfitting to any particular category, we crowd-source large amounts of annotations using Amazon’s Mechanical Turk (MTurk). For egocentric videos, an object must be seen in the context of the camera wearer’s activity to properly gauge its importance. We carefully design two annotation tasks to capture this aspect. In the first task, we ask workers to watch a three-minute accelerated video and to describe in text what they perceive to be essential people or objects necessary to create a summary of the video. In the second task, we display uniformly sampled frames from the video and their corresponding text descriptions obtained from the first task, and ask workers to draw polygons around any described person or object. See the figure above for example annotations.
Learning region importance
in egocentric video
Given a video, we first generate candidate regions for each frame using the segmentation method of [Carreira and Sminchisescu, CVPR 2010]. We generate roughly 800 regions per frame. For each region, we compute a set of candidate features that could be useful to describe its importance. Since the video is captured by an active participant, we specifically want to exploit egocentric properties such as whether the object/person is interacting with the camera wearer, whether it is the focus of the wearer’s gaze, and whether it frequently appears. The egocentric features are shown above. In addition, we aim to capture high-level saliency cues—such as an object’s motion and appearance, or the likelihood of being a human face—and generic region properties shared across categories, such as size or location. Using all of these features, we train a regression model that can predict a region’s importance.
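A minimal sketch of the training step, with ordinary least squares standing in for the regression model and a synthetic four-dimensional feature vector (nearness, gaze, frequency, size) in place of the full egocentric/saliency/region feature set; the training targets mimic importance labels derived from the annotations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: each row is a region's feature vector
# [hand nearness, gaze, frequency, size], each target its importance label.
X = rng.random((200, 4))
w_true = np.array([0.6, 0.3, 0.1, 0.0])     # synthetic "ground-truth" weights
y = X @ w_true + 0.01 * rng.normal(size=200)

# Fit weights by least squares, with a bias term appended.
A = np.hstack([X, np.ones((200, 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

def importance(features):
    """Predict the relative importance of a new, never-before-seen region."""
    return float(np.append(features, 1.0) @ w)
```

Because the model operates on generic region cues rather than category labels, it can score objects and people absent from the training data, which is the property the paper emphasizes.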
Segmenting the video into
events
We first partition the video temporally into events. We cluster scenes in such a way that frames with similar global appearance can be grouped together even when there are a few unrelated frames (“gaps”) between them. Specifically, we perform complete-link agglomerative clustering with a distance matrix that reflects color similarity between each pair of frames weighted by temporal proximity.
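The clustering step can be sketched with SciPy's hierarchical clustering. The temporal weighting below (linearly inflating appearance distance with frame gap) and all constants are illustrative choices; the paper only specifies color similarity weighted by temporal proximity:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

def segment_events(color_hists, tau=30.0, n_events=2):
    """Complete-link clustering of frames into events.

    color_hists: (n_frames, d) array of per-frame color descriptors.
    Returns one event label per frame; frames far apart in time get
    inflated distances, so temporally close frames prefer to merge.
    """
    n = len(color_hists)
    color_dist = squareform(pdist(color_hists))        # pairwise appearance distance
    gap = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    dist = color_dist * (1.0 + gap / tau)              # temporal weighting
    Z = linkage(squareform(dist, checks=False), method='complete')
    return fcluster(Z, t=n_events, criterion='maxclust')
```

Because the merge criterion is appearance-driven, a few unrelated frames between two similar stretches do not force an event boundary, matching the "gaps" behavior described above.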
Discovering an event’s key
people and objects
Given an event, we
first score each region in each frame using our
regressor. We take the highest-scored regions and group
instances of the same person or object together using a
factorization approach [Perona and Freeman, ECCV 1998].
For each group, we select the region with the highest score as
its representative. Finally, we create a storyboard
visual summary of the video. We display the event
boundaries and frames of the selected important people and
objects. We automatically adjust the compactness of the
summary with selection criteria on the region importance
scores and number of events, as we illustrate in our results.
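The representative-selection step can be sketched as follows; plain group labels stand in here for the factorization-based grouping, and the dict layout is illustrative:

```python
def pick_representatives(regions):
    """Keep the highest-scoring region per group as its representative.

    regions: list of dicts with 'group' (identity label from grouping),
    'score' (predicted importance), and 'frame' (source frame index).
    """
    best = {}
    for r in regions:
        g = r["group"]
        if g not in best or r["score"] > best[g]["score"]:
            best[g] = r
    # Order representatives by importance so storyboard compactness can be
    # adjusted by thresholding or truncating this list.
    return sorted(best.values(), key=lambda r: -r["score"])
```

Truncating the returned list by score is one simple way to realize the adjustable-compactness behavior described above.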