Abstract

Searching long egocentric videos with natural language queries (NLQ) has compelling applications in augmented reality and robotics, where a fluid index into everything that a person (agent) has seen before could augment human memory and surface relevant information on demand. However, the structured nature of the learning problem (free-form text query inputs, localized video temporal window outputs) and its needle-in-a-haystack nature make it both technically challenging and expensive to supervise. We introduce Narrations-as-Queries (NaQ), a data augmentation strategy that transforms standard video-text narrations into training data for a video query localization model. Validating our idea on the Ego4D benchmark, we find it has tremendous impact in practice. NaQ improves multiple top models by substantial margins (even doubling their accuracy), and yields the very best results to date on the Ego4D NLQ challenge, soundly outperforming all challenge winners in the CVPR and ECCV 2022 competitions and topping the current public leaderboard. Beyond achieving the state of the art for NLQ, we also demonstrate unique properties of our approach, such as the ability to perform zero-shot and few-shot NLQ, and improved performance on queries about long-tail object categories.

[PDF] [Code]

Citation

@inproceedings{ramakrishnan2023naq,
    author       = {Ramakrishnan, Santhosh K. and Al-Halah, Ziad and Grauman, Kristen},
    booktitle    = {Computer Vision and Pattern Recognition (CVPR), 2023 IEEE Conference on},
    title        = {NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory},
    year         = {2023},
    organization = {IEEE},
}

Motivation

Figure: NLQ annotations vs. narration annotations.

Existing NLQ methods are relatively starved for data. Each NLQ annotation is a ⟨video, query, response⟩ tuple, where the query is a natural-language question and the response is a temporal window (t_start, t_end). Each query is expensive to annotate: it requires posing a creative question, scrolling through the long video, and carefully marking the temporal window corresponding to the response. As a result, NLQ annotations are sparse and cover only a small fraction of all the video content.

In Narrations-as-Queries (NaQ), we overcome this limitation by leveraging narrations as queries to supervise an Episodic Memory system for natural language queries (NLQ). Narrations are timestamped, play-by-play descriptions of the camera-wearer's activity, and each narration annotation consists of a ⟨narration, timestamp⟩ tuple. Narrations are far easier to annotate, since doing so only requires pausing the video at regular intervals and writing down what happened. They are therefore densely annotated throughout the video and available at a much larger scale.
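
To make the two annotation formats concrete, here is a minimal sketch of how they could be represented in Python; the class and field names are illustrative only and do not reflect the exact Ego4D annotation schema.

from dataclasses import dataclass

@dataclass
class NLQAnnotation:
    """One NLQ sample: a natural-language question paired with the
    temporal window in the video that answers it."""
    video_uid: str   # which (long) egocentric video
    query: str       # e.g., "Where did I put the keys?"
    t_start: float   # response window start, in seconds
    t_end: float     # response window end, in seconds

@dataclass
class NarrationAnnotation:
    """One narration: a timestamped play-by-play description of the
    camera-wearer's activity."""
    video_uid: str
    narration: str   # e.g., "#C C picks up the keys from the table."
    timestamp: float # when the described activity occurs, in seconds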

Our proposed NaQ is a simple yet effective approach that leverages these large-scale narrations to augment NLQ training, as sketched below. Next, we introduce our interface to visualize NLQ predictions from our model.
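
The conversion at the heart of NaQ can be sketched as follows, reusing the dataclasses above. For illustration we assume a fixed window of ±half_width seconds around each narration timestamp; the paper uses a more careful windowing scheme (see the paper for details), so treat this choice as a simplifying assumption.

from typing import List

def naq_augment(narrations: List[NarrationAnnotation],
                half_width: float = 2.5) -> List[NLQAnnotation]:
    """Turn timestamped narrations into NLQ-style training samples:
    the narration text becomes the query, and a window around its
    timestamp becomes the response."""
    samples = []
    for n in narrations:
        samples.append(NLQAnnotation(
            video_uid=n.video_uid,
            query=n.narration,                           # narration as query
            t_start=max(0.0, n.timestamp - half_width),  # assumed fixed window
            t_end=n.timestamp + half_width,
        ))
    return samples

The resulting narration-derived samples are then simply mixed with the standard NLQ annotations during training; NaQ is a data augmentation strategy and requires no changes to the underlying localization model.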


NLQ visualization interface

We briefly introduce our visualization interface through an NLQ example below. It consists of the query at the top, the egocentric video in the middle, and the ground-truth (GT) / predicted (Pred.) temporal windows (in seconds) at the bottom. For the predictions, we additionally report the IoU between the GT and Pred. windows, where IoU >= 0.5 counts as a success. Note how short the GT response (9 seconds) can be relative to the complete video (480 seconds). We fast-forward through most of the video and slow down to highlight frames from the ground-truth (green) or predicted (red / blue) temporal windows.
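
For reference, the temporal IoU reported in the interface can be computed with a small utility like the one below (a self-contained sketch, not tied to any particular codebase):

def temporal_iou(gt, pred):
    """IoU between two temporal windows, each given as (start, end) in seconds."""
    inter = max(0.0, min(gt[1], pred[1]) - max(gt[0], pred[0]))
    union = (gt[1] - gt[0]) + (pred[1] - pred[0]) - inter
    return inter / union if union > 0 else 0.0

# Example: a 9-second GT window and a prediction shifted by 3 seconds.
print(temporal_iou((100.0, 109.0), (103.0, 112.0)))  # 6 / 12 = 0.5 -> counts as a success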

For brevity, we compress each example into the following compact interface. We display the query at the top. We then show video snippets corresponding to the ground-truth temporal window, our inferred temporal window with NaQ, and the baseline's inferred temporal window without NaQ. Finally, we plot the temporal windows to show their locations and extents within the full video, along with the IoUs of the predicted windows.

Using this interface, we next visualize several examples to highlight the strengths and weaknesses of NaQ.


NaQ benefits performance on most query templates

We visualize success and failure cases of NaQ on different query templates (template shown below the query). Examples 1 to 6 are success cases, where training with NaQ (ReLER* + NaQ) leads to successful predictions (i.e., IoU >= 0.5) while the baseline ReLER* fails (i.e., IoU < 0.5). Examples 7 to 9 are failure cases of NaQ.

1
Query: Who did I interact with when I played with the dog for the second time in the living room?
Template: Who did I interact with when I did activity X?
Ground-truth ReLER* + NaQ ReLER*
2
Query: What device did I pick up from the saw table?
Template: What X did I Y?
Ground-truth ReLER* + NaQ ReLER*
3
Query: Where did I put the pieces of meat?
Template: Where did I put X?
Ground-truth ReLER* + NaQ ReLER*
4
Query: How many funnels are on the shelf?
Template: How many X's?
Ground-truth ReLER* + NaQ ReLER*
5
Query: Where was the brush before I picked it up?
Template: Where is object X before / after event Y?
Ground-truth ReLER* + NaQ ReLER*
6
Query: What did I add into the plate?
Template: What did I put in X?
Ground-truth ReLER* + NaQ ReLER*
7
Query: Where did I pick the cling wrap?
Template: Where did I put X?
NaQ failure: IoU with the ground truth is below 0.5
Ground-truth ReLER* + NaQ ReLER*
8
Query: Where was the pink flower?
Template: Where is object X?
NaQ failure: confuses flower design on paper plate with pink flower
Ground-truth ReLER* + NaQ ReLER*
9
Query: What button did I press?
Template: What X did I Y?
NaQ failure: confuses turning stove knob with button press
Ground-truth ReLER* + NaQ ReLER*

NaQ benefits performance on queries about long-tail objects

Here, we visualize several examples of queries about long-tail objects where training with NaQ improves over the ReLER* baseline. We underline the long-tail object in each natural language query. Examples 1-3 correspond to mid-shot objects (10-50 training samples), and examples 4-5 correspond to low-shot objects (2-10 training samples). ReLER* fails to identify the query object due to the lack of NLQ training samples, while NaQ succeeds by drawing on knowledge from the densely annotated narration data.
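
Before the examples, here is a rough sketch of how the low-/mid-shot split could be computed, assuming the query object has already been extracted from each NLQ training sample; the exact threshold boundaries are an assumption, and only the 2-10 and 10-50 ranges come from the analysis above.

from collections import Counter
from typing import Dict, List

def shot_buckets(query_objects: List[str]) -> Dict[str, str]:
    """Bucket each query object by how often it appears in NLQ training
    queries: low-shot (roughly 2-10 samples) vs. mid-shot (roughly 10-50)."""
    counts = Counter(query_objects)
    buckets = {}
    for obj, c in counts.items():
        if 2 <= c <= 10:
            buckets[obj] = "low-shot"
        elif 11 <= c <= 50:
            buckets[obj] = "mid-shot"
        else:
            buckets[obj] = "other"
    return buckets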

1
Query: Where was the soap before I picked it?
ReLER* failure: misidentifies a scrubber as the soap
Ground-truth ReLER* + NaQ ReLER*
2
Query: What color is the toilet bin?
ReLER* failure: does not recognize a toilet bin
Ground-truth ReLER* + NaQ ReLER*
3
Query: Where did I last put the sieve?
ReLER* failure: confuses a plate for the sieve
Ground-truth ReLER* + NaQ ReLER*
4
Query: What did I take in the lifter?
ReLER* failure: does not recognize a lifter
Ground-truth ReLER* + NaQ ReLER*
5
Query: Where was the brake pad before I took it?
ReLER* failure: misidentifies a spanner as the brake pad
Ground-truth ReLER* + NaQ ReLER*

NaQ facilitates zero-shot NLQ

In Section 4.3 of our paper, we discussed how NaQ facilitates zero-shot NLQ, i.e., training models with no NLQ training data and only NaQ training data. We further demonstrated in Section 5 of the supplementary that, on a subset of object/place templates, this zero-shot performance matches or outperforms a baseline trained on the full NLQ train set (and no NaQ). We now show qualitative examples of zero-shot NLQ predictions enabled by NaQ.

1
Query: Where was the phone before I operated it?
Ground-truth NaQ (zero-shot)
2
Query: What color is the towel I wiped hands with?
Ground-truth NaQ (zero-shot)
3
Query: Where did i put the planks i packed from the floor?
Ground-truth NaQ (zero-shot)
4
Query: Where did I put the boots?
Ground-truth NaQ (zero-shot)
5
Query: Did I rinse the knife?
Ground-truth NaQ (zero-shot)