Santhosh Kumar Ramakrishnan¹   Ziad Al-Halah²   Kristen Grauman¹,³
¹UT Austin   ²University of Utah   ³FAIR, Meta AI
Searching long egocentric videos with natural language queries (NLQ) has compelling applications in augmented reality and robotics, where a fluid index into everything that a person (agent) has seen before could augment human memory and surface relevant information on demand. However, the structured nature of the learning problem (free-form text query inputs, localized video temporal window outputs) and its needle-in-a-haystack nature make it both technically challenging and expensive to supervise. We introduce Narrations-as-Queries (NaQ), a data augmentation strategy that transforms standard video-text narrations into training data for a video query localization model. Validating our idea on the Ego4D benchmark, we find it has tremendous impact in practice. NaQ improves multiple top models by substantial margins (even doubling their accuracy), and yields the very best results to date on the Ego4D NLQ challenge, soundly outperforming all challenge winners in the CVPR and ECCV 2022 competitions and topping the current public leaderboard. Beyond achieving the state of the art for NLQ, we also demonstrate unique properties of our approach, such as the ability to perform zero-shot and few-shot NLQ, and improved performance on queries about long-tail object categories.
@inproceedings{ramakrishnan2023naq,
  author       = {Ramakrishnan, Santhosh K. and Al-Halah, Ziad and Grauman, Kristen},
  title        = {NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory},
  booktitle    = {Computer Vision and Pattern Recognition (CVPR), 2023 IEEE Conference on},
  year         = {2023},
  organization = {IEEE},
}
Existing NLQ methods are relatively starved for data. Each NLQ annotation contains a ⟨video, query, response⟩ tuple, where the query is a natural language question and the response is a temporal window (t_start, t_end). Each query is expensive to annotate, since it requires posing a creative question, scrolling through the long video, and carefully marking the temporal window corresponding to the response. As a result, NLQ annotations are sparse and cover only a small fraction of all the video content.
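To make the annotation format concrete, here is a minimal sketch of one NLQ training sample. The class and field names are hypothetical and only illustrate the ⟨video, query, response⟩ structure; the actual Ego4D NLQ annotation schema differs.

```python
from dataclasses import dataclass

@dataclass
class NLQSample:
    """One ⟨video, query, response⟩ annotation (illustrative schema only)."""
    video_uid: str   # which egocentric video the query is asked about
    query: str       # free-form natural language question
    t_start: float   # response window start, in seconds
    t_end: float     # response window end, in seconds

# e.g., a ~9-second response window inside an 8-minute (480 s) video
sample = NLQSample(
    video_uid="clip_0001",
    query="Where did I put the keys?",
    t_start=412.0,
    t_end=421.0,
)
```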
In Narrations-as-Queries (NaQ), we overcome this limitation by leveraging narrations as queries to supervise an Episodic Memory system for natural language queries (NLQ). Narrations are timestamped, play-by-play descriptions of the camera-wearer's activity, and each narration annotation consists of a ⟨narration, timestamp⟩ tuple. Narrations are easier to annotate, since the annotator only needs to pause the video at regular intervals and type out what happened. Therefore, they are densely annotated throughout the video and are available at a much larger scale.
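To make the contrast concrete, the sketch below converts timestamped narrations into NLQ-style training samples by placing a fixed-width response window around each narration timestamp. This is a simplification for illustration (the paper generates the window with a temporal response jittering scheme), and the function and field names here are hypothetical.

```python
def narrations_to_queries(narrations, half_width=2.0):
    """Convert ⟨narration, timestamp⟩ pairs into NLQ-style training samples.

    narrations: iterable of (video_uid, narration_text, timestamp_sec) tuples.
    half_width: assumed half-width (in seconds) of the response window placed
                around each narration timestamp; a stand-in for the paper's
                temporal response jittering.
    """
    samples = []
    for video_uid, text, t in narrations:
        samples.append({
            "video_uid": video_uid,
            "query": text,                        # the narration text is reused as the query
            "t_start": max(0.0, t - half_width),  # clamp to the start of the video
            "t_end": t + half_width,
        })
    return samples

# Example: one Ego4D-style narration yields one extra NLQ-style training sample.
extra_samples = narrations_to_queries([("clip_0001", "#C C picks up the kettle", 37.5)])
```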
Our proposed NaQ is a simple-yet-effective approach to leverage large-scale narrations to augment NLQ training (see below). Next, we introduce our interface to visualize NLQ predictions from our model.
We briefly introduce our visualization interface through an NLQ example below. It consists of the query at the top, the egocentric video in the middle, and the ground-truth (GT) / predicted (Pred.) temporal windows (in seconds) at the bottom. For the predictions, we additionally report the IoU between the GT and Pred. windows, where IoU >= 0.5 is a success. Note how short the GT response (9 seconds) can be relative to the complete video (480 seconds). We fast forward through most of the video, and slow down to highlight frames from the ground-truth (green) or predicted (red / blue) temporal windows.
For brevity, we compress the example into the following interface. We display the query at the top. We then show video snippets corresponding to the ground-truth temporal window, our inferred temporal window with NaQ, and the baseline's inferred temporal window without NaQ. Finally, we plot the temporal windows to show their locations and extents in the full video, along with the IoUs for the predicted windows.
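For reference, the IoU above is the standard 1-D temporal overlap between the ground-truth and predicted windows; a minimal sketch (with a hypothetical helper name) is shown below.

```python
def temporal_iou(gt, pred):
    """1-D IoU between two temporal windows given as (start_sec, end_sec)."""
    inter = max(0.0, min(gt[1], pred[1]) - max(gt[0], pred[0]))
    union = (gt[1] - gt[0]) + (pred[1] - pred[0]) - inter
    return inter / union if union > 0 else 0.0

# e.g., predicting (470 s, 481 s) against a 9 s ground truth of (471 s, 480 s)
# gives IoU = 9 / 11 ≈ 0.82, which counts as a success (IoU >= 0.5).
temporal_iou((471.0, 480.0), (470.0, 481.0))
```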
Using this interface, we next visualize several examples to highlight the strengths and weaknesses of NaQ.
We visualize success and failure cases of NaQ on different query templates (the template is shown below each query). Examples 1 to 6 are success cases, where training with NaQ (ReLER* + NaQ) leads to successful predictions (i.e., IoU >= 0.5), while the baseline ReLER* fails (i.e., IoU < 0.5). Examples 7 to 9 are failure cases of NaQ.
Example 1
Query: Who did I interact with when I played with the dog for the second time in the living room?
Template: Who did I interact with when I did activity X?
Video panels: Ground-truth | ReLER* + NaQ | ReLER*
Example 2
Query: What device did I pick up from the saw table?
Template: What X did I Y?
Video panels: Ground-truth | ReLER* + NaQ | ReLER*
Example 3
Query: Where did I put the pieces of meat?
Template: Where did I put X?
Video panels: Ground-truth | ReLER* + NaQ | ReLER*
Example 4
Query: How many funnels are on the shelf?
Template: How many X's?
Video panels: Ground-truth | ReLER* + NaQ | ReLER*
Example 5
Query: Where was the brush before I picked it up?
Template: Where is object X before / after event Y?
Video panels: Ground-truth | ReLER* + NaQ | ReLER*
Example 6
Query: What did I add into the plate?
Template: What did I put in X?
Video panels: Ground-truth | ReLER* + NaQ | ReLER*
Example 7
Query: Where did I pick the cling wrap?
Template: Where did I put X?
NaQ failure: overlap with ground-truth below 0.50
Video panels: Ground-truth | ReLER* + NaQ | ReLER*
Example 8
Query: Where was the pink flower?
Template: Where is object X?
NaQ failure: confuses flower design on paper plate with pink flower
Video panels: Ground-truth | ReLER* + NaQ | ReLER*
Example 9
Query: What button did I press?
Template: What X did I Y?
NaQ failure: confuses turning stove knob with button press
Video panels: Ground-truth | ReLER* + NaQ | ReLER*
Here, we visualize several examples of queries about long-tail objects where training with NaQ improves the ReLER* baseline. We underline the long-tail object in the natural language query. Examples 1-3 correspond to mid-shot objects (10-50 training samples), and Examples 4-5 correspond to low-shot objects (2-10 training samples). ReLER* fails to identify the query object due to the lack of NLQ training samples, while NaQ succeeds by using knowledge from densely annotated narrations data.
Example 1
Query: Where was the soap before I picked it?
ReLER* failure: misidentifies a scrubber as the soap
Video panels: Ground-truth | ReLER* + NaQ | ReLER*
Example 2
Query: What color is the toilet bin?
ReLER* failure: does not recognize a toilet bin
Video panels: Ground-truth | ReLER* + NaQ | ReLER*
Example 3
Query: Where did I last put the sieve?
ReLER* failure: confuses a plate for the sieve
Video panels: Ground-truth | ReLER* + NaQ | ReLER*
Example 4
Query: What did I take in the lifter?
ReLER* failure: does not recognize a lifter
Video panels: Ground-truth | ReLER* + NaQ | ReLER*
Example 5
Query: Where was the brake pad before I took it?
ReLER* failure: misidentifies a spanner as the brake pad
Video panels: Ground-truth | ReLER* + NaQ | ReLER*
In Section 4.3 of our paper, we discussed how NaQ can facilitate zero-shot NLQ, i.e., training models with no NLQ training data and only NaQ training data. We further demonstrated in Section 5 of our supplementary material that, on a subset of object/place templates, the zero-shot performance matches or outperforms a baseline that uses the whole NLQ train set (and no NaQ). We now show qualitative examples of zero-shot NLQ predictions enabled by NaQ.
Example 1
Query: Where was the phone before I operated it?
Video panels: Ground-truth | NaQ (zero-shot)
Example 2
Query: What color is the towel I wiped hands with?
Video panels: Ground-truth | NaQ (zero-shot)
Example 3
Query: Where did i put the planks i packed from the floor?
Video panels: Ground-truth | NaQ (zero-shot)
Example 4
Query: Where did I put the boots?
Video panels: Ground-truth | NaQ (zero-shot)
Example 5
Query: Did I rinse the knife?
Video panels: Ground-truth | NaQ (zero-shot)