Santhosh Kumar Ramakrishnan¹, Ziad Al-Halah², Kristen Grauman¹,³
¹UT Austin, ²University of Utah, ³Meta (FAIR)
The goal in episodic memory (EM) is to search a long egocentric video to answer a natural language query (e.g., "where did I leave my purse?"). Existing EM methods exhaustively extract expensive fixed-length clip features to look everywhere in the video for the answer, which is infeasible for long wearable camera videos that span hours or even days. We propose SpotEM, an approach to achieve efficiency for a given EM method while maintaining good accuracy. SpotEM consists of three key ideas: a novel clip selector that learns to identify promising video regions to search conditioned on the language query; a set of low-cost semantic indexing features that capture the context of rooms, objects, and interactions that suggest where to look; and distillation losses that address optimization issues arising from end-to-end joint training of the clip selector and EM model. Our experiments on 200+ hours of video from the Ego4D EM Natural Language Queries benchmark and three different EM models demonstrate the effectiveness of our approach: computing only 10%-25% of the clip features, we preserve 84%-95%+ of the original EM model's accuracy.
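To make the idea of query-conditioned clip selection concrete, the sketch below shows one way such a selector over cheap per-clip indexing features could be wired up. This is only an illustrative sketch, not the authors' implementation: the module names, feature dimensions, and simple top-k selection rule are all assumptions.

```python
# Minimal sketch of query-conditioned clip selection (illustrative, not the paper's code).
# Assumes pre-extracted low-cost per-clip indexing features (e.g., room/object/interaction
# context) and a language-query embedding; all names and dimensions are hypothetical.
import torch
import torch.nn as nn


class ClipSelector(nn.Module):
    """Scores each clip's relevance to the query from cheap features,
    so only the highest-scoring clips get expensive feature extraction."""

    def __init__(self, feat_dim: int, query_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.clip_proj = nn.Linear(feat_dim, hidden_dim)
        self.query_proj = nn.Linear(query_dim, hidden_dim)
        self.scorer = nn.Sequential(nn.ReLU(), nn.Linear(hidden_dim, 1))

    def forward(self, cheap_feats: torch.Tensor, query_emb: torch.Tensor) -> torch.Tensor:
        # cheap_feats: (num_clips, feat_dim); query_emb: (query_dim,)
        fused = self.clip_proj(cheap_feats) + self.query_proj(query_emb)
        return self.scorer(fused).squeeze(-1)  # (num_clips,) relevance scores


def select_clips(scores: torch.Tensor, budget: float) -> torch.Tensor:
    """Keep the top `budget` fraction of clips (e.g., 0.10-0.25 of the video)."""
    k = max(1, int(budget * scores.numel()))
    return torch.topk(scores, k).indices


if __name__ == "__main__":
    # Usage sketch: only the selected clips would be passed to the expensive
    # EM backbone for full feature extraction and temporal localization.
    num_clips, feat_dim, query_dim = 500, 128, 768
    selector = ClipSelector(feat_dim, query_dim)
    scores = selector(torch.randn(num_clips, feat_dim), torch.randn(query_dim))
    keep = select_clips(scores, budget=0.25)
    print(f"Selected {keep.numel()} of {num_clips} clips")
```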
[PDF] [Poster] [SlidesLive recording]
@inproceedings{ramakrishnan2023spotem,
  author = {Ramakrishnan, Santhosh K. and Al-Halah, Ziad and Grauman, Kristen},
  booktitle = {International Conference on Machine Learning},
  title = {SpotEM: Efficient Video Search for Episodic Memory},
  year = {2023},
  organization = {PMLR},
}
Qualitative examples (videos on the project page): success cases #1-#5 and failure cases #1-#2.