Video-Mined Task Graphs for Keystep Recognition in Instructional Videos

Kumar Ashutosh1,2, Santhosh Kumar Ramakrishnan1,
Triantafyllos Afouras2, Kristen Grauman1,2
1UT Austin, 2FAIR, Meta

NeurIPS 2023


Method Overview

Procedural activity understanding requires perceiving human actions in terms of a broader task, where multiple keysteps are performed in sequence across a long video to reach a final goal state, such as the steps of a recipe or a DIY fix-it task. Prior work largely treats keystep recognition in isolation from this broader structure, or else rigidly confines keysteps to align with a predefined sequential script. We propose to automatically discover a task graph from how-to videos that represents, probabilistically, how people tend to execute keysteps, and then to leverage this graph to regularize keystep recognition in novel videos. On multiple datasets of real-world instructional videos, we show the impact: more reliable zero-shot keystep localization and improved video representation learning, exceeding the state of the art.
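
To make the two-stage idea concrete, here is a minimal sketch, not the paper's actual pipeline: it stands in for the mined task graph with a simple first-order transition matrix estimated from keystep sequences, and regularizes per-clip keystep probabilities with Viterbi-style decoding. All function names, the smoothing and weight parameters, and the toy data are illustrative assumptions.

import numpy as np

def build_task_graph(keystep_sequences, num_keysteps, smoothing=1e-3):
    # Count keystep-to-keystep transitions observed in training videos and
    # normalize each row into a probability distribution over next keysteps.
    counts = np.full((num_keysteps, num_keysteps), smoothing)
    for seq in keystep_sequences:
        for prev, nxt in zip(seq[:-1], seq[1:]):
            counts[prev, nxt] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def graph_regularized_decode(clip_probs, transitions, weight=0.5):
    # Viterbi-style decoding: combine per-clip keystep probabilities (T x K)
    # with task-graph transition priors to get a globally consistent labeling.
    T, K = clip_probs.shape
    log_emit = np.log(clip_probs + 1e-9)
    log_trans = weight * np.log(transitions + 1e-9)
    score = np.full((T, K), -np.inf)
    back = np.zeros((T, K), dtype=int)
    score[0] = log_emit[0]
    for t in range(1, T):
        total = score[t - 1][:, None] + log_trans   # rows: previous keystep
        back[t] = total.argmax(axis=0)
        score[t] = total.max(axis=0) + log_emit[t]
    path = [int(score[-1].argmax())]                # backtrack the best path
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy usage: 3 keysteps, hypothetical mined sequences, noisy per-clip scores.
graph = build_task_graph([[0, 1, 2], [0, 2], [0, 1, 1, 2]], num_keysteps=3)
clip_probs = np.array([[0.7, 0.2, 0.1],
                       [0.3, 0.5, 0.2],
                       [0.1, 0.2, 0.7]])
print(graph_regularized_decode(clip_probs, graph))  # -> [0, 1, 2]

The paper's task graph and inference are richer than this first-order sketch, but the flow it illustrates is the same: mine transition statistics from how-to videos, then bias per-clip predictions toward keystep paths that are consistent with the graph.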


Citation

@inproceedings{ashutosh2023videomined,
 author = {Ashutosh, Kumar and Ramakrishnan, Santhosh Kumar and Afouras, Triantafyllos and Grauman, Kristen},
 booktitle = {Advances in Neural Information Processing Systems},
 editor = {A. Oh and T. Naumann and A. Globerson and K. Saenko and M. Hardt and S. Levine},
 pages = {67833--67846},
 publisher = {Curran Associates, Inc.},
 title = {Video-Mined Task Graphs for Keystep Recognition in Instructional Videos},
 url = {https://proceedings.neurips.cc/paper_files/paper/2023/file/d62e65cfdba247e0cd7cac5964f9fbd9-Paper-Conference.pdf},
 volume = {36},
 year = {2023}
}
Acknowledgements

UT Austin is supported in part by the IFML NSF AI institute. KG is paid as a research scientist at Meta. We thank the authors of Distant Supervision and Paprika for releasing their codebases.


Copyright © 2023 University of Texas at Austin