Video-Mined Task Graphs for Keystep Recognition in Instructional Videos

Kumar Ashutosh1,2, Santhosh Kumar Ramakrishnan1,
Triantafyllos Afouras2, Kristen Grauman1,2
1UT Austin, 2FAIR, Meta

NeurIPS 2023

Method Overview

Procedural activity understanding requires perceiving human actions in terms of a broader task, where multiple keysteps are performed in sequence across a long video to reach a final goal state, such as the steps of a recipe or a DIY fix-it task. Prior work largely treats keystep recognition in isolation from this broader structure, or else rigidly confines keysteps to align with a predefined sequential script. We propose discovering a task graph automatically from how-to videos to represent probabilistically how people tend to execute keysteps, and then leverage this graph to regularize keystep recognition in novel videos. On multiple datasets of real-world instructional videos, we show the impact: more reliable zero-shot keystep localization and improved video representation learning, exceeding the state of the art.
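The core idea above can be sketched in two steps: (1) mine a probabilistic task graph by counting keystep-to-keystep transitions across many videos of a task, and (2) use that graph as a transition prior to regularize per-clip keystep scores in a novel video, e.g. via Viterbi decoding. The snippet below is a minimal illustration of this pattern, not the paper's actual implementation; the function names, the add-one-style smoothing, and the tiny toy data are all assumptions for the sketch.

```python
import numpy as np

def mine_task_graph(keystep_sequences, num_keysteps, smoothing=1e-3):
    """Estimate a keystep transition matrix (the 'task graph') from
    keystep sequences observed across many how-to videos."""
    counts = np.full((num_keysteps, num_keysteps), smoothing)
    for seq in keystep_sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1.0
    # Normalize rows so each row is P(next keystep | current keystep).
    return counts / counts.sum(axis=1, keepdims=True)

def viterbi_decode(emission_probs, transition_probs):
    """Most likely keystep sequence for one video, combining per-clip
    keystep scores (emissions) with the mined task-graph prior."""
    T, K = emission_probs.shape
    log_e = np.log(emission_probs + 1e-12)
    log_t = np.log(transition_probs + 1e-12)
    dp = np.zeros((T, K))
    back = np.zeros((T, K), dtype=int)
    dp[0] = log_e[0]
    for t in range(1, T):
        scores = dp[t - 1][:, None] + log_t  # scores[i, j]: come from i, go to j
        back[t] = scores.argmax(axis=0)
        dp[t] = scores.max(axis=0) + log_e[t]
    # Backtrack from the best final keystep.
    path = [int(dp[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy example: three keysteps (0, 1, 2) mined from three videos.
graph = mine_task_graph([[0, 1, 2], [0, 1, 2], [0, 2]], num_keysteps=3)
# Noisy per-clip scores: the middle clip's raw argmax wrongly favors keystep 2.
emissions = np.array([[0.90, 0.05, 0.05],
                      [0.20, 0.35, 0.45],
                      [0.05, 0.05, 0.90]])
decoded = viterbi_decode(emissions, graph)  # graph prior recovers [0, 1, 2]
```

Here the task graph corrects the middle clip to keystep 1, because the mined transitions 0→1→2 outweigh the noisy emission score, which is exactly the regularization effect described above.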

BibTeX


@inproceedings{ashutosh2023videomined,
 author = {Ashutosh, Kumar and Ramakrishnan, Santhosh Kumar and Afouras, Triantafyllos and Grauman, Kristen},
 booktitle = {Advances in Neural Information Processing Systems},
 editor = {A. Oh and T. Neumann and A. Globerson and K. Saenko and M. Hardt and S. Levine},
 pages = {67833--67846},
 publisher = {Curran Associates, Inc.},
 title = {Video-Mined Task Graphs for Keystep Recognition in Instructional Videos},
 url = {},
 volume = {36},
 year = {2023}
}

UT Austin is supported in part by the IFML NSF AI Institute. KG is employed as a research scientist at Meta. We thank the authors of Distant Supervision and Paprika for releasing their codebases.

Copyright © 2023 University of Texas at Austin