Kumar Ashutosh1,2, Tushar Nagarajan2, Georgios Pavlakos2, Kris Kitani2,3, Kristen Grauman1,2

1UT Austin, 2FAIR, Meta, 3Carnegie Mellon University

CVPR 2025

Feedback is essential for learning a new skill or improving one's current skill level. However, current methods for skill assessment from video only provide scores or compare demonstrations, leaving the burden of knowing what to do differently on the user. We introduce a novel method to generate actionable feedback from video of a person performing a physical activity, such as basketball or soccer. Our method takes a video demonstration and its accompanying 3D body pose and generates (1) free-form expert commentary describing what the person is doing well and what they could improve, and (2) a visual expert demonstration that incorporates the required corrections. We show how to leverage Ego-Exo4D's videos of skilled activity and expert commentary together with a strong language model to create a weakly supervised training dataset for this task, and we devise a multimodal video-language model to infer coaching feedback. Our method reasons across multimodal input combinations to output full-spectrum, actionable coaching: expert commentary, expert video retrieval, and first-of-its-kind expert pose generation. It outperforms strong vision-language models on both established metrics and human preference studies.
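To make the task's input/output contract concrete, below is a minimal sketch; it is not the authors' implementation. The class, function names, and tensor shapes (e.g., 17 joints) are illustrative assumptions, and the multimodal model internals are elided.

from dataclasses import dataclass

import numpy as np


@dataclass
class CoachingFeedback:
    """Output bundle for the full-spectrum coaching task described above."""
    commentary: str             # free-form expert commentary on the performance
    expert_clip_id: str         # identifier of a retrieved expert demonstration video
    corrected_pose: np.ndarray  # (T, J, 3) expert 3D pose incorporating the corrections


def coach(video_features: np.ndarray, body_pose: np.ndarray) -> CoachingFeedback:
    """Hypothetical interface: per-frame video features of shape (T, D) and a
    3D body-pose sequence of shape (T, J, 3) go in; commentary, a retrieved
    expert clip, and a corrected pose sequence come out. A real system would
    fuse both modalities with a multimodal video-language model; this stub
    only makes the input/output contract executable."""
    assert body_pose.ndim == 3 and body_pose.shape[-1] == 3
    return CoachingFeedback(
        commentary="(placeholder) e.g. 'Follow through with your wrist on release.'",
        expert_clip_id="expert_clip_0042",  # placeholder retrieval result
        corrected_pose=body_pose.copy(),    # placeholder: echo the input pose
    )


if __name__ == "__main__":
    T, D, J = 64, 512, 17  # e.g., 64 frames, 512-dim features, 17 joints
    feedback = coach(np.zeros((T, D)), np.zeros((T, J, 3)))
    print(feedback.commentary)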
@misc{ashutosh2024ExpertAF,
title={ExpertAF: Expert Actionable Feedback from Video},
author={Kumar Ashutosh and Tushar Nagarajan and Georgios Pavlakos and Kris Kitani and Kristen Grauman},
year={2024},
eprint={2408.00672},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2408.00672},
}
|
UT Austin is supported in part by the IFML NSF AI Institute. Thanks to Fu-Jen Chu, Jing Huang, and Xitong Yang for help with the Ego-Exo4D [29] pose extraction pipeline.
Copyright © 2024 University of Texas at Austin