Stitch-a-Demo: Video Demonstrations from Multistep Descriptions

Chi Hsuan Wu*, Kumar Ashutosh*, Kristen Grauman
UT Austin
* Equal contribution
arXiv 2025


Method Overview

When obtaining visual illustrations from text descriptions, today's methods take a description with a single text context (a caption or an action description) and retrieve or generate the matching visual counterpart. However, prior work does not permit visual illustration of multistep descriptions, e.g., a cooking recipe or a gardening instruction manual, and handling each step description in isolation would yield an incoherent demonstration. We propose Stitch-a-Demo, a novel retrieval-based method that assembles a video demonstration from a multistep description. The resulting video contains clips, possibly sourced from different videos, that accurately reflect all of the step descriptions while remaining visually coherent. To train the model, we formulate a pipeline that creates large-scale weakly supervised data spanning diverse procedures and injects hard negatives that promote both correctness and coherence. Validated on in-the-wild instructional videos, Stitch-a-Demo achieves state-of-the-art performance, with gains of up to 29% as well as dramatic wins in a human preference study.
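The mechanics of the retrieval are not spelled out on this page, but the abstract's two requirements, per-step correctness and clip-to-clip coherence, map naturally onto a Viterbi-style dynamic program over candidate clips. Below is a minimal sketch under assumed interfaces: stitch_demo, lam, and the precomputed embedding arrays are illustrative placeholders, not the paper's actual pipeline.

```python
# Sketch (not the authors' implementation) of coherence-aware clip retrieval:
# given one text embedding per step and a pool of candidate clips per step,
# pick one clip per step maximizing text-clip similarity plus a coherence
# bonus between consecutive clips, solved exactly with dynamic programming.
import numpy as np

def stitch_demo(step_embs, clip_embs, clip_feats, lam=0.5):
    """step_embs: (S, d) text embeddings, one per step description.
    clip_embs:  list of S arrays, (N_s, d) text-aligned clip embeddings.
    clip_feats: list of S arrays, (N_s, d) visual features for coherence.
    lam: weight trading off per-step correctness against coherence.
    Returns a list of S clip indices, one per step."""
    S = len(step_embs)

    def cos(a, B):
        # Cosine similarity between one vector and each row of B.
        return B @ a / (np.linalg.norm(B, axis=1) * np.linalg.norm(a) + 1e-8)

    # Per-step correctness score for every candidate clip.
    score = [cos(step_embs[s], clip_embs[s]) for s in range(S)]

    # dp[s][j]: best cumulative score ending with clip j at step s.
    dp, back = [score[0]], []
    for s in range(1, S):
        # Pairwise visual coherence between consecutive steps' candidates.
        A = clip_feats[s - 1]
        B = clip_feats[s]
        A = A / (np.linalg.norm(A, axis=1, keepdims=True) + 1e-8)
        B = B / (np.linalg.norm(B, axis=1, keepdims=True) + 1e-8)
        trans = dp[-1][:, None] + lam * (A @ B.T)  # (N_{s-1}, N_s)
        back.append(trans.argmax(axis=0))
        dp.append(trans.max(axis=0) + score[s])

    # Backtrack the highest-scoring sequence of clip indices.
    path = [int(dp[-1].argmax())]
    for b in reversed(back):
        path.append(int(b[path[-1]]))
    return path[::-1]
```

A greedy per-step argmax would satisfy correctness alone; the transition term is what lets a slightly lower-scoring clip win when it keeps the assembled demonstration visually consistent, which is the coherence property the abstract emphasizes.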

Project Overview


Examples


Citation

@misc{wu2025stitchademovideodemonstrationsmultistep,
      title={Stitch-a-Demo: Video Demonstrations from Multistep Descriptions},
      author={Chi Hsuan Wu and Kumar Ashutosh and Kristen Grauman},
      year={2025},
      eprint={2503.13821},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.13821},
}

Copyright © 2025 University of Texas at Austin