UniversalVTG: A Universal and Lightweight Foundation Model for Video Temporal Grounding

The University of Texas at Austin

We introduce UniversalVTG, a universal and lightweight foundation model for video temporal grounding, trained with large-scale cross-dataset supervision and canonicalized textual inputs to enable strong generalization across heterogeneous domains and query formulations.

Demo Video

Interactive Demo

Try UniversalVTG online — type a natural-language query and see temporal grounding results in real time.

Launch Demo

Abstract

Video temporal grounding (VTG) is often tackled with dataset-specific models that transfer poorly across domains and query styles. Recent efforts adapt large multimodal language models (MLLMs), but their high compute cost and limited video context hinder long-video grounding. We instead scale unified supervision while keeping the model lightweight. We present UniversalVTG, a single VTG model trained with large-scale cross-dataset pretraining. An offline Query Unifier canonicalizes heterogeneous query formats into a shared declarative space, reducing linguistic mismatch and preventing negative transfer under naïve joint training. Combined with an efficient grounding head, UniversalVTG scales to long, untrimmed videos. Across diverse benchmarks—GoalStep, Ego4D-NLQ, TACoS, Charades-STA, and ActivityNet-Captions—a single UniversalVTG checkpoint achieves state-of-the-art performance among dedicated VTG models. Despite being >100× smaller than recent MLLM-based approaches, it matches or exceeds their accuracy, offering a practical and efficient alternative for real-world deployment.


UniversalVTG Framework. A single, lightweight model generalizes across heterogeneous video domains and query styles. Left: Diverse videos are mapped into a shared representation via an efficient backbone, while a Query Unifier canonicalizes multi-style text inputs into a unified semantic space. Right: With both modalities standardized, a single grounding head localizes events across varied viewpoints (ego/exo), durations (short/long), and linguistic forms (e.g., questions, declarations). UniversalVTG achieves real-time inference (~10 ms/query), making it suitable for long-form video deployment.
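To make the Query Unifier's role concrete, here is a minimal, purely illustrative sketch of query canonicalization. The paper's actual unifier is a learned offline component; the toy rules, function name, and query patterns below are assumptions chosen only to show the idea of mapping heterogeneous query styles (questions, commands, captions) into one shared declarative form.

```python
import re

def canonicalize(query: str) -> str:
    """Toy rule-based canonicalizer (NOT the paper's method): maps a query
    of any style to a declarative phrase in a shared format."""
    q = query.strip().rstrip("?.!").lower()
    # Ego4D-NLQ-style first-person questions, e.g. "Where did I put my keys?"
    m = re.match(r"(where|when) did i (.+)", q)
    if m:
        return f"the camera wearer {m.group(2)}"
    # Imperative retrieval phrasing, e.g. "Find the moment the person ..."
    if q.startswith("find the moment "):
        return q[len("find the moment "):]
    # Declarative captions (Charades-STA / ActivityNet style) pass through.
    return q

print(canonicalize("Where did I put my keys?"))
# -> the camera wearer put my keys
print(canonicalize("Find the moment the person opens the fridge."))
# -> the person opens the fridge
```

With queries standardized this way, a single grounding head can be trained jointly across datasets without each benchmark's linguistic idiosyncrasies causing negative transfer.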

Method


Universal without compromise: a single UniversalVTG checkpoint rivals dataset-specific SOTA models.

BibTeX

@article{an2026universalvtg,
  title={UniversalVTG: A Universal and Lightweight Foundation Model for Video Temporal Grounding},
  author={An, Joungbin and Jain, Agrim and Grauman, Kristen},
  journal={},
  year={2026}
}