UniversalVTG: A Universal and Lightweight Foundation Model for Video Temporal Grounding

The University of Texas at Austin

We introduce UniversalVTG, a universal and lightweight foundation model for video temporal grounding, trained with large-scale cross-dataset supervision and canonicalized textual inputs to enable strong generalization across heterogeneous domains and query formulations.

Demo Video

Interactive Demo

Try UniversalVTG online — type a natural-language query and see temporal grounding results in real time.

Launch Demo

Abstract

Video temporal grounding (VTG) is often tackled with dataset-specific models that transfer poorly across domains and query styles. Recent efforts adapt large multimodal language models (MLLMs), but their high compute cost and limited video context hinder long-video grounding. We instead scale unified supervision while keeping the model lightweight. We present UniversalVTG, a single VTG model trained with large-scale cross-dataset pretraining. An offline Query Unifier canonicalizes heterogeneous query formats into a shared declarative space, reducing linguistic mismatch and preventing negative transfer under naïve joint training. Combined with an efficient grounding head, UniversalVTG scales to long, untrimmed videos. Across diverse benchmarks—GoalStep, Ego4D-NLQ, TACoS, Charades-STA, and ActivityNet-Captions—a single UniversalVTG checkpoint achieves state-of-the-art performance among dedicated VTG models. Despite being >100× smaller than recent MLLM-based approaches, it matches or exceeds their accuracy, offering a practical and efficient alternative for real-world deployment.


UniversalVTG Framework. A single, lightweight model generalizes across heterogeneous video domains and query styles. Left: Diverse videos are mapped into a shared representation via an efficient backbone, while a Query Unifier canonicalizes multi-style text inputs into a unified semantic space. Right: With both modalities standardized, a single grounding head localizes events across varied viewpoints (ego/exo), durations (short/long), and linguistic forms (e.g., questions, declarations). UniversalVTG achieves real-time inference (~10 ms/query), making it suitable for long-form video deployment.
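To make the Query Unifier's role concrete, here is a minimal, purely illustrative sketch of query canonicalization. The paper's actual unifier is a learned offline component; the toy rules, function name, and query patterns below are assumptions chosen only to show the idea of mapping heterogeneous query styles (questions, commands, captions) into one shared declarative form.

```python
import re

def canonicalize(query: str) -> str:
    """Toy rule-based canonicalizer (NOT the paper's method): maps a query
    of any style to a declarative phrase in a shared format."""
    q = query.strip().rstrip("?.!").lower()
    # Ego4D-NLQ-style first-person questions, e.g. "Where did I put my keys?"
    m = re.match(r"(where|when) did i (.+)", q)
    if m:
        return f"the camera wearer {m.group(2)}"
    # Imperative retrieval phrasing, e.g. "Find the moment the person ..."
    if q.startswith("find the moment "):
        return q[len("find the moment "):]
    # Declarative captions (Charades-STA / ActivityNet style) pass through.
    return q

print(canonicalize("Where did I put my keys?"))
# -> the camera wearer put my keys
print(canonicalize("Find the moment the person opens the fridge."))
# -> the person opens the fridge
```

With queries standardized this way, a single grounding head can be trained jointly across datasets without each benchmark's linguistic idiosyncrasies causing negative transfer.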

Method


Universal without compromise: a single UniversalVTG checkpoint rivals dataset-specific SOTA models.

BibTeX

@article{an2026universalvtg,
  title={UniversalVTG: A Universal and Lightweight Foundation Model for Video Temporal Grounding},
  author={An, Joungbin and Jain, Agrim and Grauman, Kristen},
  journal={},
  year={2026}
}