Video temporal grounding (VTG) is often tackled with dataset-specific models that transfer poorly across domains and query styles. Recent efforts adapt large multimodal language models (MLLMs), but their high compute cost and limited video context hinder long-video grounding. We instead scale unified supervision while keeping the model lightweight. We present UniversalVTG, a single VTG model trained with large-scale cross-dataset pretraining. An offline Query Unifier canonicalizes heterogeneous query formats into a shared declarative space, reducing linguistic mismatch and preventing negative transfer under naïve joint training. Combined with an efficient grounding head, UniversalVTG scales to long, untrimmed videos. Across diverse benchmarks—GoalStep, Ego4D-NLQ, TACoS, Charades-STA, and ActivityNet-Captions—a single UniversalVTG checkpoint achieves state-of-the-art performance among dedicated VTG models. Despite being >100× smaller than recent MLLM-based approaches, it matches or exceeds their accuracy, offering a practical and efficient alternative for real-world deployment.