Long-video question answering forces multimodal large language models to reason under strict visual-token budgets, creating a tradeoff between broad temporal coverage and fine-grained spatial detail. Uniform sampling preserves coverage but often obscures decisive local cues, while caption-based or frame-similarity retrieval operates over lossy proxies that can miss motion, state changes, and before-after relations. We recast long-video reasoning as a visual token allocation problem: given a fixed budget, decide how much to spend on global context versus high-resolution local evidence, and where to draw that local evidence from. We introduce AdaAlloc, a training-free inference framework that adaptively allocates visual tokens between a global video overview and localized visual detail. AdaAlloc first plans a global/local allocation policy, then uses temporal grounding to locate candidate evidence segments and refines them into a compact temporal memory. The final answerer reasons over the allocated global context and refined local evidence. Under matched backbones and visual-token budgets, AdaAlloc consistently improves over strong baselines on challenging long-video benchmarks, including LVBench, MLVU, VideoMME, and LongVideoBench.
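To make the budget split concrete, the following is a minimal sketch of dividing a fixed visual-token budget between a low-resolution global overview and high-resolution local frames. It is not the AdaAlloc implementation. The names `Segment` and `allocate_tokens`, the per-frame token costs, the fixed `global_fraction` (standing in for the planned allocation policy), and the score-proportional split across grounded segments are all assumptions made for illustration.

```python
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    """Hypothetical candidate evidence segment returned by a temporal grounder."""
    start_s: float
    end_s: float
    score: float  # assumed non-negative relevance score


def allocate_tokens(
    budget: int,
    global_fraction: float,
    tokens_per_global_frame: int,
    tokens_per_local_frame: int,
    segments: List[Segment],
) -> Tuple[int, List[int]]:
    """Return (number of global frames, local frames per segment).

    Assumption: global frames are cheap (low resolution) and local frames
    are expensive (high resolution); both costs are illustrative inputs.
    """
    if not 0.0 <= global_fraction <= 1.0:
        raise ValueError("global_fraction must lie in [0, 1]")
    if tokens_per_global_frame <= 0 or tokens_per_local_frame <= 0:
        raise ValueError("per-frame token costs must be positive")

    # Spend the global share on uniformly sampled overview frames.
    n_global = int(budget * global_fraction) // tokens_per_global_frame

    # Whatever the overview does not use is available for local detail.
    local_tokens = budget - n_global * tokens_per_global_frame
    n_local = local_tokens // tokens_per_local_frame

    per_segment = [0] * len(segments)
    total_score = sum(s.score for s in segments)
    if segments and total_score > 0:
        # Split local frames in proportion to segment scores (rounded down).
        per_segment = [int(n_local * s.score / total_score) for s in segments]
        # Hand frames lost to rounding to the highest-scoring segments.
        leftover = n_local - sum(per_segment)
        ranked = sorted(range(len(segments)),
                        key=lambda i: segments[i].score, reverse=True)
        for i in ranked[:leftover]:
            per_segment[i] += 1
    return n_global, per_segment


if __name__ == "__main__":
    candidates = [
        Segment(12.0, 20.0, 6.0),
        Segment(305.0, 318.0, 3.0),
        Segment(940.0, 948.0, 1.0),
    ]
    print(allocate_tokens(8192, 0.5, 64, 256, candidates))
```

With the illustrative numbers above (an 8192-token budget, half of it reserved for the overview, 64 tokens per global frame and 256 per local frame), the sketch yields 64 global frames and 10, 5, and 1 local frames for the three segments, which together use the full budget. A query-dependent policy would replace the fixed `global_fraction` with a value chosen per question.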
@article{an2026adalloc,
title={AdaAlloc: Adaptive Visual Token Allocation for Long-Video Question Answering},
author={An, Joungbin and Grauman, Kristen},
journal={arXiv preprint arXiv:2605},
year={2026}
}