AdaAlloc: Adaptive Visual Token Allocation for Long-Video Question Answering

The University of Texas at Austin

We propose AdaAlloc, a training-free framework for long-video question answering. Given a fixed visual-token budget, it adaptively divides tokens between global temporal coverage and high-resolution local evidence through overview-conditioned planning, temporal grounding, and a refined temporal memory, enabling precise and efficient reasoning over minute- to hour-long videos.

Visual token allocation for long-video reasoning. Uniform sampling preserves global coverage but loses fine details, while retrieval-based selection focuses on a few clips but can lose the surrounding context. AdaAlloc splits a fixed token budget between low-resolution global coverage, which supplies context, and high-resolution local evidence, which supports fine-grained verification.

Abstract

Long-video question answering forces multimodal large language models to reason under strict visual-token budgets, creating a tradeoff between broad temporal coverage and fine-grained spatial detail. Uniform sampling preserves coverage but often obscures decisive local cues, while caption-based or frame-similarity retrieval operates over lossy proxies that can miss motion, state changes, and before-after relations. We recast long-video reasoning as a visual token allocation problem: given a fixed budget, decide how much to spend on global context versus high-resolution local evidence, and where to draw that local evidence from. We introduce AdaAlloc, a training-free inference framework that adaptively allocates visual tokens between a global video overview and localized visual detail. AdaAlloc first plans a global/local allocation policy, then uses temporal grounding to locate candidate evidence segments and refines them into a compact temporal memory. The final answerer reasons over the allocated global context and refined local evidence. Under matched backbones and visual-token budgets, AdaAlloc consistently improves over strong baselines on challenging long-video benchmarks, including LVBench, MLVU, VideoMME, and LongVideoBench.
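To make the budget tradeoff concrete, the following is a minimal sketch of how one fixed token budget could be divided between low-resolution global frames and high-resolution local frames. It is an illustration under stated assumptions, not the AdaAlloc implementation: the function `split_budget`, the `Allocation` record, the per-frame token costs, and the policy fractions in `POLICIES` are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Allocation:
    global_frames: int  # low-resolution frames spread over the whole video
    local_frames: int   # high-resolution frames drawn from grounded segments
    tokens_used: int    # total visual tokens consumed, never above the budget


# Hypothetical policy names and local-token fractions (assumed values).
POLICIES = {"global-heavy": 0.25, "balanced": 0.5, "local-heavy": 0.75}


def split_budget(total_tokens: int, local_fraction: float,
                 tokens_per_global_frame: int,
                 tokens_per_local_frame: int) -> Allocation:
    """Divide a fixed visual-token budget between global and local frames."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be in [0, 1]")
    if tokens_per_global_frame <= 0 or tokens_per_local_frame <= 0:
        raise ValueError("per-frame token costs must be positive")

    # Spend the requested share on whole high-resolution frames.
    local_tokens = int(total_tokens * local_fraction)
    local_frames = local_tokens // tokens_per_local_frame

    # Tokens not consumed by local frames go back to global coverage.
    global_tokens = total_tokens - local_frames * tokens_per_local_frame
    global_frames = global_tokens // tokens_per_global_frame

    used = (local_frames * tokens_per_local_frame
            + global_frames * tokens_per_global_frame)
    return Allocation(global_frames, local_frames, used)


if __name__ == "__main__":
    # Assumed costs: 64 tokens per low-res frame, 256 per high-res frame.
    for name, fraction in POLICIES.items():
        print(name, split_budget(8192, fraction, 64, 256))
```

With these assumed costs, every high-resolution frame displaces four low-resolution frames, which is the coverage-versus-detail tradeoff the abstract describes.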

Method

AdaAlloc Pipeline. AdaAlloc casts long-video QA as question-conditioned visual-token allocation. From a sparse overview, it selects a global/local budget policy, generates temporal grounding queries, refines retrieved candidates into a compact temporal memory, and constructs a hybrid input that preserves low-resolution global context while focusing high-resolution tokens on the most relevant evidence.
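The refinement step can be pictured with a small sketch. The page does not specify how candidates are refined into a temporal memory, so the code below substitutes a simple heuristic that is purely an assumption: overlapping candidate segments are merged, the merged segments are ranked by a relevance score, and segments are kept greedily until a duration limit is reached. The names `Segment`, `merge_overlapping`, `build_memory`, and `max_total_sec` are hypothetical.

```python
from typing import List, Tuple

# (start_sec, end_sec, relevance_score); the scoring source is an assumption.
Segment = Tuple[float, float, float]


def merge_overlapping(candidates: List[Segment]) -> List[Segment]:
    """Merge candidates whose time ranges overlap, keeping the best score."""
    merged: List[Segment] = []
    for start, end, score in sorted(candidates):
        if merged and start <= merged[-1][1]:
            prev_start, prev_end, prev_score = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), max(prev_score, score))
        else:
            merged.append((start, end, score))
    return merged


def build_memory(candidates: List[Segment], max_total_sec: float) -> List[Segment]:
    """Greedily keep the highest-scoring merged segments within a duration cap."""
    kept: List[Segment] = []
    remaining = max_total_sec
    ranked = sorted(merge_overlapping(candidates), key=lambda s: s[2], reverse=True)
    for segment in ranked:
        duration = segment[1] - segment[0]
        if duration <= remaining:
            kept.append(segment)
            remaining -= duration
    # Return segments in chronological order so before/after relations survive.
    return sorted(kept)


if __name__ == "__main__":
    retrieved = [(10, 20, 0.6), (15, 30, 0.9), (100, 110, 0.4), (300, 330, 0.8)]
    print(build_memory(retrieved, max_total_sec=55))
```

Returning the kept segments in chronological order is a deliberate choice in this sketch: it keeps the memory compact while preserving the temporal ordering that before-after questions depend on.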

Qualitative Results

Qualitative Example on LVBench. AdaAlloc selects a Local-heavy allocation policy, retrieves multiple temporal candidates, and refines them into a compact memory that preserves the decisive chart interval. By combining low-resolution global context with high-resolution local evidence, it correctly identifies the highlighted number as 25, whereas the Uniform baseline predicts 10.

BibTeX

@article{an2026adalloc,
  title={AdaAlloc: Adaptive Visual Token Allocation for Long-Video Question Answering},
  author={An, Joungbin and Grauman, Kristen},
  journal={arXiv preprint arXiv:2605},
  year={2026}
}