HieraMamba: Video Temporal Grounding via Hierarchical Anchor-Mamba Pooling

The University of Texas at Austin

We propose HieraMamba, a hierarchical video grounding model that preserves temporal fidelity across scales through anchor-based Mamba pooling and dual contrastive losses, enabling precise and efficient localization of natural language queries in hour-long videos.

[Teaser figure]

HieraMamba enables hierarchical, linear-time temporal grounding in long untrimmed videos. Top row: a cooking clip and two queries; Q1 spans a long interval, Q2 a very short one. Middle left: uniform down-sampling (gray squares) drops frames and loses evidence for the queries. Bottom left: fixed sliding windows fragment relevant segments at window boundaries (red dashed lines). Right: HieraMamba uses our stacked Anchor-MambaPooling (AMP) blocks to construct a multi-scale temporal hierarchy for precise, query-specific localization across levels. For example, the brief ‘stove on’ moment in Q2 is captured by fine-scale embeddings in the first layer, while Q1’s broader context (‘prepping ingredients’) is naturally represented by the longer, coarser embeddings at the top layer.

Abstract

Video temporal grounding, the task of localizing the start and end times of a natural language query in untrimmed video, requires capturing both global context and fine temporal detail. This challenge is particularly pronounced in long videos, where existing methods often compromise temporal fidelity by over-downsampling or relying on fixed windows. We present HieraMamba, a hierarchical architecture that preserves temporal structure and semantic richness across scales. At its core are Anchor-MambaPooling (AMP) blocks, which leverage Mamba’s selective scanning to produce compact anchor tokens that summarize video content at multiple granularities. Two complementary objectives, anchor-conditioned and segment-pooled contrastive losses, encourage anchors to retain local detail while remaining globally discriminative. HieraMamba sets a new state-of-the-art on Ego4D-NLQ, MAD, and TACoS, demonstrating precise, temporally faithful localization in long, untrimmed videos.
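
The two complementary objectives mentioned above are contrastive. As a rough illustration of how such a pair of losses could be wired up, the sketch below uses a generic InfoNCE formulation in PyTorch; the temperature, the choice of positives and negatives, and the function names are assumptions made for illustration, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def info_nce(query, positives, negatives, temperature=0.07):
    # Generic InfoNCE: pull `query` toward each positive and push it away
    # from a shared pool of negatives.
    # query: (D,), positives: (P, D), negatives: (N, D).
    q = F.normalize(query, dim=-1)
    pos = F.normalize(positives, dim=-1) @ q / temperature          # (P,)
    neg = F.normalize(negatives, dim=-1) @ q / temperature          # (N,)
    logits = torch.cat([pos.unsqueeze(1),
                        neg.unsqueeze(0).expand(len(pos), -1)], dim=1)
    labels = torch.zeros(len(pos), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)

def anchor_conditioned_loss(anchor, own_frames, distant_anchors):
    # ACC (sketch): keep an anchor close to the frame tokens it summarizes
    # and away from anchors that summarize temporally distant content.
    return info_nce(anchor, own_frames, distant_anchors)

def segment_pooled_loss(segment_feature, query_feature, context_features):
    # SPC (sketch): align the pooled ground-truth segment with the language
    # query, contrasting against pooled features of the surrounding context.
    return info_nce(query_feature, segment_feature.unsqueeze(0), context_features)

d = 256
total = (anchor_conditioned_loss(torch.randn(d), torch.randn(4, d), torch.randn(32, d))
         + segment_pooled_loss(torch.randn(d), torch.randn(d), torch.randn(16, d)))

Under this reading, the ACC term keeps each anchor faithful to the frames it compresses, while the SPC term ties the pooled ground-truth segment to the query against its temporal context.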

Method

[Architecture overview figure]

Overview of the HieraMamba Architecture. (a) Frozen backbones extract video clip and text token features. The hierarchical video encoder, a stack of L AMP blocks, builds a multi-scale pyramid Vpyr, which is fused with text features and decoded to predict timestamps. (b) Each AMP block receives anchors from the previous layer (A(l)), interleaves them with newly compressed anchors (A(l+1)), applies a bidirectional Mamba scan for global context, and refines local details. The block outputs refined tokens and downsampled anchors A(l+1) that are fed to the next block. Repeating this L times and collecting the refined outputs from all L levels forms the multi-scale hierarchy Vpyr. (c) Two contrastive losses guide training. The self-supervised Anchor-Conditioned Contrastive (ACC) loss enforces hierarchy consistency by pulling anchors toward their constituent frames and pushing them away from distant anchors. The supervised Segment-Pooled Contrastive (SPC) loss provides semantic alignment between ground-truth segments and their surrounding context. Together, they yield compact, distinctive, and query-aligned anchors.
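
To make the block recipe in (b) concrete, the sketch below assembles an AMP-style hierarchy in PyTorch. The bidirectional mixer is a stand-in (a bidirectional GRU) rather than an actual Mamba selective-scan layer, and the 2x anchor pooling rate, feature width, and module names are assumptions made to keep the example self-contained; none of it is the released implementation.

import torch
import torch.nn as nn

class AMPBlockSketch(nn.Module):
    # Illustrative stand-in for an Anchor-MambaPooling (AMP) block: pool the
    # incoming tokens into coarser anchors, interleave anchors with tokens,
    # run a bidirectional sequence mixer for global context, then split the
    # result into refined tokens and downsampled anchors.
    def __init__(self, dim, pool_rate=2):
        super().__init__()
        self.pool_rate = pool_rate
        self.compress = nn.AvgPool1d(kernel_size=pool_rate, stride=pool_rate)
        # A real AMP block would use a bidirectional Mamba scan here.
        self.mixer = nn.GRU(dim, dim // 2, batch_first=True, bidirectional=True)
        self.refine = nn.Linear(dim, dim)

    def forward(self, tokens):                            # tokens: (B, T, D)
        B, T, D = tokens.shape
        r = self.pool_rate
        assert T % r == 0, "sketch assumes T divisible by the pooling rate"
        # 1) Compress each group of r tokens into an anchor candidate.
        anchors = self.compress(tokens.transpose(1, 2)).transpose(1, 2)   # (B, T//r, D)
        # 2) Interleave each group of r tokens with its anchor.
        groups = tokens.view(B, T // r, r, D)
        inter = torch.cat([groups, anchors.unsqueeze(2)], dim=2)          # (B, T//r, r+1, D)
        inter = inter.view(B, T + T // r, D)
        # 3) Bidirectional scan over the interleaved sequence for global context.
        mixed, _ = self.mixer(inter)
        mixed = mixed.view(B, T // r, r + 1, D)
        # 4) Separate refined tokens from the new, downsampled anchors.
        refined = self.refine(mixed[:, :, :r, :]).reshape(B, T, D)
        new_anchors = mixed[:, :, r, :]                                   # (B, T//r, D)
        return refined, new_anchors

def build_pyramid(clip_tokens, num_levels=4, dim=256):
    # Stack AMP-style blocks: each level's refined tokens form one scale of
    # the pyramid, and its anchors become the next level's (shorter) input.
    blocks = nn.ModuleList([AMPBlockSketch(dim) for _ in range(num_levels)])
    pyramid, x = [], clip_tokens
    for block in blocks:
        refined, anchors = block(x)
        pyramid.append(refined)
        x = anchors
    return pyramid

v = torch.randn(1, 64, 256)                  # dummy clip features (B, T, D)
for level, feats in enumerate(build_pyramid(v)):
    print(level, feats.shape)                # (1, 64, 256), (1, 32, 256), ...

Each collected level plays the role of one scale in Vpyr: fine levels keep short moments visible, while deeper levels summarize progressively longer spans.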

Efficiency and Scalability

[Accuracy-compute trade-off figure]

Accuracy-Compute Trade-off. We plot average recall on the MAD-v2 evaluation set against computational cost (FLOPs), with FLOPs measured for a single forward pass on a sequence simulating the ~100-minute average video duration of the MAD dataset. HieraMamba achieves state-of-the-art accuracy at significantly lower computational cost than previous methods. SnAG (Local) is the default configuration with a local self-attention window; SnAG (Global) is a variant with full, non-local self-attention.
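
The gap in this plot follows from how per-layer cost scales with the number of clip tokens: full self-attention grows quadratically with sequence length, while a Mamba-style selective scan grows linearly. The back-of-envelope comparison below illustrates that scaling argument; the token count, feature width, and state size are assumed values, and the resulting numbers are not the FLOPs reported in the figure.

def attention_macs(seq_len, dim):
    # QK^T plus the attention-weighted sum of V: about 2 * L^2 * D
    # multiply-accumulates (projections ignored).
    return 2 * seq_len ** 2 * dim

def selective_scan_macs(seq_len, dim, state_size=16):
    # Mamba-style selective scan: roughly linear in L, with a small per-token
    # recurrent state update of size `state_size` per channel.
    return seq_len * dim * state_size

L = 12_000      # assumed clip-token count for a ~100-minute video
D = 256         # assumed feature width
print(f"full attention : {attention_macs(L, D):.2e} MACs per layer")
print(f"selective scan : {selective_scan_macs(L, D):.2e} MACs per layer")
# At this length the quadratic term dominates by roughly three orders of
# magnitude, which is the scaling behind the trade-off shown above.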

Qualitative Results

[Qualitative results figure]

Qualitative Results. Visualization of queries, ground truth, and our predictions. A single video can contain queries that require grounding short, medium, or long temporal spans, necessitating flexible reasoning at different scales. HieraMamba, with its rich multi-scale semantics, effectively adapts to these varying granularities.

BibTeX

@inproceedings{an2025hieramamba,
    title = {HieraMamba: Video Temporal Grounding via Hierarchical Anchor-Mamba Pooling},
    author = {Joungbin An and Kristen Grauman},
    year = {2025},
    booktitle = {arXiv},
}