Seeing the Arrow of Time in Large Multimodal Models

Video

A 6-minute silent video designed to supplement the paper

Arrow of Time (AoT) perception challenges for large multimodal models (LMMs)

Basic visual directionality

Question: Which video is played in reverse?

While the task is intuitive for humans, today's LMMs struggle to tell which of the two videos is reversed (see the first two columns of Table 1 in our paper for quantitative results).

Temporal insensitivity

On 8 popular video question answering (VQA) benchmarks, LMM performance changes little or not at all whether the model processes forward, shuffled, or reversed video frame sequences.
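The frame-order probe described above can be sketched as follows. This is a minimal illustration, not the evaluation code behind the reported results: `perturb_frames`, `accuracy_by_order`, and the `answer_fn` model wrapper are hypothetical names introduced here, and exact-match scoring is an assumption.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple


def perturb_frames(frames: Sequence, mode: str, seed: int = 0) -> List:
    """Return the frames in forward, reversed, or shuffled order."""
    ordered = list(frames)
    if mode == "forward":
        return ordered
    if mode == "reversed":
        return ordered[::-1]
    if mode == "shuffled":
        rng = random.Random(seed)  # fixed seed keeps the probe reproducible
        rng.shuffle(ordered)
        return ordered
    raise ValueError(f"unknown mode: {mode}")


def accuracy_by_order(
    examples: Sequence[Tuple[Sequence, str, str]],
    answer_fn: Callable[[List, str], str],
) -> Dict[str, float]:
    """Score one model under each frame order.

    examples  -- (frames, question, gold_answer) triples
    answer_fn -- hypothetical wrapper around an LMM: (frames, question) -> answer
    Scoring is exact match, which is an assumption of this sketch.
    """
    if not examples:
        raise ValueError("need at least one example")
    results = {}
    for mode in ("forward", "shuffled", "reversed"):
        correct = sum(
            answer_fn(perturb_frames(frames, mode), question) == gold
            for frames, question, gold in examples
        )
        results[mode] = correct / len(examples)
    return results
```

A model whose answer depends only on which frames are present, not on their order, scores identically in all three conditions; that is the signature of temporal insensitivity.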

Key Idea: empower LMMs to see the arrow of time

(1) We propose ArrowRL to enhance LMMs with temporal awareness.

(2) We introduce a new benchmark, AoTBench, to assess LMMs' AoT perception.

Base LMM (Qwen2.5-VL-7B): A gas stove burner ignites and produces a blue flame.

Base LMM + ArrowRL (ours): A gas stove burner ignites and produces a blue flame.

Base LMM (Qwen2.5-VL-7B): The gas stove burner is ignited and producing a steady blue flame.

Base LMM + ArrowRL (ours): The video shows a gas stove burner being turned off, with the blue flame gradually extinguishing.

ArrowRL

We propose ArrowRL, a post-training RL algorithm based on Group Relative Policy Optimization (GRPO) to instill AoT awareness into LMMs. The core idea is a reverse reward that promotes divergence between the model's forward and backward video interpretations, fostering AoT sensitivity for temporally demanding questions.
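To make the idea concrete, here is a minimal sketch of how a reverse reward could feed into GRPO-style group-relative advantages. It rests on several assumptions: this page does not give the exact reward, so word-level Jaccard distance is used purely as a stand-in for "divergence"; the function names are hypothetical; and the use of the population standard deviation follows one common GRPO convention rather than anything stated here.

```python
from statistics import mean, pstdev
from typing import List


def token_dissimilarity(a: str, b: str) -> float:
    """Jaccard distance between word sets: 0.0 = identical, 1.0 = disjoint.

    Stand-in divergence measure (assumption); not the paper's reward.
    """
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 0.0
    return 1.0 - len(sa & sb) / len(sa | sb)


def reverse_rewards(
    forward_responses: List[str], reversed_responses: List[str]
) -> List[float]:
    """Reward each forward-video response by its mean dissimilarity
    to the responses sampled on the reversed video."""
    return [
        mean(token_dissimilarity(f, r) for r in reversed_responses)
        for f in forward_responses
    ]


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of two responses sampled for the forward video.
forward = ["the burner ignites", "the flame goes out"]
backward = ["the flame goes out"]  # response sampled for the reversed video
advantages = group_relative_advantages(reverse_rewards(forward, backward))
```

In this toy group, the forward response that differs from the reversed-video response receives a positive advantage, while the one that merely repeats it receives a negative advantage. A real training setup would presumably combine such a term with other reward signals; how that is done is not specified here.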

AoTBench

We propose AoTBench, the first dedicated benchmark to assess temporal direction sensitivity, a core component of robust video perception, through three distinct elements.

Qualitative examples

Question: What is the order of the letters on the table at the end?

Base LMM (Qwen2.5-VL-7B): storm

Base LMM + ArrowRL (ours): tmrso

Question: Which caption best describes this video?

Base LMM (Qwen2-VL-7B): the cards are shuffled before they are distributed.

Base LMM + ArrowRL (ours): the cards are distributed before they are shuffled.