A 4-minute silent video designed to supplement the paper
Overview
Image captioning: an isolated description for each image; no temporal context
Video captioning: a single description for the entire video clip; not temporally fine-grained
We propose the task of progress-aware video frame captioning, which aims to generate a sequence of captions that capture
the temporal action dynamics within a video.
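To make the task interface concrete, here is a minimal illustrative sketch; the example frames and captions are hypothetical placeholders, not outputs from the paper.

```python
# Illustrative input/output for progress-aware frame captioning
# (file names and caption text are hypothetical examples).
frames = ["frame_1.jpg", "frame_2.jpg", "frame_3.jpg"]  # ordered keyframes from one video

captions = [
    "A person grips the lid of a jar with both hands.",
    "They twist the lid, which begins to loosen.",
    "The lid comes off completely and is set on the counter.",
]

# One caption per frame; each caption builds on the previous ones,
# describing what is changing or continuing rather than giving an
# isolated description of the scene.
assert len(captions) == len(frames)
```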
Challenges
Issues of existing VLMs:
(1) Lack of temporal granularity (see row 2, Gemini's predictions for frames 2 & 3)
(2) Temporal hallucination (see row 1, GPT-4o's prediction for frame 2)
ProgressCaptioner Framework
ProgressCaptioner is designed in two stages. In Stage-I, we prepare frame pairs and generate corresponding caption
pairs using multiple VLMs. Each pair undergoes our progression detection and caption matching evaluations, which decide whether it is
selected for supervised fine-tuning or rejected; rejected pairs contribute to preference data used for preference learning.
Stage-I model training then proceeds on this collected data. In Stage-II, the trained Stage-I model, in conjunction with other VLMs,
labels frame sequences using a two-frame sliding window. These sequences are again assessed through progression detection and caption
matching and classified as selected or rejected. All data collected across both stages contributes to the final training of ProgressCaptioner.
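The sketch below outlines this two-stage data curation loop; it is a minimal illustration under stated assumptions, and every name (curate_stage1, curate_stage2, progression_ok, captions_ok, the .caption() call) is a hypothetical placeholder for the paper's components rather than actual ProgressCaptioner code.

```python
def curate_stage1(frame_pairs, vlms, progression_ok, captions_ok):
    """Stage-I curation: split VLM-generated caption pairs into
    selected (supervised fine-tuning) and rejected (preference) data.
    `progression_ok` and `captions_ok` stand in for the paper's
    progression detection and caption matching evaluations."""
    sft_data, pref_data = [], []
    for pair in frame_pairs:
        for vlm in vlms:
            caps = vlm.caption(pair)  # hypothetical captioning call
            if progression_ok(pair, caps) and captions_ok(pair, caps):
                sft_data.append((pair, caps))   # selected -> fine-tuning
            else:
                pref_data.append((pair, caps))  # rejected -> preference learning
    return sft_data, pref_data


def curate_stage2(sequences, stage1_model, vlms, progression_ok, captions_ok, window=2):
    """Stage-II curation: the trained Stage-I model, alongside other
    VLMs, labels longer frame sequences with a two-frame sliding
    window; the same checks classify each sequence as selected or rejected."""
    selected, rejected = [], []
    labelers = [stage1_model] + list(vlms)
    for seq in sequences:
        caps = [
            model.caption(seq[i : i + window])
            for model in labelers
            for i in range(len(seq) - window + 1)
        ]
        if progression_ok(seq, caps) and captions_ok(seq, caps):
            selected.append((seq, caps))
        else:
            rejected.append((seq, caps))
    return selected, rejected
```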
Qualitative predictions
Red text identifies inaccuracies in the generated captions,
while blue text highlights how our progress-aware captions build on prior content to clearly delineate what is changing or continuing.