A 4-minute silent video designed to supplement the paper
Overview
Image captioning: an isolated description for each image; no temporal context
Video captioning: a single description for the entire video clip; not temporally fine-grained
We propose the task of progress-aware video frame captioning, which aims to generate a sequence of captions that capture
the temporal action dynamics within a video.
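To make the task interface concrete, here is a minimal illustrative sketch; the example frames and captions are hypothetical placeholders, not outputs from the paper.

```python
# Illustrative input/output for progress-aware frame captioning
# (file names and caption text are hypothetical examples).
frames = ["frame_1.jpg", "frame_2.jpg", "frame_3.jpg"]  # ordered keyframes from one video

captions = [
    "A person grips the lid of a jar with both hands.",
    "They twist the lid, which begins to loosen.",
    "The lid comes off completely and is set on the counter.",
]

# One caption per frame; each caption builds on the previous ones,
# describing what is changing or continuing rather than giving an
# isolated description of the scene.
assert len(captions) == len(frames)
```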
Challenges
Issues of existing VLMs:
(1) Lack of temporal granularity (see row 2, Gemini's predictions for frames 2 & 3)
(2) Temporal hallucination (see row 1, GPT-4o's prediction for frame 2)
ProgressCaptioner Framework
ProgressCaptioner is designed in two stages. In Stage-I, we prepare frame pairs and generate corresponding caption
pairs using multiple VLMs. Each pair undergoes our progression detection and caption matching evaluations, which decide whether it is
selected for supervised fine-tuning or rejected; rejected pairs contribute to preference data used for preference learning.
Stage-I model training then proceeds on this collected data. In Stage-II, the trained Stage-I model, in conjunction with other VLMs,
labels frame sequences using a two-frame sliding window. These sequences are again assessed through progression detection and caption
matching and classified as selected or rejected. All data collected across both stages contributes to the final training of ProgressCaptioner.
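The sketch below outlines this two-stage data curation loop; it is a minimal illustration under stated assumptions, and every name (curate_stage1, curate_stage2, progression_ok, captions_ok, the .caption() call) is a hypothetical placeholder for the paper's components rather than actual ProgressCaptioner code.

```python
def curate_stage1(frame_pairs, vlms, progression_ok, captions_ok):
    """Stage-I curation: split VLM-generated caption pairs into
    selected (supervised fine-tuning) and rejected (preference) data.
    `progression_ok` and `captions_ok` stand in for the paper's
    progression detection and caption matching evaluations."""
    sft_data, pref_data = [], []
    for pair in frame_pairs:
        for vlm in vlms:
            caps = vlm.caption(pair)  # hypothetical captioning call
            if progression_ok(pair, caps) and captions_ok(pair, caps):
                sft_data.append((pair, caps))   # selected -> fine-tuning
            else:
                pref_data.append((pair, caps))  # rejected -> preference learning
    return sft_data, pref_data


def curate_stage2(sequences, stage1_model, vlms, progression_ok, captions_ok, window=2):
    """Stage-II curation: the trained Stage-I model, alongside other
    VLMs, labels longer frame sequences with a two-frame sliding
    window; the same checks classify each sequence as selected or rejected."""
    selected, rejected = [], []
    labelers = [stage1_model] + list(vlms)
    for seq in sequences:
        caps = [
            model.caption(seq[i : i + window])
            for model in labelers
            for i in range(len(seq) - window + 1)
        ]
        if progression_ok(seq, caps) and captions_ok(seq, caps):
            selected.append((seq, caps))
        else:
            rejected.append((seq, caps))
    return selected, rejected
```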
Qualitative predictions
Red text identifies inaccuracies in the generated captions,
while blue text highlights how our progress-aware captions build on prior content to clearly delineate what is changing or continuing.