|
We introduce the task of early mistake detection in video,
where the goal is to determine whether a keystep in a procedural activity
is performed correctly while observing as little of the streaming video as
possible. To tackle this problem, we propose a method comprising a mistake detector and a reinforcement learning policy. At each timestep, the
detector processes recently observed frames to estimate the keystep's correctness while anticipating future visual features, enabling reliable early
mistake estimates. Meanwhile, the policy aggregates the detector outputs
and visual observations over time and adaptively decides when to exit (i.e.,
stop processing incoming frames) while producing the final prediction.
Using diverse real-world procedural video datasets, we demonstrate that
our MistExit model achieves higher mistake detection accuracy than
state-of-the-art models while observing a smaller fraction of the video.
|
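The streaming loop described in the abstract (observe a frame, update the detector's estimate, let the policy decide whether to exit) can be illustrated with a minimal sketch. This is not the MistExit implementation: the names `early_exit_predict`, `mean_detector`, and `confident` are hypothetical, each frame is reduced to a scalar "mistake evidence" score, the detector is a simple windowed mean with no feature anticipation, and a fixed confidence margin stands in for the learned reinforcement learning policy.

```python
from typing import Callable, Iterable, List, Sequence, Tuple


def early_exit_predict(
    frames: Iterable[float],
    detector: Callable[[Sequence[float]], float],
    should_exit: Callable[[Sequence[float]], bool],
    window: int = 4,
) -> Tuple[bool, int]:
    """Consume frames one at a time and stop as soon as the policy exits.

    Returns (is_mistake, frames_observed). All components are toy
    stand-ins (assumptions), not the method from the paper.
    """
    recent: List[float] = []    # most recently observed frames
    history: List[float] = []   # detector outputs aggregated over time
    seen = 0
    for frame in frames:
        seen += 1
        recent.append(frame)
        recent = recent[-window:]          # keep only the recent window
        history.append(detector(recent))   # per-timestep mistake probability
        if should_exit(history):           # policy decides when to stop
            break
    if not history:
        raise ValueError("at least one frame is required")
    return history[-1] >= 0.5, seen


def mean_detector(recent: Sequence[float]) -> float:
    """Toy detector (assumption): mean of the recent evidence scores."""
    return sum(recent) / len(recent)


def confident(history: Sequence[float], margin: float = 0.3) -> bool:
    """Toy exit rule (assumption): exit once the latest estimate is far from 0.5."""
    return abs(history[-1] - 0.5) >= margin


if __name__ == "__main__":
    scores = [0.5, 0.6, 0.9, 1.0, 1.0, 1.0]
    is_mistake, observed = early_exit_predict(scores, mean_detector, confident)
    print(is_mistake, observed, observed / len(scores))
```

The two quantities the abstract trades off appear directly in the return value: the final prediction and the number of frames consumed before exiting. A learned policy would replace `confident` with a function of the full history of detector outputs and visual observations.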