Object state changes (OSC) in video reveal critical information about human and agent activity. However, existing methods are limited to coarse, global tasks:
(1) Frame-level state-change classification: classify each frame of a video as the initial, transitioning, or end state (e.g., ChangeIt, HowToChange). Lacks fine-grained object information ☹
(2) State-agnostic object segmentation: segment the entire object undergoing a state change (e.g., VOST, VSCOS). State-agnostic, i.e., no fine-grained information on which state the object is in ☹
We introduce the spatially-progressing object state change segmentation (SPOC) task. The goal is to segment, at the pixel level, the regions of an object that are actionable and those that are already transformed (top row). A diverse set of spatially-progressing segmentations for different state-change activities is shown in the bottom row.
We propose the first model to address this task, combining a VLM-based pseudo-labeling approach with state-change dynamics constraints, and introduce the novel WhereToChange benchmark built on in-the-wild Internet videos. We further demonstrate useful implications for tracking activity progress to benefit robotic agents.
WhereToChange Dataset
We introduce WhereToChange, a large-scale dataset featuring detailed intra-object state-change segmentations across a wide variety of objects and actions. We focus on 10 spatially-progressing state-change activities spanning 116 diverse objects and 232 unique OSCs. The training set comprises 17k video clips pseudo-labeled using a custom pipeline, and the human-annotated evaluation set consists of 1,162 video clips, samples of which are shown below.
chopping avocado
coating apple
crushing potato
grating butter
mashing banana
melting butter
mincing jalapeno
peeling pineapple
shredding chicken
slicing cabbage
SPOC Framework
SPOC training begins by generating large-scale pseudo-labeled training data. Given a video of a human performing a state-changing activity, we use off-the-shelf object detection, mask generation, and tracking models to extract a set of region mask proposals for each frame. We then use CLIP to match visual region embeddings against textual state-description embeddings, assigning each region its maximum-similarity pseudo-label. We refine the pseudo-labels with several important OSC dynamics constraints that emphasize the temporal progression of state-change transitions while respecting their causal dynamics. Finally, we train a video encoder-decoder transformer model to predict actionable/transformed labels for each region mask proposal.
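As a rough illustration of the similarity-matching step, the sketch below scores cropped region proposals against textual state descriptions with an off-the-shelf CLIP model and keeps the maximum-similarity label. The prompt wording, CLIP variant, and the simple forward-progress refinement shown here are illustrative assumptions, not the exact pipeline.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Hypothetical state descriptions for one activity (chopping avocado);
# the actual prompts used in the pipeline may differ.
STATE_TEXTS = ["a whole, uncut avocado",      # actionable
               "chopped pieces of avocado"]   # transformed

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def pseudo_label_regions(region_crops):
    """Assign each region crop (a PIL image) its max-similarity state label."""
    inputs = processor(text=STATE_TEXTS, images=region_crops,
                       return_tensors="pt", padding=True)
    sims = model(**inputs).logits_per_image        # (num_regions, num_states)
    labels = sims.argmax(dim=-1)                   # 0 = actionable, 1 = transformed
    confidences = sims.softmax(dim=-1).max(dim=-1).values
    return labels.tolist(), confidences.tolist()

def enforce_forward_progress(track_labels):
    """One plausible OSC dynamics constraint (our assumption): within a region
    track, a 'transformed' label should not revert to 'actionable' later on."""
    refined, transformed_seen = [], False
    for lab in track_labels:
        transformed_seen = transformed_seen or lab == 1
        refined.append(1 if transformed_seen else lab)
    return refined
```

The forward-progress helper reflects the causal intuition that spatially-progressing state changes are irreversible; the paper's actual refinement may impose richer temporal constraints.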
Qualitative Predictions
a) SPOC clearly distinguishes between actionable and transformed instances of the state-changing object (coated vs. uncoated apple), b) and generalizes to novel unseen objects (slicing lime). In contrast, baseline methods tend to be state-change agnostic and are less able to disambiguate object states. c) SPOC also generalizes well to the challenging out-of-distribution VOST dataset. d) Failure cases arise when a single mask proposal spans the entire object during transitions (one mask for the full lettuce), limiting the model's intra-object segmentation.
Activity Progress Monitoring
We show sample frames from a video sequence with progress curves generated by different methods, where vertical lines indicate the time-steps of the sampled frames. An ideal curve decreases monotonically and saturates upon reaching the end state. In contrast to goal-based representation learning methods such as VIP and LIV, OSC-based curves accurately track task progress, making them valuable for downstream applications like progress monitoring and robot learning.
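For concreteness, here is a minimal sketch of one way an OSC-based progress signal could be read off per-frame predictions: the fraction of the object area still labeled actionable, which should fall monotonically and saturate near zero at the end state. The function name, the normalization by total object area, and the use of boolean masks are our assumptions, not necessarily the exact quantity plotted above.

```python
import numpy as np

def osc_progress_curve(actionable_masks, transformed_masks, eps=1e-6):
    """Fraction of the object still labeled actionable in each frame.

    actionable_masks / transformed_masks: lists of boolean HxW arrays, one per
    frame, taken from the per-frame segmentation predictions. An ideal curve
    decreases monotonically and saturates near zero once the end state is reached.
    """
    curve = []
    for act, trans in zip(actionable_masks, transformed_masks):
        a, t = float(act.sum()), float(trans.sum())
        curve.append(a / (a + t + eps))
    return np.asarray(curve)
```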
BibTeX
@inproceedings{mandikal2025spoc,
  title={SPOC: Spatially-Progressing Object State Change Segmentation in Video},
  author={Mandikal, Priyanka and Nagarajan, Tushar and Stoken, Alex and Xue, Zihui and Grauman, Kristen},
  booktitle={ArXiv},
  year={2025}
}