SPOC: Spatially-Progressing Object State Change Segmentation in Video

arXiv, 2025


UT Austin

Video

A 5-minute silent video supplementing the paper



Overview


Object state changes (OSC) in video reveal critical information about human and agent activity. However, existing methods are limited to global, high-level tasks:

(1) Frame-level state-change classification: classify each frame of a video as the initial, transitioning, or end state (e.g., ChangeIt, HowToChange). Lacks fine-grained object information ☹

(2) State-agnostic object segmentation: segment the entire object undergoing the state change (e.g., VOST, VSCOS). State-agnostic, i.e., no fine-grained information about which state each part of the object is in ☹


We introduce the spatially-progressing object state change segmentation (SPOC) task. The goal is to segment, at the pixel level, the regions of an object that are still actionable and those that have already been transformed (top row). A diverse set of spatially-progressing segmentations for different state-change activities is shown in the bottom row.
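As an illustrative sketch (not taken from the paper), the snippet below shows how per-state predictions for this task could be scored with a per-pixel IoU over the actionable and transformed regions; the function names and the averaging convention are assumptions.

```python
import numpy as np

def per_state_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """IoU between predicted and ground-truth binary masks for one state."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:  # state absent in both prediction and ground truth
        return 1.0
    return np.logical_and(pred, gt).sum() / union

def spoc_frame_score(pred_actionable, pred_transformed,
                     gt_actionable, gt_transformed) -> float:
    """Average the per-state IoUs for one frame (hypothetical convention)."""
    return 0.5 * (per_state_iou(pred_actionable, gt_actionable)
                  + per_state_iou(pred_transformed, gt_transformed))
```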


We propose the first model to address this task, designing a VLM-based pseudo-labeling approach, state-change dynamics constraints, and a novel WhereToChange benchmark built on in-the-wild Internet videos. We further demonstrate useful implications for tracking activity progress to benefit robotic agents.


WhereToChange Dataset


We introduce WhereToChange, a large-scale dataset featuring detailed intra-object state-change segmentations across a wide variety of objects and actions. We focus on 10 spatially-progressing state-change activities spanning 116 diverse objects and 232 unique OSCs. The training set comprises 17k video clips pseudo-labeled using a custom pipeline. The human-annotated evaluation set consists of 1,162 video clips, samples of which are shown below.





SPOC Framework



SPOC training begins by generating large-scale pseudo-labeled training data. Given a video of a human performing a state-changing activity, we use off-the-shelf object detection, mask generation, and tracking models to extract a set of region mask proposals for each frame. We then use CLIP to match visual region embeddings against textual state-description embeddings via similarity scores, assigning each region its max-similarity pseudo-label. The pseudo-labels are refined with several OSC dynamics constraints that emphasize the temporal progression of state-change transitions while respecting their causal dynamics. Finally, we train a video encoder-decoder transformer to predict actionable/transformed labels for each region mask proposal.
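A minimal sketch of the max-similarity pseudo-labeling step, using OpenAI's CLIP package; the state-description prompts, crop-based region encoding, and function names are assumptions rather than the paper's exact pipeline.

```python
import clip  # https://github.com/openai/CLIP
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical textual state descriptions for one OSC (e.g., peeling an apple).
state_prompts = ["an unpeeled apple", "a peeled apple"]  # actionable, transformed
with torch.no_grad():
    text_feats = model.encode_text(clip.tokenize(state_prompts).to(device))
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

def pseudo_label_regions(frame: Image.Image, region_boxes):
    """Assign each region proposal its max-similarity state label
    (0 = actionable, 1 = transformed)."""
    crops = torch.stack([preprocess(frame.crop(box)) for box in region_boxes]).to(device)
    with torch.no_grad():
        img_feats = model.encode_image(crops)
        img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
        sims = img_feats @ text_feats.T  # (num_regions, num_states)
    return sims.argmax(dim=-1).tolist()
```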



Qualitative Predictions




a) SPOC clearly distinguishes between actionable and transformed instances of the state-changing object (uncoated vs. coated apple), and b) generalizes to novel, unseen objects (slicing a lime). In contrast, baseline methods tend to be state-change agnostic, with a reduced ability to disambiguate object states. c) SPOC also generalizes well to the challenging out-of-distribution VOST dataset. d) Failure cases arise when a single mask proposal spans the entire object during a transition (one mask for the whole lettuce), limiting the model's intra-object segmentation.



Activity Progress Monitoring




We show sample frames from a video sequence with progress curves generated by different methods, where vertical lines indicate the time-steps of sampled frames. Ideal curves should decrease monotonically, and saturate upon reaching the end state. In contrast to goal-based representation learning methods such as VIP and LIV, OSC-based curves accurately track task progress, making them valuable for downstream applications like progress monitoring and robot learning.
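A minimal sketch of how an OSC-based progress curve could be computed from the predicted masks, assuming per-frame binary masks for the actionable and transformed regions; the exact progress definition used in the paper may differ.

```python
import numpy as np

def osc_progress_curve(actionable_masks, transformed_masks) -> np.ndarray:
    """Per-frame fraction of the object that is still actionable.

    Both arguments are lists of binary (H x W) masks over the video. The
    resulting curve should decrease monotonically and saturate near zero
    once the end state is reached.
    """
    curve = []
    for act, trans in zip(actionable_masks, transformed_masks):
        act_px = act.astype(bool).sum()
        total = act_px + trans.astype(bool).sum()
        curve.append(act_px / total if total > 0 else 0.0)
    return np.array(curve)
```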

BibTeX

@inproceedings{mandikal2025spoc,
  title={SPOC: Spatially-Progressing Object State Change Segmentation in Video},
  author={Mandikal, Priyanka and Nagarajan, Tushar and Stoken, Alex and Xue, Zihui and Grauman, Kristen},
  booktitle={ArXiv},
  year={2025}
}