Object state changes (OSC) in video reveal critical information about human and agent activity. However, existing methods are limited to coarse, global tasks:
(1) Frame-level state-change classification: classify each frame of a video as the initial, transitioning, or end state (e.g., ChangeIt, HowToChange). Lacks fine-grained object information ☹
(2) State-agnostic object segmentation: segment the entire object undergoing a state change (e.g., VOST, VSCOS). State-agnostic, i.e., no fine-grained information on which state the object is in ☹
We introduce the spatially-progressing object state change segmentation (SPOC) task. The goal is to segment, at the pixel level, the regions of an object that are actionable and those that are already transformed (top row). A diverse set of spatially-progressing segmentations for different state-change activities is shown in the bottom row.
We propose the first model to address this task, combining a VLM-based pseudo-labeling approach with state-change dynamics constraints, and introduce the novel WhereToChange benchmark built on in-the-wild Internet videos. We further demonstrate useful implications for tracking activity progress to benefit robotic agents.
WhereToChange Dataset
We introduce WhereToChange, a large-scale dataset featuring detailed intra-object state-change segmentations across a wide variety of objects and actions. We focus on 10 spatially-progressing state-change activities spanning 116 diverse objects and 232 unique OSCs. The training set comprises 17k video clips pseudo-labeled using a custom pipeline, and the human-annotated evaluation set consists of 1,162 video clips, samples of which are shown below.
chopping avocado
coating apple
crushing potato
grating butter
mashing banana
melting butter
mincing jalapeno
peeling pineapple
shredding chicken
slicing cabbage
SPOC Framework
SPOC training begins by generating large-scale pseudo-labeled training data. Given a video of a human performing a state-changing activity, we use off-the-shelf object detection, mask generation, and tracking models to extract a set of region mask proposals for each frame. We then use CLIP to match visual region embeddings against textual state-description embeddings, assigning each region its maximum-similarity pseudo-label. We refine the pseudo-labels with several important OSC dynamics constraints that emphasize the temporal progression of state-change transitions while respecting their causal dynamics. Finally, we train a video encoder-decoder transformer model to predict actionable/transformed labels for each region mask proposal.
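As a rough illustration of the similarity-matching step, the sketch below scores cropped region proposals against textual state descriptions with an off-the-shelf CLIP model and keeps the maximum-similarity label. The prompt wording, CLIP variant, and the simple forward-progress refinement shown here are illustrative assumptions, not the exact pipeline.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Hypothetical state descriptions for one activity (chopping avocado);
# the actual prompts used in the pipeline may differ.
STATE_TEXTS = ["a whole, uncut avocado",      # actionable
               "chopped pieces of avocado"]   # transformed

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def pseudo_label_regions(region_crops):
    """Assign each region crop (a PIL image) its max-similarity state label."""
    inputs = processor(text=STATE_TEXTS, images=region_crops,
                       return_tensors="pt", padding=True)
    sims = model(**inputs).logits_per_image        # (num_regions, num_states)
    labels = sims.argmax(dim=-1)                   # 0 = actionable, 1 = transformed
    confidences = sims.softmax(dim=-1).max(dim=-1).values
    return labels.tolist(), confidences.tolist()

def enforce_forward_progress(track_labels):
    """One plausible OSC dynamics constraint (our assumption): within a region
    track, a 'transformed' label should not revert to 'actionable' later on."""
    refined, transformed_seen = [], False
    for lab in track_labels:
        transformed_seen = transformed_seen or lab == 1
        refined.append(1 if transformed_seen else lab)
    return refined
```

The forward-progress helper reflects the causal intuition that spatially-progressing state changes are irreversible; the paper's actual refinement may impose richer temporal constraints.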
Qualitative Predictions
a) SPOC clearly distinguishes between actionable and transformed instances of the state-changing object (coated vs. uncoated apple), b) and generalizes to novel unseen objects (slicing lime). In contrast, baseline methods tend to be state-change agnostic and are less able to disambiguate object states. c) SPOC also generalizes well to the challenging out-of-distribution VOST dataset. d) Failure cases arise when a single mask proposal spans the entire object during transitions (one mask for the full lettuce), limiting the model's intra-object segmentation.
Activity Progress Monitoring
We show sample frames from a video sequence with progress curves generated by different methods, where vertical lines indicate the time-steps of the sampled frames. An ideal curve decreases monotonically and saturates upon reaching the end state. In contrast to goal-based representation learning methods such as VIP and LIV, OSC-based curves accurately track task progress, making them valuable for downstream applications like progress monitoring and robot learning.
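For concreteness, here is a minimal sketch of one way an OSC-based progress signal could be read off per-frame predictions: the fraction of the object area still labeled actionable, which should fall monotonically and saturate near zero at the end state. The function name, the normalization by total object area, and the use of boolean masks are our assumptions, not necessarily the exact quantity plotted above.

```python
import numpy as np

def osc_progress_curve(actionable_masks, transformed_masks, eps=1e-6):
    """Fraction of the object still labeled actionable in each frame.

    actionable_masks / transformed_masks: lists of boolean HxW arrays, one per
    frame, taken from the per-frame segmentation predictions. An ideal curve
    decreases monotonically and saturates near zero once the end state is reached.
    """
    curve = []
    for act, trans in zip(actionable_masks, transformed_masks):
        a, t = float(act.sum()), float(trans.sum())
        curve.append(a / (a + t + eps))
    return np.asarray(curve)
```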
BibTeX
@inproceedings{mandikal2025spoc,
  title={SPOC: Spatially-Progressing Object State Change Segmentation in Video},
  author={Mandikal, Priyanka and Nagarajan, Tushar and Stoken, Alex and Xue, Zihui and Grauman, Kristen},
  booktitle={ArXiv},
  year={2025}
}