Mash, Spread, Slice! Learning to Manipulate
Object States via Visual Spatial Progress

Preprint, 2025


Demo videos 1–8

Video

A 5-minute video (with audio) supplementing the paper.



Overview


The dominant paradigm in robotic manipulation today focuses heavily on rigid-body motion (e.g., pick-and-place, open-and-close, pour-and-rotate). However, a wide range of real-world human manipulation involves object state changes—such as mashing, spreading, or slicing—where an object's physical and visual state evolves progressively over time, often irreversibly.

We introduce a unified vision-based approach that captures these fine-grained, spatially progressing transformations, and we demonstrate how it guides real-robot manipulation for this family of tasks.




SPARTA Framework



At each episode step, our policy takes the current and past SPOC visual-affordance (segmentation) maps as input, along with the robot arm's proprioception, and predicts a displacement action for the arm's end-effector.
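
To make this interface concrete, below is a minimal PyTorch sketch of a policy with the described inputs and outputs. The module name (SpartaPolicy), layer sizes, number of stacked past maps, and the proprioception/action dimensions are illustrative assumptions, not the paper's actual architecture.

# Minimal sketch of the policy interface described above (PyTorch).
# All names, sizes, and the number of stacked maps are assumptions.
import torch
import torch.nn as nn

class SpartaPolicy(nn.Module):  # hypothetical name
    def __init__(self, n_maps=4, proprio_dim=7, action_dim=3):
        super().__init__()
        # Small CNN encoder over the stacked current + past affordance maps.
        self.encoder = nn.Sequential(
            nn.Conv2d(n_maps, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Fuse visual features with arm proprioception; output an
        # end-effector displacement.
        self.head = nn.Sequential(
            nn.Linear(32 + proprio_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim),
        )

    def forward(self, maps, proprio):
        # maps: (B, n_maps, H, W) current + past segmentation maps
        # proprio: (B, proprio_dim) arm state
        feat = self.encoder(maps)
        return self.head(torch.cat([feat, proprio], dim=-1))

# Example usage with dummy tensors:
policy = SpartaPolicy()
action = policy(torch.rand(1, 4, 64, 64), torch.rand(1, 7))  # (1, 3) displacement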

SPARTA supports two robot policy variants:

(a) SPARTA-L (Learning): a reinforcement learning agent trained using a dense reward that measures the progressive change of object regions from actionable (red) to transformed (green);

(b) SPARTA-G (Greedy): selects among 8 discrete movement directions based on the local density of actionable pixels, producing a fast, greedy policy guided by visual progress (both variants are sketched below).
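
To illustrate both variants, here is a rough NumPy sketch: a dense reward computed as the per-step increase in the fraction of transformed pixels, and a greedy action that picks whichever of the 8 directions has the most actionable pixels near the end-effector. The label encoding, window size, and neighborhood logic are assumptions for illustration, not SPARTA's exact formulation.

# Illustrative sketch of the dense reward (a) and greedy policy (b).
# Label encoding and neighborhood logic are assumptions.
import numpy as np

def dense_reward(prev_map, curr_map):
    # Maps are integer label images: 1 = actionable (red),
    # 2 = transformed (green). Reward = increase in transformed fraction.
    return float((curr_map == 2).mean() - (prev_map == 2).mean())

# The 8 discrete movement directions (unit steps in image coordinates).
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1),
              ( 0, -1),          ( 0, 1),
              ( 1, -1), ( 1, 0), ( 1, 1)]

def greedy_action(curr_map, ee_pixel, radius=10):
    # Pick the direction whose local window (offset from the
    # end-effector pixel) contains the most actionable pixels.
    h, w = curr_map.shape
    best_dir, best_count = (0, 0), -1
    for dy, dx in DIRECTIONS:
        cy = int(np.clip(ee_pixel[0] + dy * radius, 0, h - 1))
        cx = int(np.clip(ee_pixel[1] + dx * radius, 0, w - 1))
        window = curr_map[max(0, cy - radius): cy + radius + 1,
                          max(0, cx - radius): cx + radius + 1]
        count = int((window == 1).sum())
        if count > best_count:
            best_dir, best_count = (dy, dx), count
    return best_dir  # e.g., (-1, 0) means move "up" in the image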



Tasks


SPARTA is tested on two different object-transformation tasks across 10 diverse real-world objects.




Results




SPARTA decisively outperforms sparse and dense goal-conditioned baselines, with the trained RL policies surpassing greedy control on complex, fine-precision tasks.



Reward Curves


Below we show reward curves for the bread-spreading task.



(a) Cumulative episode reward curves: SPARTA produces smooth, incremental rewards aligned with visual progress, while LIV rewards remain unstable throughout the episode, offering poor guidance.

(b) Training curves: stable, dense feedback drives sample-efficient learning, with SPARTA rapidly improving while SPARSE and LIV stagnate.

BibTeX

@article{mandikal2025sparta,
  title={Mash, Spread, Slice! Learning to Manipulate Object States via Visual Spatial Progress},
  author={Mandikal, Priyanka and Hu, Jiaheng and Dass, Shivin and Majumder, Sagnik and Martín-Martín, Roberto and Grauman, Kristen},
  journal={arXiv preprint},
  year={2025}
}