A 5-minute video (with audio) supplementing the paper
Overview
The dominant paradigm in robotic manipulation today focuses heavily on rigid-body motion (e.g., pick-and-place, opening and closing, pouring and rotating). However, a wide range of real-world human manipulation involves object state changes—such as mashing, spreading, or slicing—where an object's physical and visual state evolves progressively over time, often irreversibly.
We introduce a unified vision-based approach to capture these fine-grained, spatially-progressing transformations, successfully demonstrating how to guide real robot manipulation for this family of tasks.
SPARTA Framework
At each episode step, our policy takes the current and past SPOC visual-affordance (segmentation) maps as inputs, along with the robot arm's proprioception data, and predicts a displacement action for the arm's end-effector.
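To make this input/output interface concrete, here is a minimal Python sketch of the observation construction and action prediction described above. The function names, observation keys, map shapes, and the placeholder policy are illustrative assumptions, not the actual SPARTA implementation.

import numpy as np

def build_observation(current_map, past_maps, proprioception):
    """Stack the current and past affordance (segmentation) maps with
    end-effector proprioception into a single policy observation.
    Shapes and dict keys here are assumptions for illustration."""
    maps = np.stack([current_map] + list(past_maps), axis=0)  # (T, H, W)
    return {
        "affordance_maps": maps.astype(np.float32),
        "proprio": np.asarray(proprioception, dtype=np.float32),
    }

def policy(obs):
    """Placeholder policy: returns a 3-D end-effector displacement.
    A learned policy (SPARTA-L) would replace this with a network."""
    return np.zeros(3, dtype=np.float32)

# Example usage with dummy data
H, W = 64, 64
obs = build_observation(
    np.random.rand(H, W),                       # current affordance map
    [np.random.rand(H, W) for _ in range(3)],   # past affordance maps
    proprioception=np.zeros(7),                 # e.g., joint angles
)
delta_ee = policy(obs)  # displacement command for the arm's end-effector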
SPARTA supports two robot policy variants:
(a) SPARTA-L (Learning): a reinforcement learning agent trained using a dense reward that measures the progressive change of object regions from actionable (red) to transformed (green);
(b) SPARTA-G (Greedy): selects among 8 discrete directions based on the local density of actionable pixels, producing a fast, greedy policy guided by visual progress.
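The two ingredients above—the dense progress reward and the 8-direction greedy rule—can be sketched as follows. This is a minimal illustration, assuming a per-pixel mask in which 1 marks actionable (red) regions and 2 marks transformed (green) regions; the reward scaling and window size are illustrative choices, not taken from the paper.

import numpy as np

def dense_progress_reward(prev_mask, curr_mask):
    """Reward proportional to the newly transformed fraction of the object:
    pixels that were actionable before and are transformed now.
    Assumed mask convention: 0 = background, 1 = actionable, 2 = transformed."""
    newly_done = np.logical_and(prev_mask == 1, curr_mask == 2)
    object_px = np.count_nonzero(prev_mask > 0)
    return newly_done.sum() / max(object_px, 1)

# 8 unit directions (dx, dy) in the image plane
DIRECTIONS = [(1, 0), (1, 1), (0, 1), (-1, 1),
              (-1, 0), (-1, -1), (0, -1), (1, -1)]

def greedy_direction(mask, ee_px, window=15):
    """Pick the direction whose neighboring window contains the most
    actionable pixels; ee_px = (x, y) end-effector pixel location."""
    h, w = mask.shape
    best_dir, best_count = DIRECTIONS[0], -1
    for dx, dy in DIRECTIONS:
        cx = int(np.clip(ee_px[0] + dx * window, 0, w - 1))
        cy = int(np.clip(ee_px[1] + dy * window, 0, h - 1))
        x0, x1 = max(cx - window, 0), min(cx + window, w)
        y0, y1 = max(cy - window, 0), min(cy + window, h)
        count = np.count_nonzero(mask[y0:y1, x0:x1] == 1)
        if count > best_count:
            best_dir, best_count = (dx, dy), count
    return best_dir

In this sketch, SPARTA-G simply moves the end-effector along greedy_direction at every step, while SPARTA-L optimizes dense_progress_reward with reinforcement learning.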
Tasks
SPARTA is tested on two different object transformation tasks across 10 diverse real-world objects.
Results
SPARTA significantly outperforms sparse and dense goal-conditioned baselines, with the trained RL policies surpassing greedy control on complex, fine-precision tasks.
Reward Curves
Below we show reward curves for the bread-spreading task:
a) Cumulative episode reward curves: SPARTA produces smooth, incremental rewards aligned with visual progress, while LIV rewards remain unstable throughout the episode, offering poor guidance.
b) Training curves: stable, dense feedback drives sample-efficient learning, with SPARTA rapidly improving while SPARSE and LIV stagnate.
BibTeX
@inproceedings{mandikal2025sparta,
  title={Mash, Spread, Slice! Learning to Manipulate Object States via Visual Spatial Progress},
  author={Mandikal, Priyanka and Hu, Jiaheng and Dass, Shivin and Majumder, Sagnik and Martín-Martín, Roberto and Grauman, Kristen},
  booktitle={ArXiv},
  year={2025}
}