FusionSeg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos

Suyog Jain

Bo Xiong

Kristen Grauman


We propose an end-to-end learning framework for segmenting generic objects in videos. Our method learns to combine appearance and motion information to produce pixel-level segmentation masks for all prominent objects in videos. We formulate this task as a structured prediction problem and design a two-stream fully convolutional neural network that fuses motion and appearance in a unified framework. Since large-scale video datasets with pixel-level segmentations are hard to obtain, we show how to bootstrap weakly annotated videos together with existing image recognition datasets for training. In experiments on three challenging video segmentation benchmarks, our method substantially improves on the state of the art for segmenting generic (unseen) objects.
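To make the two-stream idea concrete, the following is a minimal sketch of fusing per-pixel scores from an appearance stream and a motion stream into a single foreground mask. It is an illustration only, not the network described above: the fusion rule (a weighted sum of logits followed by a sigmoid), the function names `fuse_streams` and `to_mask`, and the fixed weights are all assumptions made for this example. In a trained model the fusion parameters would be learned jointly with both streams.

```python
import math


def sigmoid(x):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_streams(appearance_logits, motion_logits, w_app=1.0, w_mot=1.0, bias=0.0):
    """Fuse two per-pixel logit maps into foreground probabilities.

    Both inputs are lists of rows with identical shape. The weighted-sum
    rule and the default weights are hypothetical placeholders, not the
    fusion used in the paper.
    """
    if len(appearance_logits) != len(motion_logits):
        raise ValueError("streams must have the same number of rows")
    fused = []
    for a_row, m_row in zip(appearance_logits, motion_logits):
        if len(a_row) != len(m_row):
            raise ValueError("streams must have the same number of columns")
        fused.append(
            [sigmoid(w_app * a + w_mot * m + bias) for a, m in zip(a_row, m_row)]
        )
    return fused


def to_mask(probabilities, threshold=0.5):
    """Threshold per-pixel probabilities into a binary object mask."""
    return [[1 if p > threshold else 0 for p in row] for row in probabilities]


# Toy 2x2 frame: each stream votes per pixel (positive = object, negative = background).
appearance = [[2.0, -1.0], [0.5, -3.0]]
motion = [[1.0, 2.0], [-2.0, -1.0]]
mask = to_mask(fuse_streams(appearance, motion))
```

In this toy frame the top-right pixel is labeled as object because strong motion evidence outweighs weak appearance evidence, while the bottom-left pixel is labeled as background for the opposite reason. That is the kind of complementary behavior a fused model is meant to capture.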


Additional Video Segmentation Results


This research is supported in part by ONR YIP N00014-12-1-0754, an AWS Machine Learning Research Award, and the DARPA Lifelong Learning Machines project. This material is based on research sponsored by the Air Force Research Laboratory and DARPA under agreement number FA8750-18-2-0126. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The authors thank the reviewers for their suggestions.

The code and pre-trained models are freely available for research and academic purposes. However, the method is patent pending, so please contact us regarding any commercial use.