# Abstract

Understanding how images of objects and scenes behave in response to specific ego-motions is a crucial aspect of proper visual development, yet existing visual learning methods are conspicuously disconnected from the physical source of their images. We propose to exploit proprioceptive motor signals to provide unsupervised regularization in convolutional neural networks to learn visual representations from egocentric video. Specifically, we enforce that our learned features exhibit equivariance, i.e., they respond predictably to transformations associated with distinct ego-motions. With three datasets, we show that our unsupervised feature learning approach significantly outperforms previous approaches on visual recognition and next-best-view prediction tasks. In the most challenging test, we show that features learned from video captured on an autonomous driving platform improve large-scale scene recognition in static images from a disjoint domain.
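The equivariance idea can be illustrated with a minimal sketch: for a pair of features extracted before and after an ego-motion of type g, the post-motion feature should be predictable from the pre-motion feature via a motion-specific linear map M_g. The loss below is an illustrative squared-error version of that constraint, not the paper's exact training objective; the function and variable names are ours.

```python
import numpy as np

def equivariance_loss(z_before, z_after, M_g):
    """Illustrative equivariance penalty.

    z_before : (n, d) features of frames before the ego-motion
    z_after  : (n, d) features of the same scenes after the ego-motion
    M_g      : (d, d) learned linear map associated with motion type g

    Applies M_g to each pre-motion feature and penalizes the squared
    distance to the observed post-motion feature.
    """
    predicted = z_before @ M_g.T       # (n, d) predicted post-motion features
    residual = predicted - z_after     # (n, d) prediction error
    return float(np.mean(np.sum(residual ** 2, axis=1)))
```

In training, one such map M_g would be learned jointly with the features for each discretized ego-motion class (e.g., turn left, move forward), so the loss is driven to zero only when the feature space transforms predictably under each motion.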

# Bibtex

```bibtex
@inproceedings{jayaraman-iccv2015,
  author    = {D. Jayaraman and K. Grauman},
  title     = {{Learning image representations tied to egomotion}},
  booktitle = {ICCV},
  year      = {2015}
}
```



# 227x227 image models

Since the ICCV paper, we have scaled up the models to handle 227x227 images. Weights and prototxt files are shared below.

More details are available in our IJCV version here. In addition to the results in the paper, below are results benchmarking the shared model on several recognition tasks.

### Purely unsupervised features for recognition tasks (pool5 features + linear classifier)

| Method | PASCAL VOC 07 | PASCAL VOC 12 | MIT Indoor Scenes |
| --- | --- | --- | --- |
| Egomotion pretraining (Agrawal et al., ICCV 15) | 0.25 | 0.37 | 0.38 |
| Egomotion pretraining (Ours) | 0.26 | 0.39 | 0.40 |
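The "pool5 features + linear classifier" protocol above can be sketched as follows. For AlexNet-style networks at 227x227 input, pool5 outputs are 256x6x6 = 9216-dimensional when flattened; the sketch below substitutes low-dimensional synthetic features for the network's outputs, and the ridge-regularized least-squares classifier is an illustrative stand-in for whatever linear classifier was actually used in the evaluation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for frozen pool5 features: three classes of
# Gaussian clusters in a 64-dim space (real pool5 would be 9216-dim).
d, n_classes, n_per_class = 64, 3, 50
centers = rng.normal(size=(n_classes, d)) * 3.0
X = np.vstack([centers[c] + rng.normal(size=(n_per_class, d))
               for c in range(n_classes)])
y = np.repeat(np.arange(n_classes), n_per_class)

# Train a linear classifier on the frozen features: one-hot targets,
# ridge-regularized least squares solved in closed form.
Y = np.eye(n_classes)[y]
W = np.linalg.solve(X.T @ X + 1e-2 * np.eye(d), X.T @ Y)

# Predict by taking the argmax over the linear scores.
pred = (X @ W).argmax(axis=1)
acc = float((pred == y).mean())
```

The point of this protocol is that the network's weights stay frozen; only the linear classifier on top is trained, so the scores reflect the quality of the unsupervised features themselves.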

### Finetuning unsupervised features for recognition tasks

| Method | PASCAL VOC 07 | PASCAL VOC 12 |
| --- | --- | --- |
| Egomotion pretraining (Agrawal et al., ICCV 15) | 42.4 | 40.2 |
| Egomotion pretraining (Ours) | 41.7 | 40.7 |