Themes


In the only context in which we have observed intelligence, namely the biological context, the emergence of intelligence and superior visual abilities in different families of animals has each time been closely tied to the emergence of the ability to move and act in their environments [Moravec'84]. Cognitive scientists have also empirically verified that self-generated motions are critical to the development of visual perceptual skills in animals [Held'63], and a sizeable research program in cognitive science studies "embodied cognition", the hypothesis that cognition is strongly influenced by aspects of an agent's body beyond the brain itself [Wilson'02].

Progress in standard visual recognition tasks in the last few years has been largely fueled by access to today's largest painstakingly curated and hand-labeled datasets. To create these datasets, images sampled independently from the web are manually assigned to one of several categories by thousands of human workers. This cumbersome process may be replaceable. Specifically, an agent continuously acting in, moving through, and monitoring its environment has available to it many avenues of knowledge, going well beyond what can be learned from observing only the orderless, i.i.d. "bags of images" with category labels that make up today's standard datasets. For instance, such an agent may exploit ordered image sequences, i.e., its observed video stream, with freely available image-to-image or image-to-other-sensor relationships. Such an agent can also act on its surroundings and use the observed results of those actions as a form of self-acquired supervision. For instance, it may tap an object to determine its material properties ("action"), walk around it to observe a less ambiguous view of it ("motion"), or learn natural-world physics by dropping, pushing or throwing objects and learning to anticipate their behavior ("anticipation"). These forms of supervision may allow agents to discover knowledge not available through the standard supervised paradigm.
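To make the idea of self-acquired supervision concrete, here is a minimal, hypothetical sketch (in PyTorch; not drawn from any particular paper at this workshop) of one such training signal: a small network takes a pair of consecutive video frames and predicts the agent's own motion between them, so the "label" comes for free from the agent's motor commands rather than from human annotators. All module names, layer sizes, and the toy data below are illustrative assumptions.

# Hypothetical sketch of self-supervision from ego-motion: predict the agent's
# own (discretized) motion from a pair of consecutive frames. The supervision
# comes from the agent's recorded motor signals, not from human labels.
import torch
import torch.nn as nn

class EgoMotionNet(nn.Module):
    def __init__(self, num_motion_classes=8):
        super().__init__()
        # Shared convolutional encoder applied to each frame.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Classifier over the concatenated pair of frame embeddings.
        self.head = nn.Linear(2 * 32, num_motion_classes)

    def forward(self, frame_t, frame_t1):
        z_t, z_t1 = self.encoder(frame_t), self.encoder(frame_t1)
        return self.head(torch.cat([z_t, z_t1], dim=1))

# One toy training step on random tensors standing in for
# (frame_t, frame_t+1, ego-motion) triples logged by the agent.
model = EgoMotionNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
frames_t = torch.randn(4, 3, 64, 64)
frames_t1 = torch.randn(4, 3, 64, 64)
actions = torch.randint(0, 8, (4,))   # e.g. "turn left", "move forward", ...

logits = model(frames_t, frames_t1)
loss = nn.functional.cross_entropy(logits, actions)
optimizer.zero_grad()
loss.backward()
optimizer.step()

The same pattern extends to the other signals mentioned above: the supervision target is simply whatever the agent itself recorded, whether a motor command, a subsequent frame, or another sensor reading.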

Moreover, alleviating the non-scalable curation and labeling requirements involved in compiling today's standard datasets is a worthwhile goal in itself, so that all or most visual learning may happen without manual supervision. Replacing manual supervision with alternative forms of supervision in this manner would have many advantages: (1) it would open up the possibility of exploiting much larger datasets for visual learning, potentially driving even better-performing computer vision systems for conventional tasks, since the evidence of the last few years suggests that visual learning benefits from ever-higher-capacity models trained on ever-larger datasets; (2) it would enable easy development of visual applications for narrower, non-standard domains for which large labeled datasets neither currently exist nor are likely to be curated in the future, such as, say, vision for an interplanetary rover; and (3) compared to standard supervised learning, it more closely resembles what we know about the avenues of learning available to the biological visual systems we hope to ultimately emulate in performance, even if not in design.

In this workshop, we aim to focus on how action, motion and anticipation may all offer viable and important means for visual learning. Several closely intertwined emerging research directions touching on our theme are being concurrently and largely independently explored by researchers in the vision, machine learning, and robotics communities around the world. A major goal of our workshop will be to bring these researchers together, provide a forum to foster collaborations and exchange of ideas, and ultimately, to help advance research along these directions.



Call for Abstracts

We invited 2-page abstracts describing relevant work that has been recently published, is in progress, or is to be presented at ECCV. Review of the submissions was double-blind. While there will be no formal proceedings, accepted abstracts are posted here. Authors of accepted abstracts will present their work in a poster session at the workshop and as short spotlight talks. We encouraged submissions not only from the vision community, but also from machine learning, robotics, cognitive science and other related disciplines. A representative, but not exhaustive, list of topics of interest is shown below:

The LaTeX template for submission is posted here. Abstracts must be no longer than 2 pages (including references), and must be submitted on or before August 3, 11.59 p.m. US Central Time. Submissions were made via the workshop CMT. (Submission is now closed.)



Important Dates

August 3: Abstracts due (closed)
August 19: Reviews due
August 29: Decision notifications to authors
September 12: Final abstracts due
October 9: Workshop (tentatively 9 a.m. to 5 p.m.)



Speakers

We have invited researchers from across different disciplines to bring their perspectives to the workshop. Here is our speaker list:

Abhinav Gupta (CMU)
Jitendra Malik (UC Berkeley)
Honglak Lee (U Michigan)
Ali Farhadi (UW)
Jeannette Bohg (MPI Tübingen)
Nando de Freitas (Oxford)


Program

Recorded videos: morning session, panel discussion (we forgot the afternoon session, sorry!).

The workshop will be held at Oudemanhuispoort. Here is our program.

09.00 a.m. Opening remarks
09.15 a.m. Invited Speaker 1: Jitendra Malik: "Acquiring mental models through perception, simulation and action."
09.45 a.m. Invited Speaker 2: Nando de Freitas: "Make learning off-policy again! Sample efficient actor-critic with experience replay."
10.15 a.m. Invited Speaker 3: Jeannette Bohg: "Interactive Perception for Perceptive Manipulation - or, how putting perception on a physical system changes everything."
10.45 a.m. Coffee break
11.00 a.m. Poster spotlights
11.45 a.m. Lunch break
01.00 p.m. Poster session
02.00 p.m. Invited Speaker 4: Abhinav Gupta: "Scaling Self-supervision: From one task, one robot to multiple tasks and robots"
02.30 p.m. Invited Speaker 5: Honglak Lee: "Learning Disentangled Representations for Prediction and Anticipation"
03.00 p.m. Invited Speaker 6: Ali Farhadi: "Towards Crowifying Vision"
03.30 p.m. Coffee break
03.45 p.m. Panel discussion with invited speakers


Accepted Papers

The following papers will be presented at the workshop as spotlights and posters:



Program committee

We thank our program committee for kindly donating their time and expertise towards reviewing workshop submissions:

Andrew Owens, MIT
Aron Monszpart, UCL
Bharath Sankaran, USC
Carl Vondrick, MIT
Chelsea Finn, UC Berkeley
Cornelia Fermuller, UMD
David Vernon, University of Genoa
Honglak Lee, University of Michigan
Jiajun Wu, MIT
John Tsotsos, York University
Katerina Fragkiadaki, Google Research
Lerrel Pinto, CMU
Lorenzo Natale, Italian Institute of Technology
Pulkit Agrawal, UC Berkeley
Raia Hadsell, Google DeepMind
Ruzena Bajcsy, UC Berkeley
Swapnaa Jayaraman, Indiana University
Xiaolong Wang, CMU
Yiannis Aloimonos, UMD
Serena Yeung, Stanford
Jeff Donahue, UC Berkeley
Omar Florez, Intel Labs


Organizers

Dinesh Jayaraman (UT Austin)
Kristen Grauman (UT Austin)
Sergey Levine (UW)



Sponsors

Please email Dinesh (dineshj [at] cs [dot] utexas [dot] edu) for information about sponsorship.