CS381V: Visual Recognition, Spring 2016


Meets: Wednesdays 1-4 pm in GDC 2.502

Instructor: Kristen Grauman
Office: GDC 4.726
Office hours: by appointment (send email)

TA: Kai-Yang Chiang
Office: GDC 4.802D
Office hours: Thursday 10:30 am-12:30 pm

Please use Piazza for assignment questions.

Course overview:

Topic: This is a graduate seminar course in computer vision. We will survey and discuss current vision papers relating to visual recognition (primarily of objects and object categories), auto-annotation of images, and scene understanding. The goals of the course will be to understand current approaches to some important problems, to actively analyze their strengths and weaknesses, and to identify interesting open questions and possible directions for future research.

See the syllabus for an outline of the main topics we'll be covering.

Requirements: Students will be responsible for:

writing two paper reviews each week

posting two short review summaries/discussion points on the course discussion board (Piazza)

participating in discussions during class

completing two programming assignments

presenting ~twice in class (details depending on final enrollment)

completing a project with a partner

Note that presentations are due one week before the slot in which your presentation is scheduled. This means you will need to read the papers, prepare experiments, create slides, etc. more than one week before the date you are signed up for. The idea is to meet and discuss ahead of time, so that we can iterate as needed during the week leading up to your presentation. Please coordinate in advance with the other student presenters on your day to ensure that no single paper receives two experiments or two paper presentations.

More details on the requirements and grading breakdown are here.

Prereqs: Courses in computer vision and/or machine learning (CS 376 Computer Vision and/or CS 391 Machine Learning, or similar); ability to understand and analyze conference papers in this area; programming required for experiment presentations and projects.

Please talk to me if you are unsure whether the course is a good match for your background. I generally recommend scanning through a few papers on the syllabus to gauge what kind of background is expected. I don't assume you are already familiar with every single algorithm/tool/image feature a given paper mentions, but you should feel comfortable following the key ideas.

Syllabus overview:

Instance recognition

Category recognition

Mid-level representations

Object detection

Attributes and parts

Language and vision

Low-supervision learning

Great outdoors

3d scenes and objects

Recognition in action

Noticing and remembering

Social signals

Important dates:

Wednesday, Jan 27: paper topic preferences due

Friday, Feb 19: first coding assignment due

Friday, Mar 4: second coding assignment due

Wednesday, Mar 23: project proposal due

Wednesday, April 13: project draft due

Wednesday, May 4: poster session in GDC 6.516, 1-4 pm

Friday, May 6: final papers due

Schedule and papers:

Date	Topics	Papers and links	Presenters/slides	Items due
Jan 20	Course intro		slides
Jan 27	No class			Topic preferences due via email to Kai by Wed Jan 27. Write "CS381V" in the subject line.
Feb 3	Instance recognition Invariant local features, local feature matching, instance recognition, visual vocabularies and bag-of-words, large-scale mining image credit: Andrea Vedaldi and Andrew Zisserman	Object Recognition from Local Scale-Invariant Features, Lowe, ICCV 1999. [pdf] [code] [other implementations of SIFT] [IJCV] Local Invariant Feature Detectors: A Survey, Tuytelaars and Mikolajczyk. Foundations and Trends in Computer Graphics and Vision, 2008. [pdf] [Oxford code] [Selected pages -- read pp. 178-188, 216-220, 254-255] *Video Google: A Text Retrieval Approach to Object Matching in Videos, Sivic and Zisserman, ICCV 2003. [pdf] [demo] For more background on feature extraction: Szeliski book: Sec 3.2 Linear filtering, 4.1 Points and patches, 4.2 Edges Oxford group interest point software Andrea Vedaldi's VLFeat code, including SIFT, MSER, hierarchical k-means. INRIA LEAR team's software, including interest points, shape features FLANN - Fast Library for Approximate Nearest Neighbors. Marius Muja et al. Google Goggles Kooaba Code for downloading Flickr images, by James Hays UW Community Photo Collections homepage INRIA Holiday images dataset NUS-WIDE tagged image dataset of 269K images MIRFlickr dataset	slides outline	Coding assignment 1 out, due Friday Feb 19.
Feb 10	Category recognition Image descriptors, classifiers, support vector machines, nearest neighbors, convolutional neural networks, large-scale image collections Image credit: ImageNet	ImageNet Large Scale Visual Recognition Challenge. Russakovsky et al. IJCV 2015. [pdf] ImageNet Classification with Deep Convolutional Neural Networks. A. Krizhevsky, I. Sutskever, and G. Hinton. NIPS 2012 [pdf] Beyond Bags of Features: Spatial Pyramid Matching for Recognizing Natural Scene Categories, Lazebnik, Schmid, and Ponce, CVPR 2006. [pdf] [15 scenes dataset] [libpmk] [Matlab] 80 Million tiny images: a large dataset for non-parametric object and scene recognition. A. Torralba, R. Fergus, and W. Freeman. PAMI 2008. [pdf] CNN resources linked from Stanford CS231n course page CNN/NN open source implementations Caffe Torch Theano Pylearn2 TensorFlow cuda-convnet ConvNets for Visual Recognition course, Andrej Karpathy, Stanford Machine learning with neural nets lecture, Geoffrey Hinton Deep learning course, Bhiksha Raj, CMU VGG Net Deep learning in neural networks: an overview, Juergen Schmidhuber. Scenes - PlaceNet VLFeat code LIBPMK feature extraction code, includes dense sampling LIBSVM library for support vector machines PASCAL VOC Visual Object Classes Challenge Deep learning portal, with Theano tutorials Practical tips: Ilya Sutskever blog post Practical tips: Stanford course notes Practical tips: Bengio paper Colah's blog Deep learning blog iPython notebook for Caffe Tips for Caffe OS X El Capitan	slides 1 Intro to categorization and case studies of discriminative models slides 2 handout slides 2 with links Guest lecture on CNNs, Dinesh Jayaraman	Monday Feb 15, 5-7 pm: Hands on CNN/Caffe tutorial, by Dinesh Jayaraman and Yu-Chuan Su. GDC 4.302 (not the usual classroom) Tutorial slides Tutorial code
Feb 17	Mid-level representations Segmentation into regions, contours, grouping, video segmentation, category-independent object proposals, 3d structure Image credit: Pablo Arbelaez et al.	Constrained Parametric Min-Cuts for Automatic Object Segmentation. J. Carreira and C. Sminchisescu. CVPR 2010. [pdf] [code] Selective Search for Object Recognition. J. Uijlings, K. van de Sande, T. Gevers, A. Smeulders. IJCV 2013. [pdf] [project,code] Discriminatively trained dense surface normal estimation. L. Ladicky, B. Zeisl, M. Pollefeys. ECCV 2014. [pdf] Streaming hierarchical video segmentation. C. Xu, C. Xiong, J. Corso. ECCV 2012. [pdf] [code] Fast SLIC superpixels Greg Mori's superpixel code Berkeley Segmentation Dataset and code Pedro Felzenszwalb's graph-based segmentation code Mean-shift: a Robust Approach Towards Feature Space Analysis [pdf] [code, Matlab interface by Shai Bagon] David Blei's Topic modeling code Berkeley 3D object dataset (kinect)	slides Paper-Chun-Chen Kuo Paper-Andrew Sharp Expt-Kim Houck Expt-Chad Voegele	Coding assignment 2 out Monday Feb 22, due Wed March 9 (with follow up due Thurs March 10)
Feb 24	Object detection Localizing objects within an image, efficient search, part-based models, semantic segmentation, voting, context, objects in scenes Image credit: Felzenszwalb et al.	Rich feature hierarchies for accurate object detection and semantic segmentation. R. Girshick et al. CVPR 2014 [pdf] Contextual Priming for Object Detection, A. Torralba. IJCV 2003. [pdf] [web] [code] A Discriminatively Trained, Multiscale, Deformable Part Model, by P. Felzenszwalb, D. McAllester and D. Ramanan. CVPR 2008. [pdf] [code] Hough Forests for Object Detection, Tracking, and Action Recognition. J. Gall et al. PAMI 2011. [pdf] [code] Labelme Database Scene Understanding Symposium Stanford Event Dataset PASCAL VOC Visual Object Classes Challenge Hoggles	slides Paper-Richard Teammco Paper-Huihuang Zheng Expt-Adam Allevato Expt-William Xie	Tuesday March 1, 11 am: UTCS Distinguished Lecture by Prof. Jim Rehg, Georgia Tech. GDC Auditorium
Mar 2	Attributes and parts Visual properties, learning from natural language descriptions, intermediate shared representations Image credit: Lampert et al.	Learning To Detect Unseen Object Classes by Between-Class Attribute Transfer, C. Lampert, H. Nickisch, and S. Harmeling, CVPR 2009 [pdf] [web] [data] Poselets: Body Part Detectors Trained Using 3D Human Pose Annotations. L. Bourdev and J. Malik. CVPR 2009. [pdf] [code] [web] Relative attributes. Parikh and Grauman. ICCV 2011. [pdf] [code/data] Discovering the spatial extent of relative attributes. F. Xiao and Y. J. Lee. ICCV 2015. [pdf] [code] Animals with Attributes dataset aYahoo and aPascal attributes datasets Attribute discovery dataset of shopping categories Public Figures Face database with attributes Relative attributes data WhittleSearch relative attributes data SUN Scenes attribute dataset Cross-category object recognition (CORE) dataset Leeds Butterfly Dataset FaceTracer database from Columbia Caltech-UCSD Birds dataset Database of human attributes More attribute datasets 2014 Workshop on Parts & Attributes	Paper-Ruohan Gao Paper-Akanksha Saran Paper-Zhuode Liu Expt-Aishwarya Padmakumar Expt-Abhishek Sinha Expt-Ashwini Venkatesh	Thursday March 4, 11 am: Talk by Aditya Khosla, MIT. GDC Auditorium Tuesday March 8, 11 am: Talk by Philipp Krahenbuhl, UC Berkeley. GDC Auditorium
Mar 9	Language and vision Image credit: Antol et al.	VQA: Visual Question Answering. Antol et al. ICCV 2015 [pdf][data/code/demo] Sequence to Sequence - Video to Text. S. Venugopalan et al. ICCV 2015 [pdf] [web] [code] Segment-Phrase Table for Semantic Segmentation, Visual Entailment and Paraphrasing. Izadinia, Sadeghi, Divvala, Hajishirzi, Choi, Farhadi. ICCV 2015 [pdf] Ask Your Neurons: A Neural-Based Approach to Answering Questions About Images. Malinowski, Rohrbach, Fritz. ICCV 2015. [pdf] [video] [code/data]	Paper-Tyler Folkman Paper-Edward Banner Paper-Surbhi Goel Expt-Huihuang Zheng Expt-Kunal Lad Guest speaker: Subha Venugopalan	Project proposal and paper guidelines Tuesday March 22, 11 am: Talk by David Fouhey, CMU. GDC Auditorium
Mar 16	No class - spring break
Mar 23	Low-supervision learning Feature learning, semantics learning. Leveraging free or nearly free cues for supervision. Internet data, video, egomotion, context... Image credit: X. Chen et al.	Learning image representations equivariant to ego-motion. Jayaraman and Grauman. ICCV 2015. [pdf] [web] [slides] [data] NEIL: Extracting Visual Knowledge from Web Data, Chen, Shrivastava, and Gupta, ICCV 2013 [pdf] Learning temporal embeddings for complex video analysis. Ramanathan, Tang, Mori, Fei-Fei. ICCV 2015 [pdf] Unsupervised learning of visual representations using videos. X. Wang and A. Gupta. ICCV 2015. [pdf] [code] [web]	Paper-Hilgad Montelo Paper-Chad Voegele Paper-Bo Xiong Expt-Ashish Bora Expt-Ruohan Gao
Mar 30	Great outdoors Linking and visualizing multi-view data from tourist photos, image-based geolocalization, natural scene text detection, discovering correlated non-visual properties in street-side imagery Image credit: T-Y. Lin et al.	Building Rome in a Day, Agarwal et al. CACM 2011. [pdf] [web] [code] Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition. Jaderberg, Simonyan, Vedaldi, Zisserman. NIPS Deep Learning Workshop, 2014. [pdf] [journal paper] Learning Deep Representations for Ground-to-Aerial Geolocalization. T. Lin, Y. Cui, S. Belongie, and J. Hays. CVPR 2015. [pdf] [poster] [slides] City Forensics: Using Visual Elements to Predict Non-Visual City Attributes. Arietta, Efros, Ramamoorthi, Agrawala. Trans. on Visualization and Computer Graphics, 2014. [pdf] [web] Oxford text recognition datasets CVPR 2009 Workshop on Visual Place Categorization UW Community Photo Collections homepage INRIA Holiday images dataset NUS-WIDE tagged image dataset of 269K images MIRFlickr dataset Code for downloading Flickr images, by James Hays	Paper-Manu Agarwal Paper-Kunal Lad Expt-Zhuode Liu Expt-Ruohan Zhang Expt-Richard Teammco
April 6	3d scenes and objects 3d structure (single views, panoramas, RGBD) and scene layout for visual recognition Image credit: Y. Xiang et al.	PanoContext: A Whole-room 3D Context Model for Panoramic Scene Understanding. Y. Zhang, S. Song, P. Tan, J. Xiao. ECCV 2014. [pdf] [data/code] [slides] Data-Driven 3D Voxel Patterns for Object Category Recognition, Y. Xiang, W. Choi, Y. Lin and S. Savarese, CVPR 2015. [pdf] [web/data] [slides] Indoor Segmentation and Support Inference from RGBD Images. N. Silberman, D. Hoiem, P. Kohli, and R. Fergus. ECCV 2012. [pdf] [code/data] [NYU depth dataset] [slides] KITTI Vision Benchmark NYU Depth Dataset Berkeley 3D Object dataset RGB-D Object Dataset from UW	Paper-Adam Allevato Paper-William Xie Expt-Hilgad Montelo Expt-Chun-Chen Kuo Expt-Andrew Sharp
April 13	Recognition in action Learning how to move for recognition, manipulation. 3D objects and the next best view. Image credit: Malmir et al.	Deep Q-learning for active recognition of GERMS: Baseline performance on a standardized dataset for active learning. Malmir et al. BMVC 2015. [pdf] [data] Active Object Recognition using Vocabulary Trees. N. Govender, J. Claassens, P. Torr, J. Warrell. Workshop on Robot Vision, 2013. [pdf] 3D ShapeNets: A Deep Representation for Volumetric Shape Modeling. Wu et al. CVPR 2015. [pdf] [code/data] [slides] GERMS Dataset for active object recognition 3D ShapeNets	Paper-Aishwarya Padmakumar Paper-Ruohan Zhang Paper-Abhishek Sinha Expt-Manu Agarwal Expt-Yinan Zhao
April 20	Noticing and remembering Predicting what gets noticed or remembered in images and video. Saliency, importance, memorability, photography biases. Image credit: T. Liu et al.	Understanding and Predicting Image Memorability at a Large Scale. A. Khosla, S. Raju, A. Torralba, and A. Oliva. ICCV 2015. [pdf] [web] [code/data] Learning video saliency from human gaze using candidate selection. D. Rudoy et al. CVPR 2013 [pdf] [web] [video] [code] Learning to Detect a Salient Object. T. Liu et al. CVPR 2007. [pdf] [results] [data] [code] MIT saliency benchmark Saliency datasets The DIEM Project: visualizing dynamic images and eye movements MIT eye tracking data LaMem Demo LaMem Dataset MSRA salient object database MED video summaries dataset ETHZ video summaries dataset VSUMM dataset for video summarization UT Egocentric dataset / important regions VideoSET summary evaluation data Salient Montages dataset	Paper-Kim Houck Paper-Ashish Bora Expt-Bo Xiong Expt-Akanksha Saran Expt-Tyler Folkman
April 27	Social signals Cues from people in images: body pose, social groups and roles, attention, gaze following, scene structure Image credit: Khosla et al.	Where are they looking? Khosla, Recasens, Vondrick, Torralba. NIPS 2015. [pdf] [demo] [web] People Watching: Human Actions as a Cue for Single View Geometry. Fouhey, Delaitre, Gupta, Efros, Laptev, Sivic. ECCV 2012 [pdf] [journal] [web] [slides] [video] Discovering Groups of People in Images. Choi, Chao, Pantofaru, Savarese. ECCV 2014 [pdf] [web] Face detection code in OpenCV Gallagher's Person Dataset Face data from Buffy episode, from Oxford Visual Geometry Group CALVIN upper-body detector code UMass Labeled Faces in the Wild FaceTracer database from Columbia Database of human attributes Stanford Group Discovery dataset	Paper-Yinan Zhao Paper-Ashwini Venkatesh Expt-Surbhi Goel Expt-Edward Banner	Note April 27/29 deadlines for free poster printing at UTCS See Piazza post for details
May 4	Final project presentations in class	See poster presentation instructions on Piazza.		Final papers and poster reviews due Friday May 6

LDV Vision Challenges

Index of computer vision datasets