Learning the Easy Things First: Self-Paced Visual Category Discovery


Yong Jae Lee and Kristen Grauman

University of Texas at Austin






Objects vary in their visual complexity, yet existing discovery methods perform “batch” clustering, paying equal attention to all instances simultaneously---regardless of the strength of their appearance or context cues.  We propose a self-paced approach that instead focuses on the easiest instances first, and progressively expands its repertoire to include more complex objects.  Easier regions are defined as those with both high likelihood of generic objectness and high confidence of surrounding familiar object context.  At each cycle of the discovery process, we re-estimate the easiness of each subwindow in the pool of unlabeled images, and then retrieve a single prominent cluster from among the easiest instances.  Critically, as the system gradually accumulates models, each new (more difficult) discovery benefits from the context provided by earlier discoveries.  Our experiments demonstrate the clear advantages of self-paced discovery relative to conventional batch approaches, including both more accurate summarization as well as stronger predictive models for novel data.                      





Our approach iterates over four main steps:


(1)  Identifying the easy instances among the image regions in U.

(2)  Discovering the next prominent group of easy regions.

(3)  Training a model with the discovered category to detect harder instances in the data, moving results to D.

(4)  Revising the object-level context for all regions in U according to the most recent discovery.


D :  the discovered windows that have been assigned to a cluster.

U :  the undiscovered windows that remain in the general unlabeled pool.

Ct : the set of familiar categories at time t.


Identifying Easy Objects




We identify the easiest instances among U according to both low-level image properties and the current familiar classes in C.  We define an “easiness” function

that scores a window w based on how likely it is to contain an object (“objectness”, Obj) and to what extent it is surrounded by familiar objects (“context-awareness”, CA).  We compute “objectness” using the measure developed in [Alexe et al., CVPR 2010].  We compute “context-awareness” by averaging the familiarity scores (i.e., likelihood of belonging to a familiar category) of the surrounding superpixels, giving more weight to those that are spatially near. 


The figure on the right shows randomly selected examples among the easiest and hardest instances chosen by our algorithm.  Our method is able to bypass the hard or noisy instances, and focus on the easiest ones first.  Note that a region with high objectness can yield low easiness if its context is yet unfamiliar (e.g., see red boxes in (b)).


Single Prominent Category Discovery


To compute the similarity between two windows wi and wj, we use the combined kernel:


Under this kernel, easy instances with both similar appearance and context are most likely to be grouped together.



To describe appearance A, we use color histograms, spatial pyramid bag-of-words, and pHOG.  To describe context Gt at iteration t, we use our object-graph descriptor [Lee and Grauman, CVPR 2010].  The descriptor expands at each iteration to reflect the most recent discovered category.                 


At each iteration, we want to expand the pool of discovered categories with a single “prominent” cluster.  By seeking a single new cluster, we can conservatively identify the most obvious new group; further, we can incrementally refine the context model most quickly for future discoveries.  We perform complete-link agglomerative clustering and stop merging when the distance between the most similar (yet-unclustered) instances becomes too large.  Among these results, we select the “best” cluster based on its silhouette coefficient and refine the selected instances with Single-Cluster Spectral Graph Partitioning (SCSGP) [Olson et al., RSS 2005].


Discovered Category Knowledge Expansion




Intra-Category Model Expansion


The initially discovered easier instances yield a model that can be used to detect the harder instances, which would not have clustered well due to their appearance or different context.  We use instances in the newly discovered category to train a one-class SVM based on their appearance representation (no context).  Then, we apply the classifier to all remaining windows in U, merge the positively classified instances with the discovered category, and move them to D.



Object-Level Context Expansion


The expansion of the context model based on the discovered categories can help to discover certain harder ones.  With each discovery, the set of familiar categories expands.  Thus, for every window remaining in U, we revise its object-graph Gt to form Gt+1, augmenting it with class affinities for the discovered category, per spatial component.  This enriches the object-level context, altering both the feature space and the easiness scores.


Iterative Discovery Loop


Finally, having augmented Ct with the newly discovered category, we proceed to discover the next easiest category.   Note that the easiness scores evolve at each iteration of the discovery loop as more objects become familiar.  Further, we loosen the “easiest” criterion over time, which gives the algorithm the opportunity to discover harder categories in later iterations, when context models are potentially richer.  We iterate the process until all remaining instances are too hard.






We tested our method on the MSRC-v0 dataset, which contains 21 object classes.  The figure above shows cluster purity as a function of percentage of unique object instances discovered.  We compare our method to a batch clustering baseline, and to a “hardest first” baseline that focuses on the hardest instance first but otherwise follows our pipeline.  Our approach produces significantly more accurate clusters than either baseline, while selectively ignoring instances that cannot be grouped well.


*To ensure that the batch baseline is competitive, we give it the non-overlapping windows with the highest objectness score per image.




This figure shows examples of the discovered categories.  The numbers indicate the iteration when that discovery was made.





The bar plot shows comparison to [Russell et al., CVPR 2006] and [Lee and Grauman, CVPR 2010], which are state-of-the-art batch discovery algorithms.  By accounting for the easiest instances first, our method summarizes the data more accurately than either baseline.




Finally, the table shows how well our discovered categories generalize to novel images.  We train one-vs-one SVM classifiers for all discovered categories using the averaged appearance kernels.  To coarsely simulate obtaining labels from a human annotator, we label all instances in a cluster according to the ground-truth majority instance.  In addition to the baselines from above, we compare to two “upper bounds” in which the ground truth labels on all instances are used to train a nearest-neighbor (NN) and SVM classifier. 


As expected, the fully supervised methods provide the highest accuracy, yet at the cost of significant human effort (one label per training window).  On the other hand, our method requires a small fraction of the labels (one per discovered category), yet still achieves accuracy fairly competitive with the supervised methods, and substantially better than either the batch or hardest-first baselines.




Learning the Easy Things First: Self-Paced Visual Category Discovery [pdf]

Yong Jae Lee and Kristen Grauman
To appear, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Colorado Springs, CO, June 2011.