**Prateek Jain, Sudheendra Vijayanarasimhan, and Kristen Grauman**
*Microsoft Research Lab, Bangalore, India*

*University of Texas at Austin, TX, USA*

The margin-based selection criterion for SVMs [Tong & Koller, 2000] selects the point nearest to the current decision boundary:

$$x^* = \operatorname*{argmin}_{x_i \in \mathcal{U}} \left| w^\top x_i \right|,$$

where $w$ is the current classifier and $\mathcal{U}$ is the unlabeled pool.

Problem: With massive unlabeled pool, we cannot afford exhaustive linear scan.
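For concreteness, a minimal numpy sketch of that exhaustive scan (array shapes and names are illustrative, not from the paper):

```python
import numpy as np

def margin_select(w, pool):
    """Exhaustive margin criterion [Tong & Koller, 2000]: return the
    index of the pool point nearest the hyperplane, argmin_i |w . x_i|.
    This is the O(n d) linear scan the hashing scheme avoids."""
    return int(np.argmin(np.abs(pool @ w)))

# Every selection round touches all n points -- too slow at pool sizes
# like the 1M images used in the experiments.
rng = np.random.default_rng(0)
pool = rng.standard_normal((100_000, 128))   # hypothetical unlabeled pool
w = rng.standard_normal(128)                 # current SVM weight vector
print(margin_select(w, pool))
```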

Our approach:

- Offline: Hash the unlabeled data into hash tables.
- Online: Hash the current classifier as a "query" to directly retrieve the next examples for labeling.

Main contributions:

- Novel hash functions that map a query hyperplane to nearby points in sub-linear time.
- Bounds showing these hash families are locality-sensitive for angles close to perpendicular.
- Large-scale pool-based active learning results for documents and images, with up to one million unlabeled points.

Definition:

Let $h \sim \mathcal{H}$ denote a random choice of a hash function from the family $\mathcal{H}$. The family $\mathcal{H}$ is called $(r, r(1+\epsilon), p_1, p_2)$-sensitive for the distance $d(\cdot, \cdot)$ when, for any $x, q$:

- if $d(x, q) \le r$ then $\Pr[h(x) = h(q)] \ge p_1$,
- if $d(x, q) \ge r(1+\epsilon)$ then $\Pr[h(x) = h(q)] \le p_2$.

Algorithm:

- Compute $k$-bit hash keys for each point $x$: $g(x) = \left[h_1(x), h_2(x), \ldots, h_k(x)\right]$.
- Given a query $q$, search over examples in the buckets to which $q$ hashes.
- Use $l$ hash tables for $n$ points, where $l = n^{\rho}$, $\rho = \frac{\log p_1}{\log p_2}$.
- A $(1+\epsilon)$-approximate solution is retrieved in time $O(n^{\rho})$, with $\rho < 1$ (see the sketch below).
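A compact sketch of this standard LSH pipeline, using the classic random-projection sign bit as a placeholder hash function; the hyperplane-aware families defined next plug into the same table structure. All names and parameters are illustrative:

```python
import numpy as np
from collections import defaultdict

def sign_bits(X, U):
    """k random-projection sign bits per row of X, as an (n, k) 0/1 array."""
    return (X @ U.T > 0).astype(np.uint8)

def build_tables(X, n_tables, k, rng):
    """Offline step: hash all n points into n_tables tables of k-bit keys."""
    tables = []
    for _ in range(n_tables):
        U = rng.standard_normal((k, X.shape[1]))  # k hash functions h_1..h_k
        buckets = defaultdict(list)
        for i, row in enumerate(sign_bits(X, U)):
            buckets[bytes(row)].append(i)         # key g(x) = [h_1(x),...,h_k(x)]
        tables.append((U, buckets))
    return tables

def query(q, tables):
    """Online step: examine only the buckets the query hashes into."""
    candidates = set()
    for U, buckets in tables:
        candidates.update(buckets.get(bytes(sign_bits(q[None, :], U)[0]), ()))
    return candidates
```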

Our idea: Generate two independent random vectors $u$ and $v$: one to capture the angle between $x$ and $w$, and one to capture the angle between $x$ and $-w$.

Definition: We define the H-Hash function family $\mathcal{H}$ as:

$$h_{\mathcal{H}}(z) = \begin{cases} h_{u,v}(z, z) & \text{if } z \text{ is a database point,} \\ h_{u,v}(z, -z) & \text{if } z \text{ is a query hyperplane,} \end{cases}$$

where $h_{u,v}(a, b) = \left[\operatorname{sign}(u^\top a), \operatorname{sign}(v^\top b)\right]$ is a two-bit hash, and $u, v \sim \mathcal{N}(0, I)$.
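A small sketch of one such two-bit function under this definition, assuming plain numpy vectors (names are illustrative):

```python
import numpy as np

def h_hash(z, u, v, is_query):
    """One H-Hash function h_{u,v}: a two-bit code.
    Database point x:   [sign(u.x), sign(v.x)]
    Query hyperplane w: [sign(u.w), sign(-v.w)]
    Collisions are most likely for points nearly perpendicular to w."""
    a, b = (z, -z) if is_query else (z, z)
    return (u @ a > 0, v @ b > 0)

rng = np.random.default_rng(1)
d = 50
u, v = rng.standard_normal(d), rng.standard_normal(d)
x = rng.standard_normal(d)   # database point
w = rng.standard_normal(d)   # hyperplane normal used as the query
print(h_hash(x, u, v, is_query=False), h_hash(w, u, v, is_query=True))
```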

Analysis:

- The probability of collision between a query hyperplane $w$ and a database point $x$ is given by $\Pr[h_{\mathcal{H}}(w) = h_{\mathcal{H}}(x)] = \frac{1}{4} - \frac{1}{\pi^2}\left(\theta_{x,w} - \frac{\pi}{2}\right)^2$ (derivation below).
- Hence, H-Hash can return a point $x$ for which $d_\theta(x, w) = \left(\theta_{x,w} - \frac{\pi}{2}\right)^2 \le (1+\epsilon)\,r$ in sub-linear time $O(n^{\rho})$, where $\rho = \frac{\log\left(\frac{1}{4} - \frac{r}{\pi^2}\right)}{\log\left(\frac{1}{4} - \frac{(1+\epsilon)\,r}{\pi^2}\right)} < 1$.
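The step behind this formula, filled in: the two bits use independent Gaussian vectors, each obeys the classic random-hyperplane collision bound $\Pr[\operatorname{sign}(u^\top a) = \operatorname{sign}(u^\top b)] = 1 - \theta_{a,b}/\pi$, and the angle between $x$ and $-w$ is $\pi - \theta_{x,w}$:

```latex
\begin{align*}
\Pr[h_{\mathcal{H}}(w) = h_{\mathcal{H}}(x)]
  &= \Pr\!\left[\operatorname{sign}(u^\top w) = \operatorname{sign}(u^\top x)\right]
     \Pr\!\left[\operatorname{sign}(-v^\top w) = \operatorname{sign}(v^\top x)\right] \\
  &= \left(1 - \frac{\theta_{x,w}}{\pi}\right)\frac{\theta_{x,w}}{\pi}
   = \frac{1}{4} - \frac{1}{\pi^2}\left(\theta_{x,w} - \frac{\pi}{2}\right)^2 .
\end{align*}
```

The probability peaks at $1/4$ when $\theta_{x,w} = \pi/2$, i.e., exactly when $x$ lies on the hyperplane.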

Intuition: Design a Euclidean embedding after which minimizing the Euclidean distance is equivalent to minimizing $|w^\top x|$, making existing approximate nearest-neighbor methods applicable.

Definition: We define the EH-Hash function family $\mathcal{EH}$ as:

$$h_{\mathcal{EH}}(z) = \begin{cases} h_u\!\left(V(z)\right) & \text{if } z \text{ is a database point,} \\ h_u\!\left(-V(z)\right) & \text{if } z \text{ is a query hyperplane,} \end{cases}$$

where $V(a) = \operatorname{vec}\!\left(a a^\top\right) = \left[a_1^2,\, a_1 a_2,\, \ldots,\, a_1 a_d,\, a_2 a_1,\, \ldots,\, a_d^2\right]$ gives the embedding, and $h_u(b) = \operatorname{sign}(u^\top b)$, with $u$ sampled from $\mathcal{N}(0, I_{d^2})$.
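A minimal numpy sketch of this embedding and hash (names are illustrative):

```python
import numpy as np

def V(a):
    """Embedding V(a) = vec(a a^T); then V(x) . V(w) = (x . w)^2."""
    return np.outer(a, a).ravel()

def eh_hash(z, u, is_query):
    """One EH-Hash bit: sign(u . V(x)) for a database point x,
    sign(u . (-V(w))) for a query hyperplane w, with u ~ N(0, I_{d^2})."""
    e = -V(z) if is_query else V(z)
    return u @ e > 0

rng = np.random.default_rng(2)
d = 30
u = rng.standard_normal(d * d)
x = rng.standard_normal(d); x /= np.linalg.norm(x)   # database point
w = rng.standard_normal(d); w /= np.linalg.norm(w)   # query hyperplane
print(eh_hash(x, u, is_query=False), eh_hash(w, u, is_query=True))
```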

Analysis:

- Since $V(x)^\top V(w) = (x^\top w)^2$, the distance between the embeddings of $x$ and $w$ is proportional to the desired distance (for unit vectors, $\|V(x) - (-V(w))\|^2 = 2 + 2\,(x^\top w)^2$), so a standard LSH function is applicable.
- We have $\Pr[h_{\mathcal{EH}}(w) = h_{\mathcal{EH}}(x)] = \frac{1}{\pi}\cos^{-1}\!\left((x^\top w)^2\right)$ for unit vectors (derived below). Hence, sub-linear time search with a $p_1$ about twice that guaranteed by H-Hash.
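Why this probability holds, spelled out: the sign hash obeys the random-hyperplane bound, and for unit-norm $x, w$ we have $\|V(a)\| = \|a\|^2 = 1$:

```latex
\begin{align*}
\cos \theta_{V(x),\, -V(w)}
  &= \frac{V(x)^\top \left(-V(w)\right)}{\|V(x)\|\,\|V(w)\|}
   = -(x^\top w)^2, \\
\Pr[h_{\mathcal{EH}}(w) = h_{\mathcal{EH}}(x)]
  &= 1 - \frac{1}{\pi}\cos^{-1}\!\left(-(x^\top w)^2\right)
   = \frac{1}{\pi}\cos^{-1}\!\left((x^\top w)^2\right).
\end{align*}
```

At $\theta_{x,w} = \pi/2$ this peaks at $1/2$, versus $1/4$ for H-Hash, which is the sense in which $p_1$ roughly doubles.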

Issue: $V(z)$ is $d^2$-dimensional, which incurs higher hashing overhead.

Solution: Compute $u^\top V(a)$ approximately using randomized sampling:

$$u^\top V(a) = \sum_{i,j} u_{ij}\, a_i a_j = a^\top U a,$$

where $U$ is $u$ reshaped to $d \times d$; the sum is estimated from a sampled subset of its $d^2$ terms.
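A sketch of one simple unbiased variant of this idea, uniform sampling of $t$ of the $d^2$ terms with rescaling; the paper's actual estimator and its error analysis may differ in details, and all names here are illustrative:

```python
import numpy as np

def sampled_embedding_dot(u_mat, a, t, rng):
    """Estimate u^T V(a) = sum_{ij} u_ij a_i a_j from t uniformly
    sampled terms, rescaled by d^2 / t (a simple unbiased estimator)."""
    d = a.shape[0]
    idx = rng.integers(0, d, size=(t, 2))   # t random (i, j) index pairs
    terms = u_mat[idx[:, 0], idx[:, 1]] * a[idx[:, 0]] * a[idx[:, 1]]
    return terms.sum() * (d * d) / t

rng = np.random.default_rng(3)
d, t = 200, 2000
u_mat = rng.standard_normal((d, d))          # u reshaped to d x d
a = rng.standard_normal(d); a /= np.linalg.norm(a)
exact = a @ u_mat @ a                        # u^T V(a) = a^T U a
print(exact, sampled_embedding_dot(u_mat, a, t, rng))
```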

H-Hash has faster pre-processing, but EH-Hash has stronger bounds.

| | Accuracy ($p_1$ at distance $r$) | Hashing insertion time |
|---|---|---|
| H-Hash | $\frac{1}{4} - \frac{r}{\pi^2}$ | $O(d)$ |
| EH-Hash | $\frac{1}{\pi}\cos^{-1}\!\left(\sin^2\!\sqrt{r}\right)$ | $O(d^2)$ (reduced with sampling) |

Newsgroups: 20K documents, bag-of-words features.

Tiny Images: 60K-1M images, Gist features.

- Accounting for both selection and labeling time, our approach performs better than either random selection or exhaustive active selection.
- Trade-offs confirmed in practice: H-Hash faster, EH-Hash more accurate.
- In future work, we plan to explore extensions for non-linear kernels.

**Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning**, P. Jain, S. Vijayanarasimhan, and K. Grauman, NIPS 2010.

[paper, supplementary]