Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning

Prateek Jain$^1$, Sudheendra Vijayanarasimhan$^2$, and Kristen Grauman$^2$
Microsoft Research Lab, Bangalore, India$^1$,
University of Texas, Austin, TX, USA$^2$


Goal: For large-scale active learning, we want to repeatedly query annotators to label the most uncertain examples in a massive pool of unlabeled data $\mathcal{U}$.


The margin-based selection criterion for SVMs [Tong & Koller, 2000] selects the point nearest to the current decision boundary: $\mathbf{x}^*=\operatorname{argmin}_{\mathbf{x}_i\in \mathcal{U}}\vert\mathbf{w}^T\mathbf{x}_i\vert$.

Problem: With massive unlabeled pool, we cannot afford exhaustive linear scan.
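For concreteness, the exhaustive baseline looks like the following minimal Python sketch (the name `margin_select` and the list-of-floats feature representation are illustrative assumptions, not from the paper):

```python
def margin_select(w, pool):
    """Exhaustive margin-based selection [Tong & Koller, 2000]:
    return the pool point minimizing |w^T x|.
    Cost is O(|pool| * d) inner products per selection round,
    which is what the hashing schemes below avoid."""
    return min(pool, key=lambda x: abs(sum(wi * xi for wi, xi in zip(w, x))))
```

With $\mathbf{w}=[1,0]$, a pool point orthogonal to $\mathbf{w}$ (i.e. on the decision boundary) is selected first.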

Main Idea: Sub-linear Time Active Selection

Idea: We define two hash function families that are locality-sensitive for the problem of searching for the nearest database point to a hyperplane query. The two variants trade off tightness of error bounds against computational cost.


Main contributions:

- Two locality-sensitive hash families (H-Hash and EH-Hash) for retrieving points near a hyperplane query, with collision-probability guarantees.
- Sub-linear time margin-based active selection for SVMs over massive unlabeled pools.
- Large-scale active learning results on document and image datasets of up to one million examples.

Background: Locality-Sensitive Hashing (LSH)

Let $ d(\cdot,\cdot)$ be a distance function over items from a set $ S$, and for any item $ p \in S$, let $ B(p, r)$ denote the set of examples from $ S$ within radius $ r$ from $ p$.


Let $h_{\mathcal{H}}$ denote a random choice of a hash function from the family $\mathcal{H}$. The family $\mathcal{H}$ is called $(r, r(1+\epsilon), p_1, p_2)$-sensitive for $d(\cdot,\cdot)$ when, for any $q, p \in S$,

- if $p \in B(q, r)$, then $\Pr[h_{\mathcal{H}}(q) = h_{\mathcal{H}}(p)] \geq p_1$;
- if $p \notin B(q, r(1+\epsilon))$, then $\Pr[h_{\mathcal{H}}(q) = h_{\mathcal{H}}(p)] \leq p_2$.


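To make the mechanism concrete, here is a minimal LSH sketch for angular distance, where each hash bit is the sign of a random projection and $k$ bits are concatenated into a bucket key (all names are illustrative, not from the paper):

```python
import random

def sign_hash(u):
    """One random-projection bit: a locality-sensitive hash for angular distance."""
    return lambda p: 1 if sum(a * b for a, b in zip(u, p)) >= 0.0 else 0

def build_table(points, hash_fns):
    """Bucket each database point under its concatenated k-bit key."""
    table = {}
    for p in points:
        table.setdefault(tuple(h(p) for h in hash_fns), []).append(p)
    return table

def query(table, q, hash_fns):
    """Return only the points colliding with q; just one bucket is scanned."""
    return table.get(tuple(h(q) for h in hash_fns), [])

rng = random.Random(0)
hash_fns = [sign_hash([rng.gauss(0.0, 1.0) for _ in range(2)]) for _ in range(4)]
table = build_table([[1.0, 0.0], [-1.0, 0.0]], hash_fns)
```

Close points tend to share a bucket (a point always collides with itself), while an antipodal point flips every bit and lands elsewhere.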
First Solution: Hyperplane Hash

Intuition: To retrieve those points for which $\vert\mathbf{w}^T \mathbf{x}\vert$ is small, we want collisions to be probable for database vectors that are nearly perpendicular to the hyperplane normal $\mathbf{w}$ (assuming normalized data).
For $\mathbf{u}\sim \mathcal{N}(0,I)$, $\Pr[\mathrm{sign}(\mathbf{u}^T\mathbf{w})\neq\mathrm{sign}(\mathbf{u}^T\mathbf{x})]=\frac{1}{\pi} \theta_{\mathbf{w},\mathbf{x}}$, where $\theta_{\mathbf{w},\mathbf{x}}$ is the angle between $\mathbf{w}$ and $\mathbf{x}$ [Goemans & Williamson, 1995].
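This classical fact is easy to check numerically; the sketch below estimates the sign-disagreement probability by Monte Carlo in pure Python (the function name and parameters are illustrative):

```python
import random

def disagree_prob(w, x, trials=200_000, seed=0):
    """Monte Carlo estimate of Pr[sign(u^T w) != sign(u^T x)] for u ~ N(0, I).
    Should approach theta_{w,x} / pi as trials grows."""
    rng = random.Random(seed)
    d = len(w)
    hits = 0
    for _ in range(trials):
        u = [rng.gauss(0.0, 1.0) for _ in range(d)]
        sw = sum(a * b for a, b in zip(u, w)) >= 0.0
        sx = sum(a * b for a, b in zip(u, x)) >= 0.0
        hits += sw != sx
    return hits / trials

# Orthogonal pair: theta = pi/2, so the predicted probability is 1/2.
est = disagree_prob([1.0, 0.0], [0.0, 1.0])
```

Note the probability depends only on the angle, not the vector norms, since scaling a vector does not change the sign of its projection.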

Our idea: Generate two independent random vectors $\mathbf{u}$ and $\mathbf{v}$: one to capture the angle between $\mathbf{w}$ and $\mathbf{x}$, and one to capture the angle between $-\mathbf{w}$ and $\mathbf{x}$.


Definition: We define the H-Hash function family $\mathcal{H}$ as:

$\displaystyle h_{\mathcal{H}}(\mathbf{z})= \begin{cases}h_{\mathbf{u},\mathbf{v}}(\mathbf{z},\mathbf{z}), &\text{if $\mathbf{z}$ is a database point vector,}\\ h_{\mathbf{u},\mathbf{v}}(\mathbf{z},-\mathbf{z}), &\text{if $\mathbf{z}$ is a query hyperplane vector,} \end{cases}$

where $h_{\mathbf{u},\mathbf{v}}(\mathbf{a},\mathbf{b})=[\mathrm{sign}(\mathbf{u}^T\mathbf{a}),\ \mathrm{sign}(\mathbf{v}^T\mathbf{b})]$ is a two-bit hash, and $\mathbf{u},\mathbf{v}\sim \mathcal{N}(0,I)$.
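A minimal Python sketch of this two-bit hash (function names are illustrative; vectors are plain lists of floats):

```python
def _sign(t):
    return 1 if t >= 0.0 else -1

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def h_hash(u, v, z, is_query=False):
    """H-Hash two-bit code:
    database point x  -> [sign(u^T x),  sign(v^T x)]   (second arg is z)
    query hyperplane w -> [sign(u^T w), sign(-v^T w)]  (second arg is -z)"""
    b = [-t for t in z] if is_query else z
    return (_sign(_dot(u, z)), _sign(_dot(v, b)))
```

As a sanity check, a point lying exactly on the query hyperplane, e.g. $\mathbf{x}=[1,1]$ with $\mathbf{w}=[1,-1]$ (so $\mathbf{w}^T\mathbf{x}=0$), can collide with the query under suitable $\mathbf{u},\mathbf{v}$.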


Second Solution: Embedded Hyperplane Hash

Intuition: Design a Euclidean embedding under which minimizing the distance to the embedded query is equivalent to minimizing $\vert\mathbf{w}^T \mathbf{x}\vert$, making existing approximate NN methods applicable.

Definition: We define the EH-Hash function family $\mathcal{E}$ as:

$\displaystyle h_{\mathcal{E}}(\mathbf{z})= \begin{cases}h_{\mathbf{u}}\left(V(\mathbf{z})\right), &\text{if $\mathbf{z}$ is a database point vector,}\\ h_{\mathbf{u}}\left(-V(\mathbf{z})\right), &\text{if $\mathbf{z}$ is a query hyperplane vector,} \end{cases}$

where $V(\mathbf{a}) = \mathrm{vec}(\mathbf{a}\mathbf{a}^T)=\left[a_1^2,\ a_1a_2, \dots, a_1a_d,\ a_2^2,\ a_2a_3, \dots,\ a_d^2\right]$ gives the embedding, and $h_{\mathbf{u}}(\mathbf{b})=\mathrm{sign}(\mathbf{u}^T\mathbf{b})$, with $\mathbf{u}\in \mathbb{R}^{d^2}$ sampled from $\mathcal{N}(0,I)$.
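A sketch of the embedding and the resulting one-bit hash, following the upper-triangular listing of $V(\mathbf{a})$ given above (illustrative Python; names are not from the paper):

```python
def embed(a):
    """V(a): pairwise products a_i * a_j for j >= i,
    listed in the upper-triangular order given in the definition."""
    d = len(a)
    return [a[i] * a[j] for i in range(d) for j in range(i, d)]

def eh_hash(u, z, is_query=False):
    """EH-Hash one-bit code: sign(u^T V(z)) for a database point,
    sign(u^T (-V(z))) for a query hyperplane."""
    e = embed(z)
    if is_query:
        e = [-t for t in e]
    return 1 if sum(x * y for x, y in zip(u, e)) >= 0.0 else -1
```

For example, `embed([1.0, 2.0])` is `[1.0, 2.0, 4.0]`, and negating the embedding for queries flips the bit relative to the same vector hashed as a database point.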


Issue: $V(\mathbf{a})$ is $d^2$-dimensional, so hashing overhead is higher.

Solution: Compute $ h_{\u }(V(\mathbf{a}))$ approximately using randomized sampling:


H-Hash has faster pre-processing, but EH-Hash has stronger bounds.

           | Accuracy                                            | Hashing insertion time
  H-Hash:  | $p_1=\frac{1}{4}-\frac{r}{\pi^2}$                   | $\propto d$
  EH-Hash: | $p_1 \ge 2\left(\frac{1}{4}-\frac{r}{\pi^2}\right)$ | $\propto d^2$ ($d$ with sampling)

Experimental Results

Goal: Show that proposed algorithms can select examples nearly as well as the exhaustive approach, but with substantially greater efficiency.

Newsgroups: 20K documents, bag-of-words features.

Tiny Images: 60K-1M images, Gist features.


Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning,
P. Jain, S. Vijayanarasimhan and K. Grauman, in NIPS 2010
[paper, supplementary]