Prateek Jain , Sudheendra Vijayanarasimhan
, Sudheendra Vijayanarasimhan , and Kristen Grauman
, and Kristen Grauman 
  Microsoft Research Lab, Bangalore, INDIA ,
,
  University of Texas, Austin, TX, USA 
 .
.
 
 
Problem: With massive unlabeled pool, we cannot afford exhaustive linear scan.
 
Main contributions:
 be a distance function over items from a set
 be a distance function over items from a set  , and for any item
, and for any item  , let
, let  denote the set of examples from
 denote the set of examples from  within radius
 within radius  from
 from  .
.
Definition:
Let 
 denote a random choice of a hash function from the family
 denote a random choice of a hash function from the family  . The family
. The family  is called
 is called 
 sensitive for
sensitive for 
 when, for any
 when, for any 
 ,
,
 then
 then 
![$ \Pr [ h_{\cal{H}}(q) = h_{\cal{H}}(p)] \geq p_1$](img16.png) ,
,
 then
 then 
![$ \Pr[ h_{\cal{H}}(q) = h_{\cal{H}}(p)] \leq p_2$](img18.png) .
.
Algorithm:
 -bit hash keys for each point
-bit hash keys for each point  :
: 
![$ \left[h_{\cal{H}}^{(1)}(p_i),h_{\cal{H}}^{(2)}(p_i),\dots,h_{\cal{H}}^{(k)}(p_i)\right]$](img21.png) .
.
   , search over examples in the
, search over examples in the  buckets to which
 buckets to which  hashes.
 hashes.
  
 hash tables for
 hash tables for  points, where
 points, where 
 ,
,
 -approximate solution is retrieved in time
-approximate solution is retrieved in time 
 .
.
  
 is small, we want collisions to be probable for vectors perpendicular to hyperplane normal (assuming normalized data).
 is small, we want collisions to be probable for vectors perpendicular to hyperplane normal (assuming normalized data).
 
 ,
, 
 sign
sign sign
sign![$ (\u ^T\mathbf{x})]=\frac{1}{\pi} \theta_{\mathbf{w},\mathbf{x}}$](img33.png) [Goemans & Williamson, 1995].
 [Goemans & Williamson, 1995].
Our idea: Generate two independent random vectors  and
 and  : one to capture angle between
: one to capture angle between 
 and
 and 
 , and one to capture angle between
, and one to capture angle between 
 and
 and 
 .
. 
 
    
Definition: We define H-Hash function family 
 as:
 as:
|  | 
 sign
sign sign
   sign![$ (\v ^T\b )],$](img43.png) is a two-bit hash, and
 is a two-bit hash, and 
 .
.
Analysis:
 and
 and 
 is given by
 is given by
 
 in sub-linear time
 in sub-linear time 
 .
.
Intuition:  Design Euclidean embedding after which minimizing distance is equivalent to minimizing 
 , making existing approx. NN methods applicable.
, making existing approx. NN methods applicable.
Definition: We define EH-Hash function family 
 as:
 as:
|  | 
![$ V(\mathbf{a}) = {\text vec}(\mathbf{a}\mathbf{a}^T)=\left[a_1^2, a_1a_2, \dots, a_1a_d,  a_2^2, a_2a_3, \dots,  a_d^2\right]$](img50.png) gives the embedding, and
 gives the embedding, and 
 sign
sign , with
, with 
 sampled from
 sampled from 
 .
.
Analysis:
 , distance between embeddings of
, distance between embeddings of 
 and
 and 
 proportional to desired distance, so standard LSH function
 proportional to desired distance, so standard LSH function 
 applicable.
 applicable.
 . Hence, sub-linear time search with about twice the
. Hence, sub-linear time search with about twice the  guaranteed by H-Hash.
 guaranteed by H-Hash.
Issue: 
 is
 is  -dimensional, higher hashing overhead.
-dimensional, higher hashing overhead.
Solution: Compute 
 approximately using randomized sampling:
 approximately using randomized sampling:
| Accuracy | Hashing insertion time | |
| H-Hash: |  |  | 
|---|---|---|
| EH-Hash: |  |  (  with sampling) | 
 
 
 
 
Hashing Hyperplane Queries to Near Points with Applications to Large-Scale Active Learning, 
P. Jain, S. Vijayanarasimhan and K. Grauman, in NIPS 2010 
[paper, supplementary]