This directory contains a simple implementation of the Locality-Sensitive Hashing (LSH) algorithm by Indyk et al. The main functions are (see the documentation in each file for more details):

*** lsh.m - creates the LSH data structure for a given data set.

*** lshprep.m, lshfunc.m - these functions, respectively, set up the LSH hashing functions and the hash tables based on these functions. You normally do not need to call them directly (they are called inside lsh.m).

*** lshins.m - inserts data into the LSH data structure. Again, you usually do not need this function since it, too, is called inside lsh.m, but you will need it if you want to insert additional data points after the hash tables are created (a short sketch of such a call appears further below).

*** lshlookup.m - approximate similarity search using the LSH structure.

*** lshstats.m - examines statistics of an LSH data structure.

In addition, you can download a separate directory lshtst, containing a toy data set as well as some hash tables built for it as shown below. The files in lshtst are

*** patches.mat - a sample of 59,500 20x20 greylevel image patches.

*** tables.mat - a few LSH structures built for those patches, as described below.

Note: the tables were built with the same parameters as below, but due to the randomness of the construction algorithm, the details (the hash functions, and consequently the numbers of buckets etc.) are slightly different from those shown here.

*********************************

For a simple example, we will use the data in lshtst/patches.mat: 59,500 20x20 patches, reshaped into 400-dimensional column vectors, taken from a bunch of motorcycle images.

>> load patches;

Build 20 hash tables, using 24-bit keys, under the simple LSH scheme, and index the patches array:

>> T1=lsh('lsh',20,24,size(patches,1),patches,'range',255);

You should see, for each hash table, something like this:

B UNLIMITED
24 bits
20 tables
13390 distinct buckets
Table 1 adding 13390 buckets (now 13390)
Table 1: 59500 elements
10940 distinct buckets
Table 2 adding 10940 buckets (now 10940)
Table 2: 59500 elements
...

(note: the insertion process may take a while...)

This means that the data (to remind you, 59,500 examples) were hashed into 13,390 buckets in this table. That is good, since it *probably* means that the distribution of examples into buckets is reasonable, i.e., most of the buckets contain just a few examples. What you want to avoid is a very low number of buckets: e.g., 200 buckets here would necessarily mean that there are some huge buckets, and at search time those would cancel out the efficiency gains of LSH. If that happens, you probably need to increase the number of bits in the hash keys (try increasing by just a few, say, from 20 to 25).

The number of hash tables is another parameter to play with. For example, in this case we can probably decrease it to, say, 10, and still get good results. Mostly, this is a matter of empirical performance testing with the data at hand.

The line

Table 1: 59500 elements

indicates that the number of distinct examples indexed by this table is 59,500 (in this case, this means that all the original examples are indexed in this table). This will always be the case if the bucket capacity is unlimited (B=inf). However, if you specify a finite B (see lsh.m), this number may be somewhat lower than the number of examples; if that happens, it means that some of the buckets would have more than B examples hashed into them. This often means trouble (see above about large buckets), and can be resolved by increasing the number of bits.
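As mentioned in the function list above, data points that become available after the structure has been built can be added with lshins. A minimal sketch follows; it assumes the call form T = lshins(T, newdata), where newdata holds new column vectors of the same dimension as the indexed data - check the documentation in lshins.m for the exact calling convention. The variable newpatches is hypothetical.

>> newpatches = round(255*rand(400,50));   % hypothetical batch of 50 new 20x20 patches as columns
>> T1 = lshins(T1, newpatches);            % insert them into the existing hash tables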
You can review the statistics of an LSH data structure in more detail by calling

>> lshstats(T1);
20 tables, 24 bits
Table 1: 59500 in 13390 bkts, med 1, max 2229, avg 518.46
Table 2: 59500 in 10940 bkts, med 1, max 4205, avg 863.20
Table 3: 59500 in 13533 bkts, med 1, max 2435, avg 485.97

This means that, e.g., table 1 has
* 13,390 buckets,
* a median occupancy of 1 example per bucket,
* a maximum occupancy of 2,229 examples,
* an expected bucket occupancy of 518.46.

The expectation is taken with respect to a randomly drawn point: with a single huge bucket, there is a large probability that a point falls into that bucket, so it contributes a lot to the expected occupancy. This is not the same as taking the mean of bucket occupancy! (A small numerical sketch of the difference appears a bit further below.)

We can analyze the tables in a bit more detail:

>> lshstats(T1,100);
20 tables, 24 bits
Table 1: 59500 in 13390 bkts, med 1, max 2229, avg 518.46, 56 (27607)> 100
Table 2: 59500 in 10940 bkts, med 1, max 4205, avg 863.20, 55 (29437)> 100
Table 3: 59500 in 13533 bkts, med 1, max 2435, avg 485.97, 55 (26600)> 100
Table 4: 59500 in 15199 bkts, med 1, max 6185, avg 997.58, 50 (25616)> 100
Table 5: 59500 in 13089 bkts, med 1, max 2900, avg 486.65, 68 (27759)> 100
Table 6: 59500 in 16312 bkts, med 1, max 6212, avg 997.58, 51 (24616)> 100
Table 7: 59500 in 14974 bkts, med 1, max 2797, avg 565.61, 47 (25827)> 100
Table 8: 59500 in 13302 bkts, med 1, max 2875, avg 566.96, 61 (28507)> 100
Table 9: 59500 in 13649 bkts, med 1, max 3276, avg 705.35, 43 (26762)> 100
Table 10: 59500 in 15024 bkts, med 1, max 2460, avg 518.14, 49 (25488)> 100
Table 11: 59500 in 12071 bkts, med 1, max 3450, avg 647.23, 56 (29121)> 100
Table 12: 59500 in 12821 bkts, med 1, max 2331, avg 533.25, 62 (27829)> 100
Table 13: 59500 in 12717 bkts, med 1, max 3410, avg 595.69, 56 (28003)> 100
Table 14: 59500 in 13822 bkts, med 1, max 3622, avg 581.17, 58 (26503)> 100
Table 15: 59500 in 13647 bkts, med 1, max 4846, avg 819.41, 57 (27857)> 100
Table 16: 59500 in 11779 bkts, med 1, max 1815, avg 470.46, 61 (28286)> 100
Table 17: 59500 in 13478 bkts, med 1, max 4224, avg 695.39, 54 (28806)> 100
Table 18: 59500 in 14067 bkts, med 1, max 2607, avg 543.24, 55 (26250)> 100
Table 19: 59500 in 11399 bkts, med 1, max 4317, avg 844.15, 51 (29223)> 100
Table 20: 59500 in 14701 bkts, med 1, max 3060, avg 555.39, 56 (27413)> 100

This, in addition to the information explained above, tells us that in Table 1 there are 56 buckets with more than 100 elements, holding a total of 27,607 elements - possibly not good news, since it means that the distribution of data into buckets may not be very uniform after all: nearly half the data are stuck in relatively few large buckets. This is likely a consequence of some structure in the data (i.e., the data themselves are not uniform). The main practical implication is, again, a potential increase in the time it takes to look up data in the table. One way to address this is, again, to increase k; another way could be to tell lshlookup not to exhaustively look at all the elements in the matching bucket (see below).
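To make the difference between the expected occupancy reported by lshstats and the plain mean of bucket sizes concrete, here is the small sketch promised above; the bucket occupancies are made up for illustration, not taken from the tables:

>> sz = [ones(1,100) 1000];   % 100 singleton buckets plus one huge bucket
>> meanOcc = mean(sz)         % plain mean of bucket occupancy: about 10.9
>> p = sz / sum(sz);          % probability that a randomly drawn point falls in each bucket
>> expOcc = sum(p .* sz)      % point-weighted expected occupancy: about 909

A single bucket holding 1,000 of the 1,100 points dominates the expectation, which is exactly why large buckets hurt lookup time even when the mean occupancy looks harmless.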
Before trying either of those fixes, we can run an even more detailed diagnostic: test the performance of the tables on a data set. Below we do that by running lookup on the first 1,000 patches, asking for at least two NN (since the reference data contain the same patches, we know that there is always at least one NN).

>> lshstats(T1,'test',patches,patches(:,1:1000),2);
20 tables, 24 bits
Table 1: 59500 in 13390 bkts, med 1, max 2229, avg 518.46
...
Total 59500 elements
Running test...10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
# of comparisons: mean 2554.16, max 9779, failures: 4

So, on average we look at more than 2,500 examples for each lookup - about 4.3% of the data. That's not very good. On the other hand, we have almost no failures. We could rebuild the LSH structure with more bits, but first we will try something else: reducing the number of tables.

>> lshstats(T1(1:5),'test',patches,patches(:,1:1000),2);
...
Running test...10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
# of comparisons: mean 1457.84, max 7699, failures: 53

This shows a typical tradeoff: we now have about a 2.5% comparison rate, but about a 5% lookup failure rate. Going back up to a few more tables:

>> lshstats(T1(1:10),'test',patches,patches(:,1:1000),2);
...
Running test...10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
# of comparisons: mean 1956.60, max 9074, failures: 13

This seems fairly good. Now let's try to increase k:

>> T2=lsh('lsh',20,50,size(patches,1),patches,'range',255);
>> lshstats(T2,'test',patches,patches(:,1:1000),2);
...
Running test...10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
# of comparisons: mean 622.79, max 5376, failures: 366

The number of comparisons is much better, but the number of failures is too high. We can try adding tables:

>> T2(21:40) = lsh('lsh',20,50,size(patches,1),patches,'range',255);
>> lshstats(T2,'test',patches,patches(:,1:1000),2);
...
Running test...10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
# of comparisons: mean 925.35, max 6197, failures: 312

That doesn't help much. We could try increasing L further; however, for now we will just stick with L=10, k=24.

We will now investigate the quality of the lookup. Let the 50-th patch be the test sample:

>> figure(1);imagesc(reshape(patches(:,50),20,20));colormap gray;axis image

We can search the LSH data structure for the 11 nearest neighbors under the L1 norm (we really want the 10-NN, but the first neighbor returned will, of course, be the 50-th patch itself, so we ask for 11 to get 10 real neighbors):

>> tic; [nnlsh,numcand]=lshlookup(patches(:,50),patches,T2,'k',11,'distfun','lpnorm','distargs',{1});toc
Elapsed time is 0.042590 seconds.

Show the 10-NN:

>> figure(2);clf;
>> for k=1:10, subplot(2,5,k);imagesc(reshape(patches(:,nnlsh(k+1)),20,20)); colormap gray;axis image; end

You can now also do the exhaustive search and compare the results:

>> tic;d=sum(abs(bsxfun(@minus,patches(:,50),patches))); [ignore,ind]=sort(d);toc;
Elapsed time is 0.449682 seconds.
>> figure(3);clf;
>> for k=1:10, subplot(2,5,k);imagesc(reshape(patches(:,ind(k+1)),20,20));colormap gray;axis image; end

You should see, among the 10-NN, many of the same examples, although not necessarily exactly the same, since LSH provides only an approximate search method. If the results of LSH are significantly inferior to those of the exact brute-force search, something went wrong along the way.
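Rather than judging the two figure windows by eye, you can also quantify the overlap directly. The snippet below is our addition, not part of the toolbox: it counts how many of the exact 10-NN found by the brute-force search also appear in the list returned by lshlookup, using the nnlsh and ind variables from the commands above (it assumes both lists have at least 11 entries).

>> common = intersect(nnlsh(2:11), ind(2:11));   % skip entry 1 of each list (the query patch itself)
>> fprintf('LSH recovered %d of the 10 true nearest neighbors\n', numel(common));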
**************************

We can do the same for the e2lsh scheme:

>> Te=lsh('e2lsh',50,30,size(patches,1),patches,'range',255,'w',-4);

I.e., use 30 projections per function, aiming to divide each projection into 4 intervals (see the comments in lshfunc.m).

>> lshstats(Te,'test',patches,patches(:,1:1000),2);
...
Running test...10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
# of comparisons: mean 2294.09, max 11596, failures: 12

Note: using the L2 norm this time, we can repeat the lookup and the exhaustive search:

>> tic; [nnlsh,numcand]=lshlookup(patches(:,50),patches,Te,'k',11,'distfun','lpnorm','distargs',{2});toc
>> tic;d=sqrt(sum(bsxfun(@minus,patches(:,50),patches).^2)); [ignore,ind]=sort(d);toc;

Note that in this case the neighbors returned by LSH are exactly the same as the true 10-NN (of course, this is not guaranteed to happen for every query and every number of neighbors).
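A single query only tells you so much; to get a rough recall estimate over many queries, you can wrap the same calls in a short script. The sketch below is our addition (the choice of 100 test queries is arbitrary): for each query patch it finds the exact L2 nearest neighbor, excluding the patch itself, and checks whether lshlookup on the Te structure returns it.

nq = 100;                                         % number of test queries (arbitrary choice)
hits = 0;
for i = 1:nq
  q = patches(:,i);
  d = sqrt(sum(bsxfun(@minus, q, patches).^2));   % exact L2 distances to all patches
  d(i) = inf;                                     % exclude the query itself
  [ignore, trueNN] = min(d);
  % ask for 2 neighbors, since the query itself is also indexed in Te
  nn = lshlookup(q, patches, Te, 'k', 2, 'distfun', 'lpnorm', 'distargs', {2});
  hits = hits + any(nn == trueNN);
end
fprintf('true 1-NN recovered for %d of %d queries\n', hits, nq);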